For those of us working with real data, data validation and cleansing is the most time-consuming part of the work. Data validation is an important part of any statistical modeling effort, since invalid data produces invalid models no matter how sophisticated your modeling techniques are. Many of us rely on our own code to go through the data and validate it. My question is: has anyone come across a software package that helps with data validation? I realize that a general rule-based package could be used for this, but is there something built specifically for data validation?
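The home-grown, rule-based approach described above can be sketched in a few lines of plain Python. The field names and rules below are made-up examples, not a reference to any particular package:

```python
# A minimal sketch of home-grown rule-based validation, the kind of
# ad hoc code many of us write by hand. Rules are (name, predicate)
# pairs; every record is checked against every rule.

def validate(records, rules):
    """Return a list of (record_index, rule_name) for each violation."""
    violations = []
    for i, record in enumerate(records):
        for name, check in rules:
            if not check(record):
                violations.append((i, name))
    return violations

# Hypothetical example rules for illustration only.
rules = [
    ("age_in_range", lambda r: 0 <= r.get("age", -1) <= 120),
    ("email_present", lambda r: bool(r.get("email"))),
]

records = [
    {"age": 34, "email": "a@example.com"},
    {"age": 250, "email": ""},
]

print(validate(records, rules))  # -> [(1, 'age_in_range'), (1, 'email_present')]
```

A dedicated package would add rule libraries, reporting, and performance on top of this basic loop, which is exactly the part that gets tedious to maintain by hand.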

Replies to This Discussion

There are quite a few products out there, so it depends on what you mean by validation. If you're talking about data cleanup, transformation, and mapping, the best place to look is the data warehousing and master data management markets. Search for terms like "extract, transform, load" (ETL), data quality, or master data, and you'll find lots of products.

Expensive data quality products that can standardize and validate addresses, assign gender to names, and so on are available (DataFlux, Informatica's data quality module, IBM QualityStage). Open-source ETL tools where you build your own validations and mappings are also useful: Talend, Pentaho (Kettle), Apatar. SAP/BusinessObjects has nice tools under the name "Data Services".
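The build-your-own-validations pattern these ETL tools support is essentially "standardize, then route rows to a clean stream or a reject stream." Here is a rough sketch in plain Python; the field names and the reference list are illustrative, not taken from any of the products above:

```python
# A rough sketch of the transform-then-validate pattern that ETL tools
# let you build graphically. VALID_STATES is a hypothetical reference set.

VALID_STATES = {"CA", "NY", "TX"}

def standardize(row):
    # Trim whitespace and normalize case before validating.
    return {k: v.strip().upper() if isinstance(v, str) else v
            for k, v in row.items()}

def split_valid_invalid(rows):
    """Route each row to a clean stream or a reject stream."""
    clean, rejects = [], []
    for row in map(standardize, rows):
        (clean if row.get("state") in VALID_STATES else rejects).append(row)
    return clean, rejects

clean, rejects = split_valid_invalid([
    {"name": "alice", "state": " ca "},
    {"name": "bob", "state": "ZZ"},
])
print(len(clean), len(rejects))  # -> 1 1
```

The reject stream is what makes this pattern practical: invalid rows are kept for inspection and correction rather than silently dropped.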
Lousy data (or even no data and no statistical modeling at all) combined with deep domain knowledge is better than perfect data and sophisticated predictive modeling developed by data miners who lack domain expertise.

Cleaning the data is of course necessary, but identifying the right metrics, ideally well before you build a database, is absolutely critical. You want to work with people who are both good statisticians and senior product managers at the same time; such people are difficult to find.

You may explore StatSoft's STATISTICA family of products, a unified engine for the analytical cycle: 1. data acquisition (from most sources and formats); 2. data cleansing, transformation, and validation; 3. analytical algorithms (roughly 14,000 statistical routines and subroutines); 4. preparation of reports (in widely used formats). On top of that, you may explore STATISTICA ETL; most of the solution comes from a single publisher.

You may explore further for more in-depth details.
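The four-stage cycle described above (acquire, cleanse/validate, analyze, report) can be sketched as a chain of plain functions. Everything below is illustrative stand-in code using only the standard library; it does not use or represent STATISTICA:

```python
# Hedged sketch of an acquire -> cleanse -> analyze -> report cycle.
# The data, the outlier threshold, and the report format are all
# hypothetical examples.

import statistics

def acquire():
    # Stand-in for reading from a file or database.
    return [12.0, 15.5, None, 14.2, 999.0]

def cleanse(values, upper=100.0):
    # Drop missing values and implausible outliers.
    return [v for v in values if v is not None and v <= upper]

def analyze(values):
    return {"mean": statistics.mean(values), "n": len(values)}

def report(summary):
    return f"n={summary['n']}, mean={summary['mean']:.2f}"

print(report(analyze(cleanse(acquire()))))  # -> n=3, mean=13.90
```

The value of an integrated suite is that these stages share one metadata model, so a validation rule defined once applies consistently from acquisition through reporting.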




© 2020 TechTarget, Inc.