For those of us working with real data, data validation/cleansing is the most time-consuming part of the work. Data validation is an important part of any statistical modeling, since invalid data results in invalid models no matter how superior your modeling techniques are. Many of us rely on our own code to go through the data and validate it. My question is: has anyone come across a software package that can help with data validation? I do realize that a general rule-based package could be used for this, but is there something specifically for data validation?
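For context, the kind of hand-rolled, rule-based validation described above might look like this minimal Python sketch (the field names and rules are made-up examples for illustration, not from any particular package):

```python
# Illustrative sketch of hand-rolled, rule-based data validation.
# The fields ("age", "income") and rules are hypothetical examples.

def validate_record(record, rules):
    """Return the names of the rules that the record violates."""
    return [name for name, check in rules.items() if not check(record)]

rules = {
    "age_in_range": lambda r: 0 <= r.get("age", -1) <= 120,
    "income_nonnegative": lambda r: r.get("income", 0) >= 0,
}

record = {"age": 200, "income": 50000}
print(validate_record(record, rules))  # ['age_in_range']
```

A general-purpose package would essentially let you declare such rules once and apply them across datasets, rather than rewriting the loop for each project.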

Replies to This Discussion

I've built a "fuzzy logic" transformation using Matt Casters' Kettle (ETL) program that works reasonably well at cleaning data. I want to develop a machine-learning version in the future.
What's the application/context?
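For readers unfamiliar with the fuzzy-matching idea mentioned above, here is a minimal Python sketch using the standard library's difflib (this is an illustration of the general technique, not Kettle's actual implementation; the city names are made up):

```python
# Illustrative fuzzy string matching for data cleaning, using
# Python's stdlib difflib rather than any ETL tool's own matcher.
from difflib import SequenceMatcher

def best_match(value, candidates, threshold=0.8):
    """Map a messy value to the closest canonical candidate, if similar enough."""
    scored = [(SequenceMatcher(None, value.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

cities = ["New York", "Los Angeles", "Chicago"]
print(best_match("new yrok", cities))  # New York
```

A real cleaning pipeline would apply a step like this to every dirty column against a canonical reference list, flagging values that fall below the threshold for manual review.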
Thanks for your reply, Robin.

The question is rather general, not specific to one application. For a single specific application, routine data validation would do; I am wondering whether there are general-purpose data validation packages that can be tailored to specific applications.
Kettle is probably the most powerful data app that I've encountered - at least among the free ones! It can be tailored to input/output in any format (csv, xls, mdb, mysql, xml, etc.).

The "first best" solution would be to ensure valid input in the first place - this will depend on the platform you're using. Without wishing to sound patronising, the best way to codify data is to use relational databases, as the code-to-description mappings are portable (unlike SPSS, to my knowledge).

If you're talking about retrospective classification (i.e. of open-ended answers), then perhaps you should think about text parsers that tag keywords. I've got a rudimentary Excel formula that compares fields with the column heading; you just choose keywords and add them as column headings to codify open text.
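The keyword-tagging approach described above translates directly to a few lines of Python (a sketch only; the keywords and sample responses are invented for illustration):

```python
# Minimal sketch of keyword tagging for open-text answers, analogous
# to comparing fields against keyword column headings in Excel.
# Keywords and responses below are hypothetical examples.

def tag_response(text, keywords):
    """Return the keywords (column headings) found in a free-text answer."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]

keywords = ["price", "quality", "delivery"]
responses = [
    "The price was too high but delivery was fast",
    "Great quality overall",
]
print([tag_response(r, keywords) for r in responses])
# [['price', 'delivery'], ['quality']]
```

Each keyword plays the role of a column heading, and each response is codified by the set of headings it matches.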

We have developed our own in-house data validation and quality-check software, which has evolved over the past 7 years into an extremely reliable solution for us. Indeed, it is marketed and sold externally too.

- Fractal Analytics, India, Singapore, US
Can you provide more info about your solution: is it part of a stats package or stand-alone, what language, any related white paper, price, trial availability, etc.?

Thanks a lot!
Hi Mehran, you may explore the product STATISTICA 9.0 for the full analytical cycle: i. data acquisition; ii. data cleaning, transforming, validation & preparation; iii. statistical principles & algorithms; iv. reporting (dissemination of information).

You may explore the product in more depth for further details.

You may reach us for further information & assistance: [email protected]

I know that KXEN has a powerful tool for data mining. In particular they say that they have tools for automatic data cleansing, data validation, etc. However, I have never used the tool myself.
I've used KXEN. It has very powerful and robust general methods for transforming numeric data. As near as I can tell from reading the scoring code, it does fairly complicated linear splines. However, as an automated modeling package it has its weaknesses. I've had it return a "best model" with the client's first name as a critical predictive variable, or ZIP code as a categorical variable.

I've found KXEN to be very good -- once one has done basic variable selection and data validation oneself, outside of KXEN.
Thanks for sharing. It seems that you used KXEN for modeling, but how about data validation? Are you saying it's not suitable for data validation, or that you didn't use it that way?
I didn't use it for data validation, but given that I felt that KXEN's claims for automatic model building were rather off, I would not take any claims of automatic data validation at face value.
We used QualityStage to analyze and improve data quality. This tool is part of Ascential's (now IBM) data integration suite.
One of the best tools I've worked with for data validation is dfPower Studio by DataFlux. Data validation is just one part of the package. It can also do fuzzy merge matching, standardization of data (e.g. state and city names), and outlier detection, as well as implementing your own rule-based logic.

It also does ETL and master data management.

Ralph Winters
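As a point of reference, the outlier detection such tools automate can be sketched in a few lines of Python using the classic interquartile-range (IQR) rule (this is a generic illustration with made-up data, not DataFlux's actual method):

```python
# Simple outlier flagging via the interquartile-range (IQR) rule:
# values outside [Q1 - k*IQR, Q3 + k*IQR] are flagged.
# The data values below are illustrative.

def iqr_outliers(values, k=1.5):
    """Return the values flagged as outliers by the IQR rule."""
    s = sorted(values)
    n = len(s)
    q1 = s[n // 4]          # crude quartiles via index; fine for a sketch
    q3 = s[(3 * n) // 4]
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(data))  # [95]
```

Commercial packages add value mainly by combining many such checks (type, range, pattern, cross-field) behind one rule-configuration interface.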


On Data Science Central

© 2020 TechTarget, Inc.

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service