I really liked the reviews of the data mining products and programs, but I am still at a loss. I never formally studied data mining, yet it seems that is the business I have been in for close to 10 years. I collect and sell data to the auto industry: prices on services and products, such as tires, that a dealership would sell, as well as prices from all the aftermarket chains (Sears, Firestone, Pep Boys, etc.). I gather this information through my people and sell it to the industry. I have collected data on American vehicles, imports and even high-end imports for basic services such as oil changes, front-end alignments, tires and more. I have over 3 million real prices in my database, covering just about every vehicle ever made.
Looking through the reviews, I am a little lost as to which product, if any, would work well for me. I want to offer the media and others some conclusions drawn from the data I have gathered for my clients over the past 10 years. Can you help?
What is the best product for processing large data sets (50MM rows, 20 fields) very efficiently in batch mode? What do you recommend for real-time processing? How much does SAS Enterprise Miner cost? Is it worth it?
I am not very experienced with competing statistical and data mining software, but we have a review showing that STATISTICA Data Miner is about twice as fast as SAS Enterprise Miner.
If I remember correctly, SDM also costs substantially less than SAS EM. I am a consultant for StatSoft in the Czech Republic, so I am not objective, of course. :-)
We use a combination of SQL Server and R. SQL Server, thanks to its relational DBMS technology, is very fast at sifting through millions of records and processing them on the fly. It's good for aggregations, math calculations, cleaning, de-duping, etc., either on a single data set or by joining multiple data sets on a primary/foreign key.
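To make that preparation step concrete, here is a minimal sketch of de-duping, joining on a primary/foreign key and aggregating in a single query. It is only an illustration: Python's built-in sqlite3 module stands in for SQL Server, and the table names, column names and prices are all made up for the example.

```python
import sqlite3

# In-memory SQLite is a stand-in for SQL Server here.
# Table names, column names and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vehicles (vehicle_id INTEGER PRIMARY KEY, make TEXT);
CREATE TABLE prices (
    vehicle_id INTEGER REFERENCES vehicles(vehicle_id),
    service TEXT,
    price REAL
);
INSERT INTO vehicles VALUES (1, 'Ford'), (2, 'Toyota');
INSERT INTO prices VALUES
    (1, 'oil change', 30.0),
    (1, 'oil change', 30.0),
    (1, 'oil change', 40.0),
    (2, 'oil change', 50.0);
""")

# De-dupe exact duplicate rows, join to the vehicle table on the key,
# then aggregate per make and service.
query = """
SELECT v.make, d.service, COUNT(*) AS n, AVG(d.price) AS avg_price
FROM (SELECT DISTINCT vehicle_id, service, price FROM prices) AS d
JOIN vehicles AS v ON v.vehicle_id = d.vehicle_id
GROUP BY v.make, d.service
ORDER BY v.make
"""
summary = conn.execute(query).fetchall()
for row in summary:
    print(row)
```

The duplicate Ford row is dropped by the SELECT DISTINCT before the average is taken, so it does not skew the result. On a real SQL Server instance the same query shape should apply, though the connection code would differ.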
Once the data is prepared in SQL Server, we read it into R for any statistical mining. Some analyses require the data in a multi-dimensional array (e.g., array[1000,400,3]), which R handles well. The two products can be integrated seamlessly, without extracting the processed data into an intermediate file. Also, if the raw data is in flat files or some other format, the load into SQL Server can easily be scripted to run automatically every time the input data set changes.
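As a rough illustration of reading query results straight into a multi-dimensional array with no intermediate file, here is a small sketch. It is not the actual SQL Server to R setup: sqlite3 again stands in for the database, nested Python lists stand in for an R array, and the obs table, its columns and the load_cube helper are hypothetical names invented for the example.

```python
import sqlite3

# Stand-in database with one row per (i, j, k) cell; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (i INTEGER, j INTEGER, k INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?, ?, ?)",
    [(0, 0, 0, 1.5), (0, 1, 2, 2.5), (1, 0, 1, 3.5)],
)

def load_cube(conn, shape):
    """Fill a 3-D nested list directly from a query cursor, no file in between."""
    n_i, n_j, n_k = shape
    # Cells with no row in the table stay at 0.0.
    cube = [[[0.0] * n_k for _ in range(n_j)] for _ in range(n_i)]
    for i, j, k, value in conn.execute("SELECT i, j, k, value FROM obs"):
        cube[i][j][k] = value
    return cube

cube = load_cube(conn, (2, 2, 3))
print(cube[0][1][2])
```

The rows are streamed from the cursor one at a time, so only the array itself has to fit in memory.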
SQL Server is affordable, and R is free. The cost is nothing compared to SAS or Statistica. There are also mining algorithms built into the SQL Server suite now, in a component called SQL Server Analysis Services.
I tried to run a market basket analysis in R, but it did not complete because of memory limitations. I know that SQL Server Analysis Services includes this analysis, and I am curious whether you have used it for this. Did you ever have problems with R because your data was too big for it to handle?
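For what it's worth, one common way around the memory problem is to avoid loading all transactions at once and to prune rare items before counting pairs, which is the idea behind the first two passes of the Apriori algorithm. The sketch below is only an illustration of that idea, not what R or SQL Server Analysis Services does internally. The frequent_pairs function, the read_baskets callback and the sample baskets are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(read_baskets, min_count):
    """Two streaming passes: count single items, then pairs of frequent items only.

    read_baskets is a callable returning a fresh iterator of baskets,
    so the data can be re-read from disk or a database on each pass.
    """
    # Pass 1: count how many baskets contain each item.
    item_counts = Counter()
    for basket in read_baskets():
        item_counts.update(set(basket))
    frequent = {item for item, c in item_counts.items() if c >= min_count}

    # Pass 2: count pairs, skipping items that cannot be part of a frequent pair.
    pair_counts = Counter()
    for basket in read_baskets():
        kept = sorted(set(basket) & frequent)
        pair_counts.update(combinations(kept, 2))
    return {pair: c for pair, c in pair_counts.items() if c >= min_count}

# Made-up sample data; in practice read_baskets would stream from a file or query.
sample = [
    ["oil change", "tires", "alignment"],
    ["oil change", "tires"],
    ["oil change", "wipers"],
    ["tires", "alignment"],
]
pairs = frequent_pairs(lambda: iter(sample), min_count=2)
print(pairs)
```

Only the counters are held in memory, never the full transaction set, and pruning in the first pass keeps the pair counter much smaller when most items are rare.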
Great, Kiran! Which library do you use in R? I have the same setup as you, with MS SQL Server. I don't like doing data mining in MS SQL, because the regression model is implemented as a neural network, which makes the statistical model very difficult to explain.