R / Splus / SPSS Clementine / JMP / Salford Systems memory limitations

These products store your entire data set in memory (RAM), then process it. If your data set has more than 500,000 rows (even after significant summarizing to reduce its size), these tools will crash on most platforms. How do you get around this? Unless you use SAS or SQL Server Data Mining or a few other products (which ones?), my feeling is that you have to write your own code in a high-level language such as C++ or Java (or Perl / Python if lots of string processing is required), combined with powerful sorting tools such as syncsort (which lets you work with small hash tables or stacks) and powerful string-matching tools such as grep.

How do you handle this problem? Do you proceed differently? Please don't tell me I should do sampling - I cannot afford to sample our very large data set because it is not well balanced.
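
For what it's worth, the hand-rolled approach described above can stay within a tiny memory footprint. A minimal Python sketch, assuming a delimited file with hypothetical customer_id and amount columns (any pre-sorting or pre-filtering with syncsort/grep would happen before this step):

```python
# Sketch of the "write your own streaming code" idea: aggregate a huge
# delimited file one row at a time, keeping only a small hash table of
# per-key totals in memory. File name and column names are hypothetical.
import csv
from collections import defaultdict

def aggregate(path, key_col, value_col):
    totals = defaultdict(float)   # small hash table: one entry per distinct key
    counts = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):   # rows stream one at a time, never all in RAM
            totals[row[key_col]] += float(row[value_col])
            counts[row[key_col]] += 1
    return {k: (counts[k], totals[k]) for k in totals}

if __name__ == "__main__":
    for key, (n, total) in aggregate("transactions.csv", "customer_id", "amount").items():
        print(key, n, total)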


Replies to This Discussion

Have you taken a good look at Hadoop? With Hadoop streaming, you can use anything to analyze your data, as long as you can recast the problem in the map-reduce paradigm. I write all my code in Ruby, and there are some good Python frameworks (e.g., Dumbo), but you could absolutely use R where it's a good match to the problem. Hadoop lets you scale out rather than up, and scale arbitrarily. (The parallelization of map-reduce is astonishingly near-linear.)

I was able to turn the idle time of a computer lab at my university into a 70-machine cluster with ease, and Amazon EC2 lets you pull down as many computers as you care to request for very little. (CPU time for a 20-machine cluster over a 2,000-hour working year costs $4,000.) Using one-off Ruby scripts and Hadoop Pig, I regularly query and model datasets with 80M row-cardinality in several dimensions. Up in this several-hundred-GB range it's clearly not interactive, but you can extract an appropriately reduced dataset to qualify your analysis and then run it at scale on the full mama pajama.
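
In case it helps anyone reading along, here is a minimal sketch of the Hadoop streaming pattern in Python; the per-key count and the tab-delimited field layout are assumptions, and frameworks like Dumbo or Pig wrap this kind of thing for you:

```python
# Hadoop streaming sketch: the mapper and reducer just read stdin and write
# stdout, so any language works. The key is assumed to be the first
# tab-delimited field of each record.
import sys

def mapper():
    # emit "key<TAB>1" for every input record
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            sys.stdout.write(fields[0] + "\t1\n")

def reducer():
    # Hadoop sorts mapper output by key, so equal keys arrive consecutively
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key and current_key is not None:
            sys.stdout.write(current_key + "\t" + str(count) + "\n")
            count = 0
        current_key = key
        count += int(value)
    if current_key is not None:
        sys.stdout.write(current_key + "\t" + str(count) + "\n")

if __name__ == "__main__":
    mapper() if "map" in sys.argv[1:] else reducer()
```

On the cluster you hand the script to the Hadoop streaming jar as both the -mapper and -reducer commands (with "map" passed as an argument on the mapper side).
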
R and open source software are great tools. But they're limited with respect to volume.

I've found that SPSS gives you the best cost/benefit. Since V14, it's truly become industrial strength, and its limits are more a function of the hardware than of SPSS itself.

For example, I've used SPSS to process Experian's compiled file ... 120 million rows of data, with each record being fairly wide. No crashing at all.

Coding-wise, SPSS (and, of course, SAS) has a syntax language that's pretty powerful. You can also have it generate the code for you, which you can either use as is or modify.

Although SPSS interacts with other languages, like Python, it'll do just about anything I need on its own. As another example, over time I've written a lot of data quality, transformation, and matching programs in SPSS that I can incorporate as well.

It is possible, and sometimes very helpful, to sample differentially from unbalanced data. At the extreme, take all of the least frequently occurring class and a fraction of the others. Taking an equal number of each is the idea used in case-control studies, and is also at the core of what makes boosting work, so it may be worth experimenting with.
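
To make that concrete, here is a one-pass differential sampling sketch in Python; the column name, rare-class label, and the 5% majority rate are all placeholders. If you sample this way, remember that downstream probabilities or case weights should be corrected for the unequal sampling rates.

```python
# Keep every record of the rare class and a random fraction of the rest,
# in a single streaming pass. Column names and rates are illustrative only.
import csv
import random

def differential_sample(path, out_path, class_col, rare_value,
                        majority_rate=0.05, seed=42):
    rng = random.Random(seed)
    with open(path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row[class_col] == rare_value or rng.random() < majority_rate:
                writer.writerow(row)

# differential_sample("customers.csv", "sample.csv", "is_fraud", "1", majority_rate=0.02)
```
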
The open source data mining software RapidMiner can handle very large data sets and lets you freely choose between fast in-memory data mining and extremely scalable in-database data mining.

Among the users of RapidMiner in more than 40 countries are some of the world's largest companies, e.g.
* Lufthansa, the leading European airline,
* mobilkom austria, leading Austrian mobile phone service provider,
* Bank of America, leading US bank,
* BNP Paribas, leading European bank,
* Sanofi-Aventis, leading European pharma company,
* HP, Nokia, Philips, Miele, and many more.

Their data sets often include millions of transactions or records or text documents.

In addition to the RapidMiner Community Edition, which can be downloaded free of charge, there is also the RapidMiner Enterprise Edition, with 64-bit and multi-core parallelization support as well as professional technical support with guaranteed response times.

For more information please visit: www.rapid-i.com

Vincent,

Save yourself the headache and get a desktop copy of SAS. I've comfortably processed 200+ GB files on my desktop for clients without issues and without the need for extra workstations, advanced hardware, etc. And the good thing is that SAS has a very small memory footprint on your machine, something on the order of 200 MB or so. So even if you only have a couple of GB of RAM, you should be just fine.

Bill

Still, SAS (SAS/Base and SAS/Stat at a minimum) is much more expensive than a high-end PC with lots of RAM. I would definitely go with a great PC with lots of RAM, since that is useful in many other situations, and then use one of the open source alternatives.

Agreed. With the cost of a SAS license you could set up a nice PC cluster farm with lots of RAM in each node.

PASW Modeler (formerly known as SPSS Clementine) does not store data in RAM (except for the data needed at any step in an algorithm). I'm not sure about the other products. PASW Modeler uses SQL optimization to get maximum benefit from the database containing the data (Oracle, SQL Server, DB2, Netezza, Teradata, ...). It also has transparent integration with in-database mining algorithms (SQL Server, Oracle, DB2) and scoring engines. Some of the applications of PASW Modeler involve many millions of records (10M+).
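
This is not PASW Modeler syntax, but the SQL push-back idea itself is easy to illustrate: ask the database to do the heavy aggregation and return only the summary. A small sketch using Python's built-in sqlite3 as a stand-in for Oracle/SQL Server/DB2, with a hypothetical customers table:

```python
# Let the database run the GROUP BY and return only a few summary rows,
# instead of loading millions of rows into client memory.
# Table and column names are hypothetical.
import sqlite3

def fraud_rate_by_segment(db_path):
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            """
            SELECT segment,
                   COUNT(*)            AS customers,
                   AVG(is_fraud * 1.0) AS fraud_rate
            FROM customers
            GROUP BY segment
            """
        ).fetchall()
    finally:
        conn.close()
```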

Hi Jaap Vink,

            Thanks for the helpful information. I am using PASW Modeler to predict fraudulent customers. I have built a model using around 3,000 customer records along with their fraud history, and I use that model in a new stream to score data for new individual customers. It returns a predicted result (i.e., T/F) and a probability value. Am I going about predicting new customers from previous customers the right way? If not, please suggest a solution. Any suggestion would help us.

 

Thanks in Advance,

Ashok B.

Ashok --

     It sounds like what you're doing is correct.  Fraud can sometimes be tricky due to the low frequency of fraud (hopefully your business has only a handful of fraudulent customers).  

     Am I correct in assuming that for the 3000 customer records you're using to build this model, you have BOTH some customers who you know committed fraud and some customers who did not?  If this is the case, then you would have a binary (Yes/No) variable that indicates whether the customers committed fraud -- this is your "target" variable.  How many fraud and non-fraud customers are in your 3000 customer sample?  What types of variables are you using in the predictive model (e.g., is it transaction history, demographics, etc.)?  How you code your predictor variables can sometimes be important for both model building and model scoring.
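
For what it's worth, outside of Modeler the same setup looks roughly like this in open-source Python (scikit-learn). The file, the three predictors, and the is_fraud column below are made up for illustration, and class weighting is just one way of handling how rare fraud usually is:

```python
# Illustrative only: a binary fraud target, hypothetical predictors, and
# class weighting to compensate for the rarity of fraud cases.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("customers.csv")                  # e.g. 3,000 labelled customers
X = df[["n_transactions", "avg_amount", "account_age_days"]]
y = df["is_fraud"]                                 # 1 = known fraud, 0 = not

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
print(model.predict_proba(X_test)[:5, 1])          # fraud probabilities for scoring
```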

     If you don't have a binary fraud target variable, there are still some good things you can do.  But I would need to know more about your data and business situation to help you.  You can contact me directly at [email protected] if you want to discuss this further.  We're a small analytic consulting firm, and we've conducted analytics to identify fraud for clients in several industries.

     Good luck!  --  Karl Rexer, PhD

Hi Karl,

          Thanks for your quick reply. I am going to contact you from my personal email ID; please respond to that one.

 

Thanks in Advance,

Ashok B.

Take a look at DataRush, a new product from Pervasive Software (http://www.pervasivedatarush.com). As a disclosure, I work for Pervasive on the DataRush product. It is a platform for building scalable applications that we are using to build out data mining operators/applications. It is built on dataflow concepts and so allows the pipelining of data through the system. As such, it can work on large volumes (millions or billions of records) without having to hold the whole dataset in memory at one time.
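
This is not the DataRush API, but the dataflow/pipelining idea is easy to picture with a toy Python generator pipeline, where records stream through chained operators and the full dataset never sits in memory at once (file name and fields are placeholders):

```python
# Toy dataflow pipeline: each stage pulls records from the previous one,
# so only one record is "live" at a time.
import csv

def read_records(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)          # one record at a time

def clean(records):
    for r in records:
        if r.get("amount"):                   # drop rows with a missing amount
            yield {**r, "amount": float(r["amount"])}

def running_total(records):
    total = 0.0
    for r in records:
        total += r["amount"]
        yield {**r, "running_total": total}

if __name__ == "__main__":
    pipeline = running_total(clean(read_records("ratings.csv")))
    n = sum(1 for _ in pipeline)              # consume the stream
    print("processed", n, "records without materializing them")
```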

We'll be at KDD in Paris later next month presenting a paper on our experience using DataRush to process the Netflix data.
