R / Splus / SPSS Clementine / JMP / Salford Systems memory limitations

These products store your entire data set in memory (RAM), then process it. If your data has more than 500,000 rows (even after significant summarizing to reduce its size), these tools will crash on most platforms. How do you get around this? Unless you use SAS, SQL Server Data Mining, or a few other products (which ones?), my feeling is that you have to write your own code in a high-level language such as C++ or Java (or Perl / Python if lots of string processing is required), combined with powerful sorting tools such as syncsort (which help you work with small hash tables or stacks) and powerful string matching tools such as grep.

How do you handle this problem? Do you proceed differently? Please don't tell me I should do sampling. I cannot afford to sample our very large data set because it is not well balanced.

Replies to This Discussion

KNIME does not have the memory limitation. Its native processing nodes can handle arbitrarily large data files as long as they fit on disk somewhere. KNIME is open source and supports open standards like the Predictive Model Markup Language (PMML) which allows users to exchange models with various other tools like R, SPSS, SAS, etc.

The PMML export also allows users to deploy models instantly in a production environment, e.g., using the ADAPA scoring engine, for real-time or batch scoring and integration with other systems. Zementis also provides a free PMML converter to move older PMML exports to the latest 3.2 format of the PMML standard, and a support blog with articles on Predictive Analytics and the PMML standard.

KNIME is not open source software according to the OSI definition, because it discriminates against commercial use.

500,000 rows isn't a large data set by today's standards, and on a modern machine with at least 2 GB of RAM I'd expect in-memory applications like R to handle a data set like that with no problems whatsoever. (It depends how many columns you have, of course.)

Even if the data set is large, you're probably not going to use the entire data set for the final analysis. Do the data preprocessing (column and row selection) in an external database, and load in the data for analysis directly from there. (R for example has commands to read data directly from relational databases.) This usually happens implicitly in SAS at the DATA STEP; the PROC that is doing the analysis typically only sees a fraction of the original data file.
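As a rough sketch of the "filter in the database, then load" idea above: the poster refers to R, but the same pattern is shown here in Python with the standard-library sqlite3 module. The table name `transactions`, its columns, and the helper `load_filtered` are made up for illustration, and an in-memory database stands in for a real database server.

```python
import sqlite3

def load_filtered(conn, min_amount):
    """Let the database do row and column selection; only the result reaches RAM."""
    cur = conn.execute(
        "SELECT client_id, amount FROM transactions "
        "WHERE amount >= ? ORDER BY client_id",
        (min_amount,),
    )
    return cur.fetchall()

# Hypothetical demo table; a real one would live on disk or on a server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (client_id INTEGER, amount REAL, note TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, 5.0, "a"), (2, 50.0, "b"), (3, 500.0, "c")],
)

# Only two of the three columns, and only the matching rows, are loaded.
rows = load_filtered(conn, 10.0)
print(rows)
```

The analysis code then only ever sees the reduced result set, not the full table.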

Finally, 64-bit systems eliminate many of the limitations of 32-bit systems, which are often limited to 2 or 3 GB of usable memory (regardless of how much is actually installed). Upgrade to a 64-bit system and statistics application, and you'll find you can immediately process much larger data sets. It's not commonly known, but you'll get that benefit even with the same amount of RAM as an equivalent 32-bit system. 64-bit systems can address much larger virtual memory spaces (but adding more RAM will probably make things run faster).

Let's say you get a machine with 64 GB of RAM. Can R or Perl (whose famously efficient hash tables crash when they reach 4MM entries) take advantage of that much RAM efficiently? Or are they somehow limited and unable to use it?

As long as you have a 64-bit CPU!

I've installed a network of 64-bit servers (running Debian GNU/Linux) to run R on larger data, each server with "just" 32 GB of RAM but capable of 128 GB each (a physical and cost limitation, I think). R can take advantage of all the memory and more (i.e., virtual memory). We can now load and analyse much larger datasets within Rattle. Empirically, we can load many millions of rows. Loading data is not usually the problem; rather, it is the algorithms used to analyse the data and how efficiently they handle it.

Did you say you use or are considering JMP (SAS's statistical analysis tools)?

We've deployed several 64-bit servers loaded with 32 GB of RAM. For our particular datasets we've processed ~70 million+ rows. You do need to run the 64-bit version of JMP, however.

A slight correction: SAS procedures are smart enough to read only the variables that are actually used for each analysis. SAS users don't have to use the DATA step to filter out variables before conducting an analysis.

Hey, get your facts straight!

SPSS Clementine does *not* "store your entire data set in memory (RAM)", regardless of your source data format. I don't know about the other applications you mention, but I don't *think* Salford stores all data in memory.

A lot of customer-focused data mining is done using simple SQL, so most database platforms are OK for scalable processing.

Sorry if I sound rude, but maybe you should try using the commercial tools. I get the impression you haven't. If cost is an issue, then SQL Server is probably your best bet. Sure, it requires some programming, and a lot more time and effort, but you could get the same results in the end.

You should take a look at Debellor, a new data mining framework designed exactly to solve the problem mentioned by Vincent. Thanks to its stream-oriented architecture, Debellor enables you to run sophisticated analysis while avoiding full data materialization and memory overflow.

Note that the problem of memory overflow is very common in many data mining tasks. Even when data are small at the beginning of analysis, they may suddenly "explode" at an intermediate stage. This is very typical in mining time series or images, where all possible windows of a specified length must be produced from a single series or image, giving rise to a hundred- or thousand-fold increase in total data size. In such cases, even swapping data to disk (by the OS, or internally as in KNIME) can't help. The only solution is to produce and consume samples on the fly, which is possible only in a stream-oriented architecture.
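To make the "produce and consume on the fly" point concrete, here is a minimal Python sketch using a generator. It is not Debellor's API; the names `sliding_windows` and `max_window_sum` are invented for this example. Each window is handed to the consumer and then discarded, so memory use stays proportional to the window length rather than to the number of windows.

```python
from collections import deque

def sliding_windows(series, length):
    """Yield every window of the given length, one at a time."""
    window = deque(maxlen=length)  # the oldest value drops out automatically
    for value in series:
        window.append(value)
        if len(window) == length:
            yield tuple(window)

def max_window_sum(series, length):
    """Consume windows as they are produced; never store them all.

    Assumes the series has at least `length` values.
    """
    return max(sum(w) for w in sliding_windows(series, length))

print(max_window_sum([1, 5, 2, 8, 3], 2))
```

The series itself can also be an iterator (for instance, lines read from a file), so neither the input nor the windows need to be fully materialized.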

You can read more in the recent paper: M. Wojnarski, Debellor: a Data Mining Platform with Stream Architecture, Transactions on Rough Sets IX, LNCS 5390, pp. 405-427, 2008.

Blue Sky Technology is planning to release some statistical analysis software in the near future. This system will not have the problems with data capacity that you've mentioned.

If you could tell me the processing or analysis functions that you're planning to use I will let you know if these tools would be suitable.

Mark McIlroy

You could try using the ff package in R, or the BigData library in Splus. Both handle datasets by working on chunks of the data instead of the complete data.

If you have 500,000 rows and 100 columns, you have 50 million data points. Assuming all are 8-byte doubles, that comes to 400 MB. That is large, but not so extreme that it should crash most platforms...
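The back-of-the-envelope estimate above can be written out as follows. This is only a lower bound for a dense table of doubles; the function name is made up for illustration, and real tools add overhead for copies, indexes, and non-numeric columns.

```python
def dense_size_bytes(rows, cols, bytes_per_value=8):
    """Raw size of a dense numeric table, ignoring any per-object overhead."""
    return rows * cols * bytes_per_value

size = dense_size_bytes(500_000, 100)
print(f"{size / 1e6:.0f} MB")
```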

Just curious, why would sampling from datasets that are not well balanced not work?

I'd like to avoid sampling because my database has a limited number of clients, even though it has a large number of observations. Also, I can't sample observations because each observation is part of a user session, and I need entire user sessions. I also need entire IP subnets, entire user agent data, entire referrer data... this makes sampling very complicated. For instance, subnets span multiple referrers, sessions span multiple subnets, etc. I need to compute stats such as unique user agents per subnet, etc.
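A statistic like "unique user agents per subnet" can be computed in one pass over the full log without sampling and without loading every row. The sketch below is one possible approach in Python, not what the poster actually used. The two-column log layout (IP, user agent), the /24 subnet size, and the function name are assumptions. Memory grows with the number of distinct (subnet, user agent) pairs, not with the number of rows.

```python
import csv
import io
import ipaddress
from collections import defaultdict

def unique_agents_per_subnet(rows, prefix=24):
    """Stream (ip, user_agent) pairs and count distinct agents per subnet."""
    agents = defaultdict(set)
    for ip, user_agent in rows:
        # strict=False lets a host address be mapped to its enclosing network
        subnet = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        agents[str(subnet)].add(user_agent)
    return {subnet: len(seen) for subnet, seen in agents.items()}

# Hypothetical log; in practice this would be an open file read line by line.
log = io.StringIO(
    "10.0.0.1,Firefox\n"
    "10.0.0.2,Chrome\n"
    "10.0.0.3,Firefox\n"
    "10.0.1.9,Chrome\n"
)
counts = unique_agents_per_subnet(csv.reader(log))
print(counts)
```

If even the distinct pairs do not fit in memory, sorting the log by subnet first (with an external sort tool) would allow the counts to be emitted one subnet at a time.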

