Subscribe to DSC Newsletter

R / Splus / SPSS Clementine / JMP / Salford Systems memory limitations

These products store your entire data set in memory (RAM), then process it. If your data has more than 500,000 rows (even after significant summarizing to reduce the size of the data set), it means that these tools will crash on most platforms. How do you get around this? Unless you use SAS or SQL Server Data Mining or a few other products (which ones?), my feeling is that you have to write your own code in a high level language such as C++ or Java (or Perl / Python if lots of string processing is required), combined with powerful sorting tools such as syncsort (this will help you work with small hash table or stacks), and powerful string matching tools such as grep.

How do you handle this problem? Do you proceed differently? Please don't tell me I should do sampling - I can not afford to do sampling on our very large data set because it is not well balanced.

Views: 3917

Reply to This

Replies to This Discussion

If you have a huge data set, then in most cases it makes no sense to process it all, since unless you have sequential dependency in the data (when most statistical methods cannot be used anyway), properly applied sampling is the answer and will produce the same results in minutes and not hours or days.

If you feel strongly about processing all your data, then SAS Enterprise Miner will not be of much help (and will quit pretty quickly), also Clementine will not cut it, at least not until they release a 64-bit version later this year (they have only a 32-bit version now). You will either need to write your own code, use the (limited) data mining procedures embedded in SQL Server or Oracle, or use Statistica Data Miner (which is 64 bit, and can scale in parallel to as many CPU's as you have installed on the server or cluster or servers).
Oracle Data Mining (ODM) can handle huge datasets. It can be run on a variety of hardware, or on the Amazon cloud. There are no memory limits. ODM is just an option on the database (and is often installed by default). It has 12 data mining algorithms. I've been using SVMs, Naive Bayes, Decision trees, K-Means clustering, and logistic regression. Many traditional statistical procedures are also available. ODM also makes use of Oracle Text, so many of the ODM algorithms can accept unstructured text as input variables.

A friend of mine probably has some performance numbers at his fingertips. I'll ask him to provide a response to this discussion including number of cases, number of variables (columns), name of data mining procedure, hardware and run-time.

In the meantime, here are a few links:
-- Oracle ODM product page:
-- ODM blog:
-- Oracle statistical functions;

I hope this helps you

Good luck !
The anecdoral story I remember was when we built a Logistic Regression model on 100K documents (MEDLINE) abstracts. Oracle Text had created 800K tokens (words) which we pushed throught the ODM SVM model. On a single cpu, it build a very good model in 8 minutes.

ODM provides real-world scalability—available for mission critical appls e.g.:
- Fast scoring: 2.5 million records scored in 6 seconds on a single CPU system
- Real-time scoring: 100 models on a single CPU: 0.085 seconds

Oracle Data Mining takes advantage of all the Oracle Database's parallism, memory management, etc. A data mining model is a schema object in the database, built via a PL/SQL API and scored via built-in SQL functions.

When building models, Oracle Data Mining functions leverage existing scalable technology
(e.g., parallel execution, bitmap indexes, aggregation techniques) and add new core database technology (e.g., recursion within the parallel infrastructure, IEEE float, etc.)

Performance benchmarks in data mining are difficult due to the exact dataset used and other variables, but we've published some performance papers (attached). Excerpting just a bit, on a 4 cpu/7 GB mem Linux Red Hat environment, and with 252 variables, 100K dataset, ODM builds models:

Naive Bayes - 16 seecs
Support Vector Machine - 150 secs
K-Means - 118 secs

These numbers are from 2006 running on 10g (we're at 11gR2) and ignore Exadata "smart scan" SQL model push down, but when the paper was witten, we scored 100 decision tree models on 100M records in 5 hrs 24 mins. One the model has been built, asking the database to "score" a record with a previously built dm models add about the same overhead of calculating a square root. ODMs can realistically be used in OLTP transactional applications.

See attached tech performance papers posted at

Among the many benefits, the Oracle environment: eliminates data movement, eliminates data duplication, and preserves security.

By building and scoring data mining models natively inside the database, the data remains in the database for fewer moving parts and shorter information latency. Ir provides a excellent platform for developing PA/DM Applications and enables straightforward inclusion within interesting and arbitrarily complex queries e.g.:

“SELECT Customers WHERE Income > 100K, AND PREDICTION_PROBABILITY(Buy Product A) > .85;”

In-DB model building and scoring enables pipelining of results without costly materialization.

The true power of embedding within the database is evident when scoring models using built-in SQL functions (incl. Exadata). In the SQL apply query below, the churnmod row-wise function is push down to the storage (hardware) layer for execution before results are pulled inside the database for futher query processing.

select cust_id
from customers
where region = ‘US’
and prediction_probability(churnmod, ‘Y’ using *) > 0.8;

Competitive alternatives, e.g. SAS + Netezza, partnerships are attempting to create similar peformance numbers by counting on User Defined Functions for everything on specialized hw), but you don't achieve the tighted integration, elimination of data movement, in-DB model build and apply, in-DB data transforms using SQL, etc. that you realize by having built data mining from the inside out of the database. Hope this helps.

Charlie Berger
Sr. Dir. PM, Oracle Data Mining Technologies
Wanted to add this diagram of the Exadata "smart scan" model scoring where ODM models get executed at the hw storage layer. Pushing models to Exadata storage has yielded a 2-5X performance improvement. 2X for pushing to hw storage and 1-nX more peformance depending on the # of JOINS used to create the VIEW or TABLE that is scored.

You can use 64bit versions of R or Splus. Your limitation is the amount of RAM, so 64G or 128G are utilized easily. Go with a Linux box if you can especially if you want to use Splus. I have processed data sets with over 100MM rows.


On Data Science Central

© 2021   TechTarget, Inc.   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service