Does anyone have experience with the performance of data mining and machine learning algorithms in R? I have been collaborating with OpenBI and they indicated that memory limitations severely impair what you can do with R.

I am interested in using R, or modifying R, to mine TB datasets and was wondering if others have thought about or have direct experience trying to do this.

Tags: R, cloud computing, data mining, machine learning

Views: 705

Replies to This Discussion

I have used the clara algorithm in the cluster library on very large datasets. As long as I stay under 200 clusters, I haven't had too much trouble with several hundred thousand records.
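
For what it's worth, the kind of call I mean looks roughly like this (the data and parameter values are only illustrative, not my actual dataset):

# Sketch with simulated data; k, samples and sampsize are placeholders.
library(cluster)

set.seed(1)
n <- 500000                            # several hundred thousand records
x <- matrix(rnorm(n * 10), ncol = 10)  # 10 numeric features

# clara repeatedly applies PAM to random sub-samples, so memory use
# stays bounded even when n is large.
fit <- clara(x, k = 150, samples = 50, sampsize = 1000)
table(fit$clustering)[1:10]            # sizes of the first few clusters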

Regarding machine learning, I have found that neural networks tend not to converge to a global minimum when I set up too many hidden nodes.

I have had more luck using the projection pursuit regression (ppr) algorithm when trying to model complex structures.
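
For the ppr case, a rough sketch with toy data (ppr ships with the stats package; the surface below is just an example of a "complex structure"):

# Toy nonlinear response, not the actual problem I modelled.
set.seed(2)
n  <- 10000
x1 <- runif(n); x2 <- runif(n)
y  <- sin(2 * pi * x1) * x2 + rnorm(n, sd = 0.1)

# Projection pursuit regression fits a sum of ridge functions, which
# often captures interactions that a single-hidden-layer net misses.
fit <- ppr(y ~ x1 + x2, nterms = 2, max.terms = 5)
summary(fit)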
Do you have more details (time to solution, total memory in your machine, etc.)?

I just ran a couple of simple experiments and I can't even get a 1-million-point lm to complete on a 2 GB machine. I want to get a sense of how fast or slow R is compared to, say, C++.
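For concreteness, the experiment was along these lines (simulated data; actual timings will depend on the machine and on the number of predictors):

# Simulated stand-in for the 1-million-point regression.
set.seed(3)
n  <- 1e6
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 1 + 2 * df$x1 - 0.5 * df$x2 + rnorm(n)

# lm() builds the full model matrix in RAM, which is where a 2 GB
# machine starts to hurt as n and the number of columns grow.
system.time(fit <- lm(y ~ x1 + x2 + x3, data = df))
coef(fit)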
You might be interested in the ff package, which handles large vectors better than normal R objects. I haven't used it myself yet, but I found the following poster to be pretty informative: http://user2007.org/program/posters/adler.pdf. It uses flat files rather than the everything-in-RAM model.

Their CRAN page had more recent info.
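From what I can tell from the poster, basic usage is roughly the following (I have not run this myself; the file name and sizes are placeholders):

library(ff)

# A 10-million-element double vector backed by a flat file on disk
# rather than by RAM; chunks are paged in as they are accessed.
x <- ff(vmode = "double", length = 1e7)
x[1:5] <- rnorm(5)

# Whole data frames can be kept on disk the same way.
dat <- read.csv.ffdf(file = "bigdata.csv", header = TRUE)
dim(dat)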
Excellent reference, thank you. I did see the flat-file package, but didn't put two and two together. Memory-mapping a data set is one aspect; reformulating algorithms so that they work on shards is another.

Anyone know of streaming extensions in R to take in live data feeds?
I've heard about something called R Clusters (if I remember correctly), where R processing is distributed across several machines.
It is unfortunate that the word "cluster" is ambiguous, particularly in R: inside R I can only find help on statistical clustering.

I did find the following reference:

SNOW: Simple Network of Workstations

It allows you to program your own cluster algorithm, which in my mind is the wrong approach: the cluster execution must sit as an operator underneath the language. It doesn't really matter, since I cannot find SNOW in any of the CRAN archives anyway, despite what the documentation/papers say.

There is a proof of concept out there in REvolution Computing's modification of R to seamlessly build distributed data structures. They provide the above abstraction with commands that execute in parallel, such as bootNWS for a parallel bootstrap or forestNWS for parallel classification.
I found the snow package on the first CRAN mirror I looked at:

http://mirrors.ibiblio.org/pub/mirrors/CRAN/web/packages/snow/index...

I tried setting this up on a small cluster of Amazon EC2 Ubuntu images one time with little luck, though I didn't put much time into it. This is something I'd really like to get working at some point.
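For anyone who wants to try the same thing, a minimal snow session looks roughly like this (a socket cluster of local workers; on EC2 you would pass the worker host names instead of a count):

library(snow)

cl <- makeCluster(4, type = "SOCK")   # 4 local worker processes

# Scatter a simple bootstrap-style job across the workers.
res <- parLapply(cl, 1:100, function(i) {
  mean(sample(rnorm(1e5), replace = TRUE))
})

stopCluster(cl)
summary(unlist(res))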
Ian:

Continued research on this has surfaced a couple of interesting activities.

- The OLAP/data warehouse space has a handful of startups that deliver high-performance column-oriented databases. The list I have so far is:

  - Netezza: hardware-accelerated SQL processor with lots of I/O parallelism
  - KickFire: also hardware-accelerated SQL, but in addition a very large memory configuration using MetaRAM that effectively holds the database in memory
  - Vertica: software-only column-oriented database that already runs on EC2/S3
  - DataScaler: stealth-mode startup that appears to do the same as KickFire

Also, I got deeper into R, and I am pretty much convinced that the R engine itself needs to be modified to do web-scale data mining through its language. We have a proof of concept here in the form of Revolution Computing, but I don't like their back-end technology since I think it is not solving the right problem.

Finally, I made some headway on the business side as well. It appears that the companies that have a big problem to solve are also constrained by the IT reliability curse. Stated otherwise, their IT infrastructure is large, complex, and mostly commercial (Oracle, SAP, Microsoft, IBM). So there are two problems to overcome: first, how to get approval to move data out of these systems into a cloud database, and second, how to manage the aggregation given that the data needs to come from a multitude of systems. In this regard, I found Cast Iron Systems, which offers an on-premise or cloud aggregation service.

So far, we have not found the right decision makers at any Fortune 500 company who would enable us to demonstrate the 10x cost reductions that we believe we can deliver. I think we have a handle on the technology; our biggest hurdle is to find a company that has a big enough problem to solve, and I mean dollars here. For example, we did a comparison against a Google search appliance, and for a typical mid-market archival solution we were able to reduce the complexity from a 60-node cluster to a 3-node cluster.

It would be interesting to see whether, by posting this analysis here on AnalyticBridge, we can develop connections to these decision makers. So here is the request to the AnalyticBridge community: if anyone is connected to innovative CIOs who have complex data mining problems to solve, please let them know of our research. I am sure that once they become aware of what is possible, they will quickly see the cost savings they can realize.

Theo
My friend Paco and I had a very interesting discussion on this topic when we last met in Seattle.
Dear Omtzigt,
I hope this will help you.
All my best
