None of the tools mentioned in this poll are useful enough for my projects. They are all 2-3 generations behind Qlikview and (to a lesser degree) Spotfire (Spotfire includes S+ and an interface to R, so you can add R to my list as well). However, I recently started using Excel 2010 with PowerPivot, which is a very good front end for SSAS (SQL Server 2008 R2 Analysis Services). For some clients who do not need advanced data visualization, multidimensional cubes and PivotTables can be enough; price-wise they are less expensive than Qlikview and Spotfire, and much more productive than so-called "free" or "open source" tools. I regard my time as a more expensive resource than the money spent on my tools.
How were the 8 named tools selected to be the response items in this poll? It seems odd to me that some of the tools that rank highest in other polls and surveys are not listed by name here (e.g., SPSS, SAS, Weka, and Clementine/Modeler). Unfortunately, if the two OTHER response categories draw a large number of responses, the results of this poll could become uninterpretable. E.g., if 50% of respondents select the OTHER category, we'll never know whether very large numbers of users of these other tools are hidden in those responses. It's even possible that a few of the most commonly used tools are hidden there.
I suggest looking at several years of the results of KDnuggets polls and Rexer Analytics Annual Data Miner Surveys BEFORE selecting the response items in polls like this one.
I like polls, Vincent, but I'm sorry to say that I think there's been a mistake.
When I follow your link, the top 10 items I see in the KDnuggets poll are:
- 38% - Rapidminer
- 30% - R
- 24% - Excel
- 19% - KNIME
- 18% - Your own code
- 14% - Pentaho / Weka
- 12% - SAS
- 9% - MATLAB
- 8% - IBM SPSS Statistics
- Tie for 10th place: 7% for IBM SPSS Modeler and 7% for "Other free tools"
However, in this AnalyticBridge poll, I see THREE items that are NOT in the KDnuggets top 10:
- RapidInsight - 0% in KDnuggets poll
- StatSoft / Statistica - 6% in KDnuggets poll (13th place)
- SAS Enterprise Miner - 5.5% in KDnuggets poll (15th place)
And the AnalyticBridge poll is missing THREE items that ARE in the KDnuggets top 10:
- Pentaho / Weka - 14% in KDnuggets poll (6th place)
- SAS - 12% in KDnuggets poll (7th place)
- IBM SPSS Statistics - 8% in KDnuggets poll (9th place)
(I can see why Clementine needed to be dropped (tied for 10th place), so an OTHER category could be created.)
And "Your own code" is combined together with "other" in the AnalyticBridge poll, even though it got 18% in the KDnuggets poll (5th place).
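For anyone who wants to check my reading, the comparison above can be written as a pair of set differences. This is only a rough sketch: the KDnuggets names are the top 10 as I listed them, while the AnalyticBridge set is an assumption reconstructed from the differences described above, not copied from the poll page itself.

```python
# Top 10 named items in the KDnuggets poll, as listed above.
# The 10th-place tie "Other free tools" is not a named tool and is left out.
kdnuggets_top10 = {
    "Rapidminer", "R", "Excel", "KNIME", "Your own code",
    "Pentaho / Weka", "SAS", "MATLAB", "IBM SPSS Statistics",
    "IBM SPSS Modeler",
}

# ASSUMPTION: the 8 named AnalyticBridge items, inferred from the
# differences described in this comment rather than read off the poll.
analyticbridge_items = {
    "Rapidminer", "R", "Excel", "KNIME", "MATLAB",
    "RapidInsight", "StatSoft / Statistica", "SAS Enterprise Miner",
}

# Items named in AnalyticBridge that are not in the KDnuggets top 10.
only_in_analyticbridge = sorted(analyticbridge_items - kdnuggets_top10)

# KDnuggets top-10 items that AnalyticBridge does not list by name.
missing_from_analyticbridge = sorted(kdnuggets_top10 - analyticbridge_items)

print(only_in_analyticbridge)
print(missing_from_analyticbridge)
```

Under that assumption, the second list contains the three missing items plus IBM SPSS Modeler (Clementine) and "Your own code", which are the two cases discussed separately above.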
FYI, the Rexer Analytics Annual Data Miner Survey has a question that is worded slightly differently, but the top ranking tools in the 2009 Survey were:
- 47% -- SPSS -- 249 respondents
- 38% -- R -- 201 respondents
- 38% -- SAS -- 199 respondents
- 31% -- SPSS Clementine -- 163 respondents
- 30% -- Matlab -- 161 respondents
- 27% -- Statistica (Statsoft) -- 142 respondents
- 27% -- Microsoft SQL Server (data mining functions) -- 141 respondents
- 25% -- Weka -- 132 respondents
- 23% -- SAS Enterprise Miner -- 121 respondents
- 20% -- Mathematica -- 104 respondents
- 16% -- Rapid Miner (formerly YALE) -- 87 respondents
- 15% -- C4.5 / C5.0 / See5 (RuleQuest Research) -- 78 respondents
- 13% -- Oracle Data Mining -- 69 respondents
- 13% -- Business Objects / NetWeaver (SAP) -- 69 respondents
- 13% -- Minitab -- 69 respondents
- 10% -- Cognos -- 53 respondents
- 13 more tools were selected by 2-9% of respondents.
- A couple dozen more tools were selected by < 2% of respondents.
N=529 (137 Academic + 392 Industry) Respondents could select multiple tools.
Software tool vendors were removed for this analysis.
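As a quick sanity check on the figures above, each percentage should simply be the respondent count divided by N=529. A minimal sketch, assuming the percentages were rounded to the nearest whole percent (the rounding rule is my assumption; the counts and percentages are copied from the list):

```python
N = 529  # total respondents reported for the 2009 survey

# (tool, reported percent, reported respondent count), copied from the list above
reported = [
    ("SPSS", 47, 249), ("R", 38, 201), ("SAS", 38, 199),
    ("SPSS Clementine", 31, 163), ("Matlab", 30, 161),
    ("Statistica (Statsoft)", 27, 142), ("Microsoft SQL Server", 27, 141),
    ("Weka", 25, 132), ("SAS Enterprise Miner", 23, 121),
    ("Mathematica", 20, 104), ("Rapid Miner", 16, 87),
    ("C4.5 / C5.0 / See5", 15, 78), ("Oracle Data Mining", 13, 69),
    ("Business Objects / NetWeaver", 13, 69), ("Minitab", 13, 69),
    ("Cognos", 10, 53),
]

def pct(count, total=N):
    """Share of respondents, rounded to the nearest whole percent."""
    return round(100 * count / total)

# Any tool whose recomputed percentage differs from the reported one.
mismatches = [(tool, p, pct(c)) for tool, p, c in reported if pct(c) != p]

# Selections across the listed tools; this exceeds N because
# respondents could select multiple tools.
total_selections = sum(c for _, _, c in reported)

print(mismatches)
print(total_selections, N)
```

The percentages add up to far more than 100% because each respondent could pick several tools, so they should be read as "share of respondents who used this tool", not as market share.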
In a separate question, the Rexer Analytics surveys also ask respondents what their one primary data mining tool is. These annual surveys also ask about many other topics (algorithms, top challenges, impact of economy, goals, fields, etc). The 2010 Data Miner Survey is still open now, and over 800 data miners globally have participated so far. If any AnalyticBridge readers haven't participated yet in the 2010 Data Miner Survey, please see Vincent's May 25th AnalyticBridge posting about it: http://www.analyticbridge.com/forum/topics/data-mining-survey, or go directly to the Survey link: www.RexerAnalytics.com/Data-Miner-Survey-2010-Intro2.html.
The free 48 page Summary Report for the 3rd Annual Data Miner Survey is available now. If you want to receive a copy, just email us at [email protected]
I believe that your position #1 for SPSS could mean that your sample was drawn from a slightly different population: more CRM / survey or market research people, who tend to use SPSS more than other analytics professionals. Of course, many of the people who answered "other" in our poll could be SPSS users.
Since this topic (Which data mining/analytic tools you used in the past 12 months for a real project?) creates a lot of interest, we might run another survey with more choices, and target our entire network (25,000 - 2x the size of KDNuggets) rather than just AnalyticBridge members (7,500).
The purpose of our poll was to add a dimension to the KDnuggets poll: a breakdown per country and state. And also, to check whether our numbers are consistent with the KDnuggets statistics.
I like the ideas of regional breakdowns and seeing how the larger member group would respond. Great ideas.
Yes, I agree with you, our Rexer Analytics Data Miner Survey sample is probably somewhat different. I agree with the sample characteristics that you mention, and I'd add that perhaps the Rexer sample is also less IT-focused than AnalyticBridge's.
Since the sample and question were a bit different from the AnalyticBridge question, I was just posting the Rexer survey results as a point of comparison. It is totally separate from my main point about the AnalyticBridge 10 items not matching my reading of the KDnuggets top 10 items.
Yes, I've also found that software questions always seem to raise lots of interest and discussion. I'm not sure why. I wish that survey info about other analytic topics would engage people in discussion as much.
I think that AnalyticBridge has done a great job of creating a community and engaging a very large group in ongoing discussions on a variety of analytic issues.
OK, the EXCEL reference did it. I know the question asks what dm tool you've used in the past 12 months for real projects. However, with this astute dm audience, I did not want to stay too silent for too long.
I'm a vendor. I head up the Product Management for big bad Oracle's (the company who has been gobbling up every other application sw company out there and who owns nearly 50% of all data stored in relational databases) data mining technologies.
Over the past 11 years, since Oracle's acquisition of Thinking Machines Corporation, our Data Mining Technologies Dev. team has been steadily "stem-celling" traditional and cutting edge data mining algorithms (e.g. clustering, GLM regression, decision trees, SVMs, association rules, text mining, ability to mine star schemas, push-down SQL scoring to Exadata storage layers, etc.) and statistical functions (e.g. t-test, F-test, ANOVA, Pearson's, etc.) inside the SQL kernel of the Oracle Database.
While this may not be of interest to our Excel user friends, it is of significant interest to most large corporations who manage their data inside a relational database and who are tired of paying SAS's high annual usage fee (AUFs) ransoms. Oracle's RDBMS strategy has been to "move algorithms to the data" rather than the traditional "move the data to the algorithms" approach. As the volumes of data explode, and as you become aware of data security issues and/or want to deploy your new business intelligence throughout the enterprise, it just makes better sense.
Basically, the Oracle Data Mining (Option to the Oracle Database Enterprise Edition - what most corporations use) algorithms are built totally inside the SQL kernel of the Database, so you can think of our having expanded SQL's vocabulary beyond SELECT, WHERE, GROUP BY, ORDER BY, etc. to add PREDICT, DETECT, CLUSTER, CLASSIFY, ASSOCIATE, etc. That's a bit of a simplification, but generally accurate.
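To make the "move algorithms to the data" idea concrete, here is a toy sketch. It uses SQLite from Python purely as a stand-in, with a hypothetical `sales` table and a plain `AVG` aggregate in place of a real mining algorithm; it is not Oracle Data Mining syntax and does not show any Oracle API.

```python
import sqlite3

# In-memory SQLite database as a stand-in for "the database".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")  # hypothetical table
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 30.0), ("west", 5.0)],
)

# "Move the data to the algorithm": fetch every row, compute in the client.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
by_region = {}
for region, amount in rows:
    by_region.setdefault(region, []).append(amount)
client_means = {r: sum(v) / len(v) for r, v in by_region.items()}

# "Move the algorithm to the data": the database does the computation
# and only the small result set crosses the wire.
db_means = dict(
    conn.execute("SELECT region, AVG(amount) FROM sales GROUP BY region")
)
conn.close()

print(client_means)
print(db_means)
```

Both approaches give the same answer here; the difference is that the second never ships the raw rows out of the database, which is what matters once the table has billions of rows or holds data that should not leave the DBA's control.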
Why is this important? First of all, it simplifies EVERYTHING, gives data analysts better and more productive access to more data (under the control of the DBA so you get their blessing and support) and all your ODM models and results remain inside the database. Sure, you can move them, grant privs for someone else to use them, etc. What's more interesting is how this next-generation architecture opens new doors for data analysts to build, embed and deploy predictive analytics applications (I'll call them PA Appls). That's what we've been working on at Oracle and that's why you see the major traditional dm/stats vendors e.g. SAS and SPSS on multi-year journeys to embed their traditional "numerical recipes" inside other people's (excluding SPSS + IBM) databases. We've already announced Oracle Sales Prospector, Oracle Spend Classification, Oracle Retail Data Model, Oracle Communications Data Model PA Appls and stay tuned. At Oracle World in September, you will see more unveiled. Also, get ready for the totally new Oracle Data Miner 11gR2 workflow GUI http://blogs.oracle.com/datamining/2010/02/get_ready_for_the_new_or... that ships (for free download from the Oracle Technology Network) w/ SQL Developer 3.0 this Sept.
Love the idea of this "crowd sourced" poll. However, the "Other" category is much too broad to be useful. SPSS, an IBM company, should stand as its own category, as should Oracle -- and Microsoft. Also, there should be a generic SQL choice. You'd be surprised how commonly-used SQL is for "data mining". What about Spotfire, which combines lots of regression capabilities with its visualizations?