I want to know your experience on what programming language to pick up when you are a data miner/statistician in a full-time job. Some companies ask for a background in C++, Perl, or Visual Basic. Thanks.
I am not talking about packages like SAS, which I use every day. I am thinking of a real programming language that you can use to code different data mining algorithms. Any ideas? Thanks.
It would depend on what activities your job requires.
If you do Text Mining, you will need Perl.
If you do graphics, you would need Java or C++ with the appropriate libraries.
Some shops use Excel for many things, even for advanced modeling. For that,
you would need VBA/Excel skills.
If you are a student now, looking for a good bet on what to learn for your indeterminate
You will also want to be able to program using the .NET paradigm, since
a lot of interesting stuff happens across distributed systems.
For your statistics, if your business has money for licenses, SPSS Clementine or
SAS Enterprise Miner would be nice. SPSS comes in a student edition ($220),
but SAS does not. Researchers in Data Mining with no budget use WEKA and/or R.
They add on their own code. Java is great for this extendable class of tools.
Thanks for your reply. I am currently working full time as a statistical analyst. I use SAS/STAT and SAS/ETS for my work, but SAS EM is too expensive for our company to license. I want to do market basket analysis and have tried WEKA and R, but both ran into memory limitations. Could you tell me what to do next?
I am getting my PhD in OR from UW-Madison. Hopefully I can finish it by the end of this year.
By the way, how are the data mining courses at Central Connecticut State University? Do you like them, and would you recommend them to me? I am looking for an online program like that.
Ruby and Python are becoming popular. I use Perl a lot, for text mining and web crawling, as well as for advanced data mining. I've written sophisticated approximate logistic / ridge regression procedures in Perl that run very fast, are easy to code, read, test and document, and produce the same lift as traditional algorithms. But you need to be an expert in numerical analysis to do that.
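For readers wondering what coding such a procedure by hand involves, here is a minimal sketch of ridge-penalized logistic regression fitted by plain batch gradient descent. It is written in Python rather than Perl, and it is not the approximate procedure described above (whose details are not given). The function names and the default values for `l2`, `learning_rate` and `epochs` are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, x):
    # weights[0] is the bias; weights[1:] line up with the features in x.
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


def fit_ridge_logistic(rows, labels, l2=0.1, learning_rate=0.1, epochs=500):
    """Fit weights [bias, w1, ..., wk] by batch gradient descent.

    rows: list of feature lists; labels: list of 0/1 values.
    l2: ridge penalty strength, applied to every weight except the bias.
    (Illustrative defaults, not tuned values.)
    """
    n, k = len(rows), len(rows[0])
    weights = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for x, y in zip(rows, labels):
            err = predict_proba(weights, x) - y
            grad[0] += err
            for j, v in enumerate(x, start=1):
                grad[j] += err * v
        for j in range(k + 1):
            penalty = l2 * weights[j] if j > 0 else 0.0
            weights[j] -= learning_rate * (grad[j] / n + penalty)
    return weights
```

For example, `fit_ridge_logistic([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])` returns a bias and one slope. A production version would also want feature scaling and a convergence check, which this sketch leaves out.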
google perl tutorial
google perl datamining tutorial
I guess you are able to find them yourself.
Language choice IMHO depends on 2 factors:
1. (primary) What suits you. C (C#/C++/plain C), Java, Python and Perl are good choices for everything.
Matlab/SAS/SPSS+Python extension are much, much easier for data mining (try to code logistic regression in C: it's quite hard and a waste of time).
2. (secondary) What suits your company's infrastructure. Sometimes you integrate code into your company's system, so it's good when you both use the same language, e.g. if they use .NET, don't choose Java.
Thanks. The next question would be: for people like me who have no access to SAS EM or SPSS Clementine, what would be the next step in order to do something like market basket analysis? Can I just pick up Perl and code the algorithm in it? I tried market basket analysis using R and WEKA, but I ran into memory limitations and couldn't get it to run. Thanks.
Unashamedly biased view from a StatSoft employee ...
If your company has some budget for this work you could take a look at STATISTICA Data Miner's Sequence & Link Analysis module. It is cheaper than the other commercial alternatives you mentioned and could save you days or weeks of coding time.
WEKA/R and all other tools have limitations, but there are always ways to work around them,
e.g. data preparation: group by, sampling, piecewise (per-partes) data analysis.
Why do you think your Perl code will not have limitations? :)
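As one illustration of the piecewise idea, here is a minimal Python sketch of Apriori-style pair counting for market basket analysis. It streams the transactions twice instead of loading them all at once, and in the second pass it counts only pairs built from items that were frequent on their own. The function names, the one-basket-per-line comma-separated file layout, and the `min_support` threshold are assumptions for illustration, not something from the posts above. Memory use still grows with the number of distinct frequent items, so this reduces the limitation rather than removing it.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(read_transactions, min_support):
    """Two-pass, Apriori-style counting of item pairs.

    read_transactions: zero-argument callable returning a fresh iterator of
        baskets (each an iterable of item labels), so the data can be
        streamed twice without being held in memory.
    min_support: minimum number of baskets an item or pair must appear in.
    """
    # Pass 1: count how many baskets contain each single item.
    item_counts = Counter()
    for basket in read_transactions():
        item_counts.update(set(basket))
    frequent_items = {item for item, n in item_counts.items() if n >= min_support}

    # Pass 2: count only pairs whose members are both frequent items.
    pair_counts = Counter()
    for basket in read_transactions():
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


def read_baskets_from_file(path):
    """Return a callable that streams one comma-separated basket per line.

    The file layout is a hypothetical example format.
    """
    def reader():
        with open(path) as handle:
            for line in handle:
                items = [field.strip() for field in line.split(",") if field.strip()]
                if items:
                    yield items
    return reader
```

With a hypothetical file `baskets.csv`, a call might look like `frequent_pairs(read_baskets_from_file("baskets.csv"), min_support=100)`. Raising `min_support` shrinks the set of candidate pairs, which is the main lever on memory use.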
The last time I saw some reliable pricing information, a single-user license of SPSS Clementine was 2 or 3 times the price of STATISTICA Data Miner and SAS-EM is a great deal more expensive, despite the fact that you get a much wider range of analytical functionality in STATISTICA.
You would need to contact your local StatSoft office to get a proper quote for STATISTICA - you can then make the comparison yourself. We have standard list prices and don't quote according to what we think the customer will be prepared to pay. If you are in the US, contact statsoft.com, or send me a private message and I will put you in touch with someone there.
Thanks Matt. I am in the US and I have heard that STATISTICA is way cheaper than other packages, e.g. SAS and SPSS, but has more functionality. I would love to have it at my fingertips, but my company has frozen all extra expenses and thinks that SAS/STAT alone is sufficient. I would love to try it out though. What is your email address?