A Data Science Central Community
After using R on a lot of analytics projects, believing it was the best language for data scientists, I recently had the chance to pick up Python. R does seem a bit cumbersome when it has to interface with other languages or with the web, OAuth being one example. That was my motivation: use Python to get text from the web and later process it in R, which I felt was the "best" tool for the job.
However, Python surprised me not only with its web interfacing abilities but also with its analytical features. At a lot of points it got me wondering why I was still using R when Python could do the same thing so much more elegantly.
So here are some points where I found Python really useful. In a way, this is my version of an answer to the question Python vs R:
1. Interfaces - As I mentioned before, the number of interfaces and wrappers available in Python is huge compared to R. (E.g.: Apache Spark has a direct Python interface, while with R you'd need to configure a wrapper named SparkR.) In some cases, though, R is pretty good: Jeff Gentry's twitteR package, for example, is amazing.
2. Handling Large Data - This is one problem all R programmers face, and everyone seems to talk about RAM at some point. One option is to use H2O. I didn't find it very easy to use, though it's much easier than the typical Big Data frameworks. With Python, not only do you have more interfaces to big data, but you also have more options for reading data, including reading a CSV line by line so the whole file never has to sit in memory. That could be used to build amazing algorithms such as the one that Google built for CTR prediction (see Google's whitepaper).
3. The code - I remember people talking about the learning curve in R. With Python, the syntax is so readable that it almost feels like you've been given the ability to run algorithm descriptions or pseudocode. Warnings are a bit limited, though, and you can still build infinite loops in Python. The data types in Python are a bit more primitive, and you feel the need for something like a data frame. This is where the Python package "pandas" comes in. Pandas gives you R-like (sometimes better) flexibility with data frames. One thing I didn't like about Python, though, was its interface for installing packages. You can't install them easily through any of the IDEs. There is a tool named "pip" which you can use from the command line. Also, unlike R, Python is split between v2.7.x, where most of the present packages run, and v3.4, where nothing really runs yet but which everyone is expected to start using eventually.
4. The models - Finally, this is really what we praise R for: the available tools such as random forests, gradient boosting, GLM and GAM. While these were built in R earlier, Python has the package "scikit-learn", which provides models like these (I haven't explored this exhaustively, but all the models I typically use are available in Python). In addition, changing Python source code is much easier in case you want to build custom models on top of those already built. What amazed me was the Python visualizations. These are as good as R's, and you can also choose a custom plotting tool such as Qt.
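To make point 2 a bit more concrete, here is a minimal sketch of reading a CSV one row at a time using only Python's standard library. The column names ("ad_id", "clicks") and the helper name running_mean are made up for illustration, and a small in-memory string stands in for a large file; this is not the method from Google's whitepaper, just the streaming idea on its own.

```python
import csv
import io


def running_mean(lines, column):
    """Stream CSV rows one at a time and return the mean of one numeric column.

    `lines` can be any iterable of text lines, e.g. an open file handle,
    so only the current row is held in memory at any moment.
    """
    reader = csv.DictReader(lines)  # first line is treated as the header
    total = 0.0
    count = 0
    for row in reader:
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")


# Hypothetical sample data standing in for a file too large for RAM.
# With a real file you would pass open(path, newline="") instead.
sample = io.StringIO("ad_id,clicks\n1,3\n2,5\n3,10\n")
mean_clicks = running_mean(sample, "clicks")
print(mean_clicks)
```

Because the loop only keeps a running total and a count, memory use stays flat no matter how many rows the file has, which is exactly what you want when RAM is the bottleneck.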
Overall, I am now using Python in places where I find R cumbersome, and I have also started using it wherever it's convenient. I'm new to Python and have mostly posted the good things I found about it. I'll possibly write about the shortcomings in subsequent posts as I explore further.