A Data Science Central Community
Are there any books about popular programming languages (Python, Java, Hadoop, SQL, R, etc.) that every data scientist should know? I'm talking about a 500-page book with about 100 pages per language, presented very concisely, that also discusses how these languages interact.
Anybody interested in writing such a book? I think the potential for revenue is high. So far, multi-language books focus on specific topics and specific programming language environments, e.g.
The only book I found that is truly a book on multiple languages is "Handbook of Programming Languages" (4 volumes, 1998 so it's quite old now) and it only had two reviews on Amazon - one of them was very negative.
Do programming language books need to be like natural language books - focusing on just one language? I understand that a book about how to learn Spanish, French, German and Russian might not be successful, but what about how to learn Unix, Excel, R, SQL, Python, Java and Hadoop?
PS: I plan to write 10 pages on programming languages in my training manual to become a data scientist, but it will be very concise and most likely point to external references. This booklet (once written) is not an answer to the problem discussed here.
I think that there is a common construct to learning all programming languages, and that is adopting a logical methodology in which most of the preparation is done before one ever writes a line of code.
Pragmatic Programmers has a book about learning Python, written by educators from the University of Toronto, that comes close to your goal (I can't remember the title right now). And Simon Allardice has an excellent tutorial on Lynda.com about programming fundamentals.
It would be nice to have a book that shows how to integrate R, SQL, and Python together seamlessly -- that is, explain when to use which language and how to get them to play together nicely.
Sounds like a great idea. The book should also cover functional and logic languages, such as:
- Haskell: http://tryhaskell.org/
- Scala: http://www.scala-lang.org/
---- and GPU Programming in:
- OpenCL: http://www.khronos.org/opencl/
And Microsoft Base Database and Analysis Technologies - MS SQL Server 2012/2008R2
- SSRS (SQL/RDL, SQL Server Report Builder 3 (and above))
- SSAS (MDX, DMX, XMLA)
- PowerPivot and DAX within MS Excel
- VBA & VSTO
My recommendation would be one or more books on a single given language. Too many languages in one book is just window shopping through a blurry window ... little to no added value. You may want to create an ebook based on HTML5 technologies that goes beyond the plain vanilla of printed books: computer books that will also grab the eyes and minds of the next generation of STEM students. A zygotebody-like (http://www.zygotebody.com/#nav=-2.84,105.12,225.73) series of computer books.
Just my two cents.
There definitely needs to be a "rosetta stone" (the real one, not the spoken language company) for analytical languages -- to convert between them. E.g. coming from a SQL background, when learning R, it took me a while to figure out how to do "what was easy in SQL" in R. As an example, how to do "GROUP BY HAVING" in R, which I describe in
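To make the "rosetta stone" idea concrete, here is a minimal sketch of the GROUP BY ... HAVING translation, written in Python rather than R. The table, column names and sample rows are made up for illustration, and SQLite is only a stand-in for whatever database you actually use.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (customer, order amount).
orders = [
    ("alice", 120.0),
    ("bob", 35.0),
    ("alice", 80.0),
    ("carol", 15.0),
    ("bob", 40.0),
]

def totals_sql(rows, threshold):
    """SQL version: group by customer, keep groups whose total exceeds threshold."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT customer, SUM(amount) AS total "
        "FROM orders GROUP BY customer "
        "HAVING SUM(amount) > ? ORDER BY customer",
        (threshold,),
    ).fetchall()
    conn.close()
    return result

def totals_python(rows, threshold):
    """Plain Python version of the same query."""
    totals = defaultdict(float)
    for customer, amount in rows:
        totals[customer] += amount  # GROUP BY customer, SUM(amount)
    # HAVING is just a filter applied after the aggregation step.
    return sorted((c, t) for c, t in totals.items() if t > threshold)

print(totals_sql(orders, 50))     # [('alice', 200.0), ('bob', 75.0)]
print(totals_python(orders, 50))  # [('alice', 200.0), ('bob', 75.0)]
```

The point of the side-by-side is that HAVING is nothing more than a filter on the aggregated result, which is the step people coming from SQL often look for in other languages.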
A lot of people also have strong Excel skills, so a book on how to do Excel-type things in other languages would be useful. E.g. pivot tables, identifying outliers, etc.
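As a rough illustration of those two Excel-type tasks, here is a sketch in Python using only the standard library. The sales rows are invented, and the 1.5 x IQR rule is just one common convention for flagging outliers, not the only one.

```python
import statistics
from collections import defaultdict

# Hypothetical sales records: (region, quarter, amount).
sales = [
    ("north", "Q1", 100), ("north", "Q2", 150),
    ("south", "Q1", 80), ("south", "Q2", 95),
    ("north", "Q1", 20),
]

def pivot_sum(rows):
    """Pivot table: regions as rows, quarters as columns, summed amounts as values."""
    table = defaultdict(lambda: defaultdict(int))
    for row_key, col_key, value in rows:
        table[row_key][col_key] += value
    return {r: dict(cols) for r, cols in table.items()}

def iqr_outliers(values, k=1.5):
    """Flag values more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

print(pivot_sum(sales))
# {'north': {'Q1': 120, 'Q2': 150}, 'south': {'Q1': 80, 'Q2': 95}}
print(iqr_outliers([10, 12, 11, 13, 12, 95]))  # [95]
```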
Regex is a commonality amongst most of that list, and a multi-language book could leverage a single regex chapter to cover its various implementations (sed/awk, R, Java, etc.)
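A small sketch of what such a shared regex chapter could build on, shown in Python. The date pattern and sample text are made up. The pattern deliberately sticks to bracket classes and counted repetition, which as far as I know are accepted with little or no change by `grep -E`/`sed -E`, R and Java, but check each tool's documentation before relying on that.

```python
import re

# Hypothetical pattern: ISO-style dates such as 2013-04-27.
# [0-9] is used instead of \d to stay close to POSIX extended syntax.
DATE_PATTERN = r"([0-9]{4})-([0-9]{2})-([0-9]{2})"

def find_dates(text):
    """Return a (year, month, day) tuple of strings for every date found in text."""
    return re.findall(DATE_PATTERN, text)

print(find_dates("Posted 2013-04-27, updated 2013-05-02."))
# [('2013', '04', '27'), ('2013', '05', '02')]
```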
And how to create interactive models as in Excel: when you change one (or more) cell values in a spreadsheet, the whole model automatically recomputes itself in real time (though it can be slow with large spreadsheets, sometimes as long as 4 minutes), including goodness of fit, regression coefficients, plots etc. Are there other environments offering this nice Excel feature?
Such a comparative programming language book would have to get into the competing philosophies of imperative vs. functional vs. constraint vs. data flow vs. query/4GL. It's "deep philosophy", but I think the audience is up for it, and hiding such "details" would only muddle the matter.
Excel is somewhat like constraint or functional, and could be translated to other constraint or functional languages, and also to a limited extent to data flow languages.
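To show what "Excel as a functional language" might look like, here is a toy sketch in Python where a cell is either a constant or a function of the other cells. The `Sheet` class and the cell names are hypothetical. Formulas are simply re-evaluated on every read, so this has no dependency tracking, caching or cycle detection like a real spreadsheet engine would.

```python
class Sheet:
    """Toy spreadsheet: each cell holds a constant or a formula (a function of the sheet)."""

    def __init__(self):
        self._cells = {}

    def set(self, name, value_or_formula):
        self._cells[name] = value_or_formula

    def get(self, name):
        cell = self._cells[name]
        # Formulas are re-evaluated on every read, so they always see current inputs.
        return cell(self) if callable(cell) else cell

sheet = Sheet()
sheet.set("price", 10.0)
sheet.set("quantity", 3)
sheet.set("total", lambda s: s.get("price") * s.get("quantity"))

print(sheet.get("total"))  # 30.0
sheet.set("quantity", 5)   # change one input cell...
print(sheet.get("total"))  # 50.0 ...and the dependent cell follows
```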
There is one project similar to a Rosetta Stone for programming languages, called Rosetta Code. I believe it might be helpful.
Also, should we include Matlab, Informatica and SAS? Data science disciplines are still very separated, with Operations Research, Six Sigma, Quant, Statistical Science, Quality Assurance, Risk Management, Decision Science etc. using their own set of programming languages to solve the exact same problems:
- SAS for statisticians
- C++ for Quant
- Matlab for Operations Research
I would say yes. Showing people how to do the same thing in other languages is not a bad idea either, especially if it improves the performance of a model or application.
Vincent, take a look at the following book:
"Seven Languages in Seven Weeks: A Pragmatic Guide to Learning Programming Languages" by Bruce A. Tate
This book is not about analytics but it may provide a valuable insight into what to write about every language.
In our field, in my opinion, Python, R, Java and SQL are the programming languages one needs to know, though at work analysts typically specialize in one language (for me, it is currently Python). My personal preferences also include C# (a rising star that may eclipse C++) and its more functional cousin F#. If I had more time, I would love to try Scala, as one day it may become a successor to Java.
From one of the reviews:
"Outright Disappointment: I wish that the individual chapters went into significantly more depth comparing the motivations for and consequences of each language design. While the key features of each language are demonstrated with annotated code samples and explanatory text little is offered in the way of discussion comparing across language. For example the Scala chapter (selected at random) is on pages 121-166 in the index under "Scala" the only references outside its own chapter are found on pages 302, 303, 305-306, and 308 (all in the final wrap-up chapter). I view this as a real missed opportunity given the books unique approach/content. The final wrap-up chapter seems to be the only place with this sort of cross-language discussion and as a result it is both excellent and much too short."
In contrast, I would think a book on comparative analytical languages might be half chapter-per-language and half chapter-per-task.
I wish we had a book like that for data science, focused on active languages.
Not sure if it is possible to write 10 pages that will teach you about Programming, Computer science, Python, Java, R, SQL, MapReduce, Unix, Excel. But that's my plan.
Maybe it's possible - I once read a 15-page syllabus called "scratch course on time series", written by my mentor at Cambridge University. It contained as much material as a 5,000-page college book aimed at US students. Indeed it contained even more: advanced material that no student will ever learn here, such as extreme value theory and MCMC applied to geo-spatial point processes, all in 15 pages. The first page was about very introductory concepts... It was all very well written and not difficult to read (unlike books written by French scientists).
Also, I once trained a business analyst colleague (familiar with SQL) to write her queries from a Perl script on a Unix machine, rather than using Brio on Windows: the Perl would read the SQL code and produce the output into a text file. She learned basics about Unix, FTP, Putty and Perl in about 15 minutes, and then was able to process her queries 10 times faster, and to process much bigger outputs (previously she was limited to what the browser could hold in memory).
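For anyone curious what that kind of script looks like, here is a minimal sketch of the same workflow in Python rather than Perl: read a SQL statement from a file, run it, and stream the rows into a tab-separated text file. It is not the original script. The function name and file paths are hypothetical, and the connection is assumed to be a `sqlite3` connection standing in for whatever database she was querying.

```python
import csv
import sqlite3

def run_query_to_file(conn, sql_path, out_path):
    """Run the single SQL statement stored in sql_path and write the result,
    tab-separated with a header row, to out_path. Returns the number of data rows."""
    with open(sql_path, encoding="utf-8") as f:
        sql = f.read()
    cursor = conn.execute(sql)
    count = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerow([col[0] for col in cursor.description])  # column names
        for row in cursor:  # rows are written one at a time, not held in memory
            writer.writerow(row)
            count += 1
    return count
```

Usage would be something like `run_query_to_file(sqlite3.connect("mydata.db"), "query.sql", "output.txt")`, where both file names are placeholders. Because rows go straight to disk, the output size is not limited by what a browser can hold in memory.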