A Data Science Central Community
The advent of so many noticeable tools and technologies for handling BigData problems has made the lives of a lot of people and organizations easier. A lot of these are open source, they have good support, good community and are pretty active. But there is another aspect of it. When things become easy, free, with good support and in abundance, we often start to over-utilize them. Having said that, I would like to share one incident.
We organize Hadoop meetups here in Bangalore(India). In one of the initial meetings we just decided to exchange views with each other on how we are using Hadoop, and other related projects. There I noticed that a lot of folks were either using or planning to use Hadoop for problems which could easily be solved using traditional systems. In fact they could be solved in a much better and efficient way. There was absolutely no need to use Hadoop for these kind of problems. So, it raised question in my mind. The question was, are we really getting the 'point'. To me it seems like those folks were trying to stitch a piece of cloth using a sword.
From my experience, I have learned one thing. Even if we have the strongest of weapons we can't win a battle if we are not using it at the right spot at the right time. Same holds good for the industry. Normally we tend to use a particular 'thing' for all our needs, if we find that it had worked for us in the past. There is no harm in it. This is human tendency to try to make things swift. But this doesn't work always. Same is the case when it comes to BigData.
First of all, BigData is not an absolute term. It is rather relative. Relative to the resources that we have. For example 1PB might be big enough for me, but for an internet giant, say Google, it is still not that big. So how to decide whether the data which I am going to handle qualifies to be called BigData or not. The thumb rule is that once you cross the threshold after which you are not able to handle the data, which you have, with the help of resources and system you already have, you can assume that your data has grown into BigData. But, in the process we should always keep one thing in mind. Are we really able to exploit the resources we already have. Not to offend anyone, but I have seen it a couple of times that folks are not using their systems to the fullest and turning towards rather new, and meant for completely different systems, to solve their issues.
For instance if somebody wants to run real time ad-hoc queries over his or her 1TB data set, he or she could do it pretty efficiently using MySQL. Planning to use Hadoop or Hbase in such a situation makes no sense. Moreover it would be wastage of systems and resource, atleast in my view.
Long story short, 'think well before you act'. Analyze your data and the requirements properly and then conclude whether you are really gonna face BigData issues. Because, 'with BigData, comes big responsibilities'.