A Data Science Central Community
Software is written for serial (sequential) computation by default. It is meant to run on a single computer with a single CPU.
A problem is broken down into discrete modules/instructions, which are then executed step by step. For example, to compute (10 + 20 + 30), 10 and 20 are added first, and the result is then added to 30. The problem is thus broken down into multiple discrete steps that are executed one by one.
The case for moving beyond serial computing comes down to one word: speed. The transmission limit of copper wire is 9 cm/nanosecond, and there are only so many things you can do to get around that. Moreover, improving single-processor speeds requires specialized expertise.
In parallel computing, computations are carried out simultaneously. Multiple compute resources are used concurrently to solve a computational problem. The program runs on multiple CPUs and thus makes effective use of all the resources a system has to offer. Now that multi-core CPUs are everywhere, it is essential to understand parallel computing and use it to our advantage.
A problem is broken down into chunks, which are executed in parallel, at the same time.
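To make the idea of chunks concrete, here is a minimal sketch in Python using the standard library's `concurrent.futures`. The helper names `split_into_chunks` and `parallel_sum` and the choice of four workers are my own assumptions for illustration, not part of any particular framework.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(values, n_chunks):
    """Split a list into at most n_chunks contiguous pieces of similar size."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum(values, n_workers=4):
    """Sum each chunk in a separate worker process, then combine the partial sums."""
    chunks = split_into_chunks(values, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this demo on import.
    numbers = list(range(1, 100_001))
    print(parallel_sum(numbers))
```

For a list this small, starting the worker processes will likely cost more than the addition itself, so treat this as an illustration of the pattern rather than a speed-up.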
We have frameworks such as Hadoop for implementing distributed computing architectures, but those operate at the enterprise level. Data scientists have looked very little into using their own system's resources when they are not working with distributed frameworks. If they are working on small datasets or prototyping a model, taking advantage of an easy-to-use parallel framework can significantly speed up their computations. To learn more about setting it up in Python, you can refer to my previous post here.
Parallel overhead can be defined as the amount of time required to coordinate parallel tasks, as opposed to doing useful work. So it is very important to identify whether you need a parallel computing infrastructure in the first place, and what the implications are. Overhead includes factors such as:
The first step in developing parallel software is to understand the problem that you wish to solve in parallel. If you are starting with a serial program, this means understanding the existing code as well.
Before spending time in an attempt to develop a parallel solution for a problem, determine whether or not the problem is one that can actually be parallelized.
Calculation of the Fibonacci series (1,1,2,3,5,8,13,21,...) by use of the formula:
F(k + 2) = F(k + 1) + F(k)
This is a non-parallelizable problem because calculating the Fibonacci sequence as shown entails dependent calculations rather than independent ones. The calculation of the k + 2 value uses the values of both k + 1 and k. These three terms cannot be calculated independently, and therefore cannot be calculated in parallel.
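The dependency is easy to see in code. The following is a minimal sketch of the recurrence (the function name `fibonacci` is my own): each pass through the loop needs the two values produced by the previous passes, so the iterations cannot be handed out to separate workers.

```python
def fibonacci(n):
    """Return the first n terms of the series, starting 1, 1."""
    series = []
    a, b = 1, 1
    for _ in range(n):
        series.append(a)
        # F(k + 2) = F(k + 1) + F(k): the next term depends on both previous terms.
        a, b = b, a + b
    return series


print(fibonacci(8))  # [1, 1, 2, 3, 5, 8, 13, 21]
```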
For each data point, transform a given description column by removing links and email IDs and converting all words to lowercase.
This problem can be solved in parallel, because each data point can be processed independently. Data cleaning/transformation is therefore a phase where parallelization can be considered.
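As a rough sketch of how this could look in Python, the example below cleans each description in a pool of worker processes from the standard library. The regular expressions are deliberately simple, and the names `clean_description` and `clean_descriptions`, the worker count, and the chunk size are all assumptions made for illustration; real data will probably need more careful patterns.

```python
import re
from concurrent.futures import ProcessPoolExecutor

# Simplified patterns, assumed for illustration only.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_PATTERN = re.compile(r"\S+@\S+\.\S+")


def clean_description(text):
    """Remove links and email ids, lowercase the text, and collapse whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = EMAIL_PATTERN.sub(" ", text)
    return " ".join(text.lower().split())


def clean_descriptions(descriptions, n_workers=4):
    """Clean every description independently, spread across worker processes."""
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        # chunksize batches rows per task to reduce inter-process communication.
        return list(pool.map(clean_description, descriptions, chunksize=256))


if __name__ == "__main__":
    rows = [
        "Contact ME at John.Doe@example.com",
        "Visit https://example.com/page NOW",
    ]
    print(clean_descriptions(rows))
```

Because no row depends on any other, the results come back in the same order as the input and can be written straight back into the column.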
References: cover image from http://cdn.phys.org/newman/gfx/news/hires/2014/3-searchingfor.jpg
Originally posted here.