
By default, software is written for serial (sequential) computation: it is meant to run on a single computer with a single CPU.

A problem is broken down into a series of discrete instructions, which are then executed one after another. For example, to compute 10 + 20 + 30, 10 and 20 are added first, and the result is then added to 30. The computation is thus broken down into discrete steps that are executed one by one.
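The 10 + 20 + 30 example can be sketched in Python as follows. The function name `serial_sum` is only illustrative, and the sketch assumes a non-empty list of numbers.

```python
def serial_sum(numbers):
    """Add the numbers one step at a time, keeping each intermediate result."""
    total = numbers[0]
    partial_sums = []
    for n in numbers[1:]:
        # Each addition has to wait for the previous one to finish.
        total = total + n
        partial_sums.append(total)
    return total, partial_sums


if __name__ == "__main__":
    total, partial_sums = serial_sum([10, 20, 30])
    print(partial_sums)  # [30, 60]: first 10 + 20, then 30 + 30
    print(total)         # 60
```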

Serial Computing

Limitations of Serial Computing

In one word: speed. The transmission limit of copper wire is about 9 cm/nanosecond, and there is only so much you can do to speed that up. Moreover, improving the speed of a single processor takes specialized expertise.

Parallel Computing

Computations are carried out simultaneously: multiple compute resources are used concurrently to solve a single computational problem. The program runs on multiple CPUs and thus makes effective use of all the resources a system has to offer. With multi-core CPUs now commonplace, it is essential to understand parallel computing and use it to our advantage.

A problem is broken down into chunks, which are executed in parallel, at the same time.
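As a minimal sketch of this idea, the snippet below splits a list of numbers into chunks, sums each chunk in a separate worker process using Python's standard `concurrent.futures` module, and then combines the partial sums. The helper names `split_into_chunks` and `parallel_sum` are made up for this illustration, and the choice of four workers is arbitrary.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(numbers, n_chunks):
    """Split a list into n_chunks contiguous slices of near-equal size."""
    size, extra = divmod(len(numbers), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(numbers[start:end])
        start = end
    return chunks


def parallel_sum(numbers, n_workers=2):
    """Sum each chunk in its own worker process, then combine the results."""
    chunks = split_into_chunks(numbers, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    # Combining the partial sums is a final, serial step.
    return sum(partial_sums)


if __name__ == "__main__":
    data = list(range(1, 101))
    print(parallel_sum(data, n_workers=4))  # 5050
```

The chunks can be summed independently because no chunk needs another chunk's result; only the last step, adding up the partial sums, has to wait for all of them.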


Frameworks such as Hadoop implement distributed computing architectures, but they are aimed at the enterprise level. Data scientists have paid little attention to making full use of their own machine's resources when they are not using a distributed framework. When working on small datasets or prototyping a model, an easy-to-use parallel framework can significantly speed up computations. To learn more about setting one up in Python, you can refer to my previous post.

Why Parallel Computing?

  • Save Time
  • Solve Larger Problems
  • Provide Concurrency
  • Use of commodity hardware to the fullest
    • Multiple Execution Units
    • Multi-core

Parallelization Overhead

Parallelization overhead is the amount of time required to coordinate parallel tasks, as opposed to doing useful work. It is therefore important to establish whether you need a parallel computing infrastructure in the first place, and what its implications are. Overhead includes factors such as:

  • Task start-up time
  • Synchronizations
  • Data communications
  • Software overhead imposed by parallel compilers, libraries, tools, operating system, etc.
  • Task termination time
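One rough way to see this overhead is to time the same trivial task run serially and run through a pool of worker processes. The sketch below does that with the standard library; the function names and the input size of 1000 are arbitrary choices for illustration, and the timings will vary from machine to machine. For work this small, the parallel run is likely to be slower, because process start-up and data communication cost more than the computation itself.

```python
import time
from concurrent.futures import ProcessPoolExecutor


def square(x):
    """A deliberately tiny task, so coordination cost dominates."""
    return x * x


def timed(fn):
    """Call fn() and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def run_serial(values):
    return [square(v) for v in values]


def run_parallel(values, n_workers=2):
    # Starting workers, sending inputs and collecting outputs is all overhead.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(square, values))


if __name__ == "__main__":
    values = list(range(1000))
    serial_result, serial_time = timed(lambda: run_serial(values))
    parallel_result, parallel_time = timed(lambda: run_parallel(values))
    assert serial_result == parallel_result  # same answer either way
    print(f"serial:   {serial_time:.4f} s")
    print(f"parallel: {parallel_time:.4f} s")
```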

Understanding the problem

The first step in developing parallel software is to understand the problem that you wish to solve in parallel. If you are starting with a serial program, this also means understanding the existing code.

Before spending time in an attempt to develop a parallel solution for a problem, determine whether or not the problem is one that can actually be parallelized.

Example of a Non-parallelizable Problem

Calculation of the Fibonacci series (1,1,2,3,5,8,13,21,...) by use of the formula:

F(k + 2) = F(k + 1) + F(k)

This problem is non-parallelizable because calculating the Fibonacci sequence as shown entails dependent calculations rather than independent ones. Calculating the value at k + 2 uses the values at both k + 1 and k. The three terms cannot be calculated independently and therefore cannot be calculated in parallel.
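The dependency is easy to see when the recurrence is written as a loop. In this small sketch (the function name `fibonacci` is illustrative), every iteration needs the two values produced by the iterations before it, so the iterations cannot be handed to separate workers.

```python
def fibonacci(n):
    """Return the first n terms of the series 1, 1, 2, 3, 5, ..."""
    series = []
    prev, curr = 1, 1
    for _ in range(n):
        series.append(prev)
        # The next term depends on the two terms computed just before it.
        prev, curr = curr, prev + curr
    return series


if __name__ == "__main__":
    print(fibonacci(8))  # [1, 1, 2, 3, 5, 8, 13, 21]
```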

Example of a Parallelizable Problem

Transform a given description column by removing links and email IDs and converting all words to lowercase, for each data point.

This problem can be solved in parallel, because each data point can be processed independently of the others. Data cleaning/transformation is therefore a phase where parallelization can be considered.
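A minimal sketch of such a transformation is shown below, using only the Python standard library. The regular expressions are deliberately simple assumptions about what counts as a link or an email ID, not a complete specification, and the names `clean_description` and `clean_all`, the sample rows, and the worker count are all made up for this illustration.

```python
import re
from concurrent.futures import ProcessPoolExecutor

# Simplified patterns (assumptions): anything starting with http://, https://
# or www. is treated as a link; anything shaped like x@y.z as an email ID.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_PATTERN = re.compile(r"\S+@\S+\.\S+")


def clean_description(text):
    """Remove links and email IDs, lowercase the rest, collapse whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = EMAIL_PATTERN.sub(" ", text)
    return " ".join(text.lower().split())


def clean_all(descriptions, n_workers=2):
    """Clean every description, spreading the rows across worker processes."""
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        # Each row is cleaned on its own, so rows can go to any worker.
        return list(pool.map(clean_description, descriptions, chunksize=100))


if __name__ == "__main__":
    rows = [
        "Visit http://example.com or mail Me@Example.org TODAY",
        "Great Product, see www.example.com for details",
    ]
    for cleaned in clean_all(rows):
        print(cleaned)
```

Because `clean_description` looks at one row and nothing else, the result does not depend on which worker handles which row, or in what order the rows are processed.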

Reference for the cover image: http://cdn.phys.org/newman/gfx/news/hires/2014/3-searchingfor.jpg

