
Software is serial, or sequential, by default: it is written to run on a single computer with a single CPU.

A problem is broken down into a discrete series of instructions, which are then executed one after another. For example, to compute (10 + 20 + 30), 10 and 20 are first added, and the result is then added to 30. The problem is thus broken into discrete steps that are executed one by one.
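The serial breakdown above can be sketched in a couple of lines; each step must wait for the result of the previous one:

```python
# Serial evaluation of 10 + 20 + 30: each step waits for the previous result.
total = 10 + 20        # step 1: 30
total = total + 30     # step 2: 60
print(total)           # 60
```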

Why is serial computing not enough? In one word: speed. Signals travel through copper wire at roughly 9 cm per nanosecond, and there is only so much you can do to speed that up. Moreover, improving single-processor speeds requires deep, specialized expertise.

In parallel computing, computations are carried out simultaneously: multiple compute resources are used concurrently to solve a single computational problem. The work runs on multiple CPUs and thus effectively utilizes all the resources a system has to offer. With multi-core CPUs now the norm, it is essential to understand parallel computing and use it to our advantage to make our lives easier.

A problem is broken down into chunks that are executed in parallel, at the same time.

We have frameworks such as Hadoop for implementing distributed computing architectures at the enterprise level. But data scientists have paid very little attention to utilizing their own system's resources when they are not using distributed frameworks. If they are working on small datasets or prototyping a model, taking advantage of an easy-to-use parallel framework can significantly speed up their computations. To learn more about setting this up in Python, you can refer to my previous post here.
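As a minimal sketch of the chunk-and-combine idea (using Python's standard-library `concurrent.futures` rather than any particular framework; the chunk size and worker count here are illustrative, not prescriptive):

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    """Sum one chunk; each chunk is independent of the others."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1, 101))                           # 1..100, true sum = 5050
    chunks = [data[i:i + 25] for i in range(0, 100, 25)] # 4 independent chunks
    with ProcessPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(partial_sum, chunks))   # chunks summed in parallel
    print(sum(partials))                                 # combine: 5050
```

Each worker sums its own chunk with no knowledge of the others, and only the small partial results are combined serially at the end.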

The main motivations for parallel computing include:

- Save time
- Solve larger problems
- Provide concurrency
- Use commodity hardware to the fullest
- Exploit multiple execution units
- Take advantage of multi-core CPUs

Parallel overhead can be defined as the amount of time required to coordinate parallel tasks, as opposed to doing useful work. It is therefore important to identify whether you need a parallel computing setup in the first place, and what the implications are. Overhead includes factors such as:

- Task start-up time
- Synchronizations
- Data communications
- Software overhead imposed by parallel compilers, libraries, tools, operating system, etc.
- Task termination time
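Overhead is easy to underestimate. A quick way to check whether it dominates is to time a serial run against a parallel run of the same job; the sketch below (exact timings will vary by machine) deliberately uses tasks so small that start-up and communication costs usually outweigh the useful work:

```python
import time
from concurrent.futures import ProcessPoolExecutor

def tiny_task(x):
    return x * x            # trivial work: coordination overhead will dominate

if __name__ == "__main__":
    data = list(range(10_000))

    start = time.perf_counter()
    serial = [tiny_task(x) for x in data]
    t_serial = time.perf_counter() - start

    start = time.perf_counter()
    with ProcessPoolExecutor() as pool:         # pays start-up + communication cost
        parallel = list(pool.map(tiny_task, data))
    t_parallel = time.perf_counter() - start

    assert serial == parallel
    print(f"serial: {t_serial:.4f}s  parallel: {t_parallel:.4f}s")
    # For tasks this small, the parallel run is often *slower* than the serial one.
```

If the parallel version loses here, that is the overhead list above in action, and a sign the problem (or chunk size) is too fine-grained to parallelize profitably.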

Undoubtedly, the first step in developing parallel software is to understand the problem that you wish to solve in parallel. If you are starting from a serial program, this also means understanding the existing code.

Before spending time in an attempt to develop a parallel solution for a problem, determine whether or not the problem is one that can actually be parallelized.

Consider calculating the Fibonacci series (1, 1, 2, 3, 5, 8, 13, 21, ...) using the formula:

F(k + 2) = F(k + 1) + F(k)

This is a non-parallelizable problem, because calculating the Fibonacci sequence this way entails dependent calculations rather than independent ones. The value at k + 2 uses the values at both k + 1 and k. These three terms cannot be calculated independently, and therefore cannot be calculated in parallel.
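The dependency is plain in code: each loop iteration reads the results of the two previous iterations, so no iteration can start before its predecessors finish.

```python
def fib(n):
    """First n Fibonacci numbers; each step needs the two previous results."""
    seq = [1, 1]
    for _ in range(n - 2):
        seq.append(seq[-1] + seq[-2])   # F(k+2) = F(k+1) + F(k): cannot start early
    return seq[:n]

print(fib(8))   # [1, 1, 2, 3, 5, 8, 13, 21]
```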

Now consider a different task: transform a given description column to remove links and email ids, and convert all words to lowercase, for each data point.

This problem can be solved in parallel, because each data point can be operated on independently of the others. Data cleaning/transformation is therefore a phase where parallelization is worth considering.
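A minimal sketch of this cleaning step, again with the standard-library `concurrent.futures` (the regex patterns for links and email ids are illustrative assumptions; real data will need more careful patterns):

```python
import re
from concurrent.futures import ProcessPoolExecutor

# Hypothetical patterns for links and email ids; tune these for real data.
LINK_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\S+@\S+")

def clean(description):
    """Clean one row; rows are independent, so they can be cleaned in parallel."""
    text = LINK_RE.sub("", description)          # drop links
    text = EMAIL_RE.sub("", text)                # drop email ids
    text = re.sub(r"\s+", " ", text).strip()     # tidy leftover whitespace
    return text.lower()                          # lowercase everything

if __name__ == "__main__":
    rows = [
        "Contact me at john@example.com for DETAILS",
        "See https://example.com for MORE Info",
    ]
    with ProcessPoolExecutor() as pool:
        cleaned = list(pool.map(clean, rows))    # one row per task, no dependencies
    print(cleaned)
```

Unlike the Fibonacci case, `clean` never looks at any row other than its own, which is exactly what makes the work embarrassingly parallel.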

Cover image: http://cdn.phys.org/newman/gfx/news/hires/2014/3-searchingfor.jpg

*Originally posted here.*


© 2020 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC

