A Data Science Central Community

Capital One UK’s Data Science team has been focused on moving from proprietary (paid-for) software to open source for some time now.

There are several key benefits to making this change. Open source software is prevalent in academia which makes it much easier for our new starters to hit the ground running, building models and analysing data on day one with the company (the switch has also been a terrific development opportunity for my team to learn new skills). Our team now has greater…

Added by Dan Kellett on July 21, 2017 at 1:30am — No Comments

By Dan Kellett, Director of Data Science, Capital One UK

*Over the past few months my blogs have attempted to demystify some of the techniques used by Data Scientists to build models or process large amounts of data. For all the flashy techniques and algorithms, this is not where Data Scientists spend 90% of their time. The hard yards of any analysis lie in…*

Added by Dan Kellett on August 9, 2016 at 1:00am — No Comments

By Dan Kellett, Director of Data Science, Capital One UK

*Disclaimer: This is my attempt to explain some of the ‘Big Data’ concepts using basic analogies. There are inevitably nuances my analogy misses.*

*What is HDFS?*

When people talk about ‘Hadoop’ they are usually referring to either the efficient storing or processing of large amounts of data. MapReduce is a framework for efficient processing using a parallel, distributed algorithm…

Added by Dan Kellett on July 21, 2016 at 2:00am — No Comments

By Dan Kellett, Director of Data Science, Capital One UK

*What are Neural Networks?*

Neural Networks are a family of Machine Learning techniques modelled on the human brain. Being able to extract hidden patterns within data is a key ability for any Data Scientist and Neural Network approaches may be especially useful for extracting…

Added by Dan Kellett on July 5, 2016 at 10:47am — No Comments

*What is Text Mining?*

Text Mining is a general catch-all for a range of techniques for extracting information from text strings. Being able to extract, clean and summarize text data is a key ability for any Data Scientist. The following blog aims to highlight some of the process steps I use to clean text data as well as some summarization methods.
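As a rough illustration of the extract, clean and summarize steps mentioned above, here is a minimal Python sketch. It is not the process from the full post: the stop-word list, the function names `clean_text` and `top_terms`, and the sample sentence are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "it"}


def clean_text(text):
    """Lower-case the text, replace non-letters with spaces, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return [token for token in text.split() if token not in STOP_WORDS]


def top_terms(text, n=3):
    """Summarize a string as its n most frequent cleaned tokens."""
    return Counter(clean_text(text)).most_common(n)


sample = "The cat sat on the mat. The cat, it seems, likes the mat!"
print(top_terms(sample, 2))
```

Counting term frequencies is only one of many possible summaries; it is used here because it needs nothing beyond the standard library.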

*Initial cleaning*

To illustrate some of the approaches to text…

Added by Dan Kellett on June 14, 2016 at 8:06am — No Comments

*What is Logistic Regression?*

Regression is a modelling technique for predicting the values of an outcome variable from one or more explanatory variables. Logistic Regression is a specific approach for describing a binary outcome variable (for example yes/no). Let’s assume you own a new boutique shop. You have a list of potential clients you are thinking of inviting to a special event with the aim of maximizing the number of sales – who should you invite? Data on…
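To make the boutique example a little more concrete, the sketch below fits a one-variable logistic regression by plain gradient descent. Everything specific in it is an assumption for illustration: the explanatory variable (number of previous visits), the tiny made-up dataset, and the helper names `fit_logistic` and `purchase_probability`. In practice you would more likely reach for a statistics library than hand-roll the fit.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(b0 + b1 * x) by batch gradient descent on log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y
            grad0 += error
            grad1 += error * x
        b0 -= lr * grad0 / n
        b1 -= lr * grad1 / n
    return b0, b1


# Made-up data: previous visits per client and whether they bought (1) or not (0).
visits = [0, 1, 2, 3, 4, 5, 6, 7]
bought = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(visits, bought)


def purchase_probability(n_visits):
    """Estimated probability that a client with n_visits previous visits buys."""
    return sigmoid(b0 + b1 * n_visits)


for v in (0, 3, 7):
    print(v, round(purchase_probability(v), 2))
```

One way to use such a model would be to invite the clients with the highest estimated probability of buying.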

Added by Dan Kellett on May 26, 2016 at 9:56am — 1 Comment

*What are Markov Chains?*

A Markov chain is a random process with the property that the next state depends only on the current state. For example, if you choose between red and blue twice, the process is Markovian if your second choice has nothing to do with your first (see diagram below). How can Markov Chains help us?…
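The defining property, that the next state depends only on the current one, can be sketched as a small simulation. The transition probabilities below are invented for illustration, and `next_state` and `simulate` are hypothetical helper names, not anything from the full post.

```python
import random

# Hypothetical transition probabilities: TRANSITIONS[current][next].
TRANSITIONS = {
    "red": {"red": 0.7, "blue": 0.3},
    "blue": {"red": 0.4, "blue": 0.6},
}


def next_state(current, transitions, rng):
    """Draw the next state using only the current state's row of probabilities."""
    options = transitions[current]
    return rng.choices(list(options), weights=list(options.values()))[0]


def simulate(start, steps, transitions=TRANSITIONS, seed=0):
    """Return the path of states visited, starting from `start`."""
    rng = random.Random(seed)  # seeded so repeated runs give the same path
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], transitions, rng))
    return path


print(simulate("red", 10))
```

Note that `next_state` never looks at the earlier part of the path, which is exactly the Markov property. Setting both rows of the table to the same probabilities would recover the independent red/blue choices from the example above.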

Added by Dan Kellett on May 3, 2016 at 1:05am — No Comments

*What are Tree Methods?*

Tree methods are commonly used in data science to understand patterns within data and to build predictive models. The term Tree Methods covers a variety of techniques with different levels of complexity but my aim is to highlight three I find useful. To set the problem up let’s assume we have a census dataset containing age, education, employment status and so on. Given all this information we want to see if we can predict whether a person…

Added by Dan Kellett on April 12, 2016 at 1:33am — 1 Comment

*What is MapReduce?*

When people talk about ‘Hadoop’ they are usually referring to either the efficient storing or processing of large amounts of data. The standard approach to reliable, scalable data storage in Hadoop is through the use of HDFS (Hadoop Distributed File System) which may be a topic for a future blog. MapReduce is a framework for efficient processing using a parallel, distributed algorithm. Over the past 18 months we have used MapReduce for a variety of analytic…
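The map, shuffle and reduce steps behind the framework can be sketched in a few lines of ordinary Python. This is a single-process toy, not the Hadoop API: nothing here runs in parallel, and the function names and the word-count task are assumptions chosen because word counting is the customary first example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map over every document, then shuffle, then reduce each group."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


print(word_count(["big data", "Big ideas"]))
```

In a real cluster each call to `map_phase` and `reduce_phase` could run on a different machine, which is where the efficiency on large data comes from.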

Added by Dan Kellett on March 21, 2016 at 9:11am — No Comments

- Open sourcing 'spot the difference'
- Making data science accessible – Data Munging
- Making data science accessible – HDFS
- Making data science accessible – Neural Networks
- Making data science accessible – Text Mining
- Making data science accessible – Logistic Regression
- Making data science accessible - Markov Chains

- Making data science accessible – Logistic Regression
- Making data science accessible – Text Mining
- Making data science accessible - Markov Chains
- Making data science accessible - MapReduce
- Making data science accessible – Neural Networks
- Making data science accessible - Machine Learning – Tree Methods
- Making data science accessible – HDFS

© 2019 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC
