
Dan Kellett's Blog (9)

Open sourcing 'spot the difference'

Capital One UK’s Data Science team has been focused on moving from proprietary (paid-for) software to open source for some time now.

There are several key benefits to making this change. Open source software is prevalent in academia which makes it much easier for our new starters to hit the ground running, building models and analysing data on day one with the company (the switch has also been a terrific development opportunity for my team to learn new skills). Our team now has greater…


Added by Dan Kellett on July 21, 2017 at 1:30am — No Comments

Making data science accessible – Data Munging

By Dan Kellett, Director of Data Science, Capital One UK

Over the past few months my blogs have attempted to demystify some of the techniques used by Data Scientists to build models or process large amounts of data. For all the flashy techniques and algorithms, this is not where Data Scientists spend 90% of their time. The hard yards of any analysis lie in…


Added by Dan Kellett on August 9, 2016 at 1:00am — No Comments

Making data science accessible – HDFS

By Dan Kellett, Director of Data Science, Capital One UK

Disclaimer: This is my attempt to explain some of the ‘Big Data’ concepts using basic analogies. There are inevitably nuances my analogy misses.

What is HDFS?

When people talk about ‘Hadoop’ they are usually referring to either the efficient storing or processing of large amounts of data. MapReduce is a framework for efficient processing using a parallel, distributed algorithm…


Added by Dan Kellett on July 21, 2016 at 2:00am — No Comments

Making data science accessible – Neural Networks

By Dan Kellett, Director of Data Science, Capital One UK

What are Neural Networks?

Neural Networks are a family of Machine Learning techniques modelled on the human brain. Being able to extract hidden patterns within data is a key ability for any Data Scientist and Neural Network approaches may be especially useful for extracting…


Added by Dan Kellett on July 5, 2016 at 10:47am — No Comments

Making data science accessible – Text Mining

What is Text Mining?

Text Mining is a general catch-all for a range of techniques for extracting information from text strings. Being able to extract, clean and summarize text data is a key ability for any Data Scientist. The following blog aims to highlight some of the process steps I use to clean text data as well as some summarization methods.

Initial cleaning

To illustrate some of the approaches to text…


Added by Dan Kellett on June 14, 2016 at 8:06am — No Comments

Making data science accessible – Logistic Regression

What is Logistic Regression?

Regression is a modelling technique for predicting the values of an outcome variable from one or more explanatory variables. Logistic Regression is a specific approach for describing a binary outcome variable (for example yes/no). Let’s assume you own a new boutique shop. You have a list of potential clients you are thinking of inviting to a special event with the aim of maximizing the number of sales. Who should you invite? Data on…


Added by Dan Kellett on May 26, 2016 at 9:56am — 1 Comment
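
As a rough illustration of the Logistic Regression excerpt above, the sketch below shows how a logistic model turns an explanatory variable into a probability of a yes/no outcome. Everything specific here is an assumption made for the example and does not come from the post: the `visits` variable, the `intercept` and `weight` values, and the function names.

```python
import math


def logistic(z):
    """Map any real number to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def purchase_probability(visits, intercept=-2.0, weight=0.5):
    """Probability that a client buys, given past shop visits.

    The intercept and weight are hypothetical; in practice they would be
    estimated from historical data rather than chosen by hand.
    """
    return logistic(intercept + weight * visits)


# Score three hypothetical clients with 0, 4 and 8 previous visits.
probabilities = [purchase_probability(v) for v in (0, 4, 8)]
print(probabilities)
```

With coefficients like these, clients could be ranked by predicted probability and the invitations sent to those at the top of the list.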

Making data science accessible - Markov Chains

What are Markov Chains?

A Markov chain is a random process with the property that the next state depends only on the current state. For example, if you have the choice of red or blue twice, the process would be Markovian if, each time you chose, the decision had nothing to do with your previous choice (see diagram below). How can Markov Chains help us?…


Added by Dan Kellett on May 3, 2016 at 1:05am — No Comments
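
The Markov property described in the excerpt above can be sketched in a few lines. This is a minimal illustration, not the post's own example: the red/blue transition probabilities in `TRANSITIONS` are made up, and `next_state` and `simulate` are hypothetical helper names.

```python
import random

# Hypothetical transition probabilities, chosen only for illustration.
# Each row gives the chance of moving to each colour from the current one.
TRANSITIONS = {
    "red": {"red": 0.7, "blue": 0.3},
    "blue": {"red": 0.4, "blue": 0.6},
}


def next_state(current, rng):
    """Draw the next state using only the current state (the Markov property)."""
    options = TRANSITIONS[current]
    return rng.choices(list(options), weights=list(options.values()), k=1)[0]


def simulate(start, steps, seed=0):
    """Generate a path of `steps` transitions beginning at `start`."""
    rng = random.Random(seed)  # seeded so repeated runs give the same path
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], rng))
    return path


chain = simulate("red", 10)
print(chain)
```

Note that `next_state` never looks further back than `path[-1]`: the earlier history of the chain plays no part in the draw.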

Making data science accessible - Machine Learning – Tree Methods

What are Tree Methods?

Tree methods are commonly used in data science to understand patterns within data and to build predictive models. The term Tree Methods covers a variety of techniques with different levels of complexity, but my aim is to highlight three I find useful. To set the problem up, let’s assume we have a census dataset containing age, education, employment status and so on. Given all this information we want to see if we can predict whether a person…


Added by Dan Kellett on April 12, 2016 at 1:33am — 1 Comment

Making data science accessible - MapReduce

What is MapReduce? 

When people talk about ‘Hadoop’ they are usually referring to either the efficient storing or processing of large amounts of data. The standard approach to reliable, scalable data storage in Hadoop is through the use of HDFS (Hadoop Distributed File System) which may be a topic for a future blog. MapReduce is a framework for efficient processing using a parallel, distributed algorithm. Over the past 18 months we have used MapReduce for a variety of analytic…


Added by Dan Kellett on March 21, 2016 at 9:11am — No Comments
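
To make the MapReduce idea from the excerpt above concrete, here is a single-machine sketch of the map, shuffle and reduce steps applied to a word count. It is only an illustration of the pattern under simple assumptions: a real Hadoop job would spread these phases across many machines, and the function names and sample documents are invented for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input; each document could be mapped on a different worker.
documents = ["big data big ideas", "data science"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(pairs))
print(word_counts)
```

Because each call to `map_phase` depends on one document alone, and each key is reduced independently, both phases are natural candidates for running in parallel.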
