A Data Science Central Community
Capital One UK’s Data Science team has been focused on moving from proprietary (paid-for) software to open source for some time now.
There are several key benefits to making this change. Open source software is prevalent in academia which makes it much easier for our new starters to hit the ground running, building models and analysing data on day one with the company (the switch has also been a terrific development opportunity for my team to learn new skills). Our team now has greater…
Added by Dan Kellett on July 21, 2017 at 1:30am
By Dan Kellett, Director of Data Science, Capital One UK
Over the past few months my blogs have attempted to demystify some of the techniques used by Data Scientists to build models or process large amounts of data. For all the flashy techniques and algorithms, this is not where Data Scientists spend 90% of their time. The hard yards of any analysis lie in…
Added by Dan Kellett on August 9, 2016 at 1:00am
By Dan Kellett, Director of Data Science, Capital One UK
Disclaimer: This is my attempt to explain some of the ‘Big Data’ concepts using basic analogies. There are inevitably nuances my analogy misses.
What is HDFS?
When people talk about ‘Hadoop’ they are usually referring to either the efficient storing or processing of large amounts of data. MapReduce is a framework for efficient processing using a parallel, distributed algorithm…
Added by Dan Kellett on July 21, 2016 at 2:00am
By Dan Kellett, Director of Data Science, Capital One UK
What are Neural Networks?
Neural Networks are a family of Machine Learning techniques modelled on the human brain. Being able to extract hidden patterns within data is a key ability for any Data Scientist and Neural Network approaches may be especially useful for…
Added by Dan Kellett on July 5, 2016 at 10:47am
What is Text Mining?
Text Mining is a general catch-all for a range of techniques for extracting information from text strings. Being able to extract, clean and summarize text data is a key ability for any Data Scientist. The following blog aims to highlight some of the process steps I use to clean text data as well as some summarization methods.
Initial cleaning
To illustrate some of the approaches to text…
Added by Dan Kellett on June 14, 2016 at 8:06am
What is Logistic Regression?
Regression is a modelling technique for predicting the values of an outcome variable from one or more explanatory variables. Logistic Regression is a specific approach for describing a binary outcome variable (for example yes/no). Let’s assume you own a new boutique shop. You have a list of potential clients you are thinking of inviting to a special event with the aim of maximizing the number of sales: who should you invite? Data on…
Added by Dan Kellett on May 26, 2016 at 9:56am
What are Markov Chains?
A Markov chain is a random process with the property that the next state depends only on the current state. For example: if you choose between red and blue twice, the process would be Markovian if the second choice had nothing to do with anything that came before the first choice (see diagram below). How can Markov Chains help us?…
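As a minimal sketch of the definition above (not code from the original post), the red/blue example can be simulated in Python. The transition probabilities in `TRANSITIONS` and the helper names `next_state` and `simulate` are assumptions made purely for illustration; the point is only that each draw looks at the current state and nothing earlier.

```python
import random

# Hypothetical transition probabilities, TRANSITIONS[current][next].
# Each row sums to 1. These values are illustrative, not from the post.
TRANSITIONS = {
    "red": {"red": 0.5, "blue": 0.5},
    "blue": {"red": 0.5, "blue": 0.5},
}


def next_state(current, rng):
    """Draw the next state using only the row for the current state."""
    options = list(TRANSITIONS[current])
    weights = [TRANSITIONS[current][state] for state in options]
    return rng.choices(options, weights=weights, k=1)[0]


def simulate(start, steps, seed=0):
    """Return the path of states visited, starting from `start`."""
    rng = random.Random(seed)  # seeded so repeated runs give the same path
    path = [start]
    for _ in range(steps):
        # Only path[-1] is consulted: this is the Markov property.
        path.append(next_state(path[-1], rng))
    return path


print(simulate("red", 2))
```

Changing a row of `TRANSITIONS` changes how likely each colour is after the other, but the next draw still never depends on anything older than the current state.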
Added by Dan Kellett on May 3, 2016 at 1:05am
What are Tree Methods?
Tree methods are commonly used in data science to understand patterns within data and to build predictive models. The term Tree Methods covers a variety of techniques with different levels of complexity but my aim is to highlight three I find useful. To set the problem up let’s assume we have a census dataset containing age, education, employment status and so on. Given all this information we want to see if we can predict whether a person…
Added by Dan Kellett on April 12, 2016 at 1:33am
What is MapReduce?
When people talk about ‘Hadoop’ they are usually referring to either the efficient storing or processing of large amounts of data. The standard approach to reliable, scalable data storage in Hadoop is through the use of HDFS (Hadoop Distributed File System) which may be a topic for a future blog. MapReduce is a framework for efficient processing using a parallel, distributed algorithm. Over the past 18 months we have used MapReduce for a variety of analytic…
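To make the map and reduce steps concrete, here is a rough single-process sketch in plain Python of the classic word-count pattern. It is not Hadoop code and not the approach used in the original post; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical, and a real cluster would run the map and reduce calls in parallel across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big", "data science"])
print(counts)  # {'big': 2, 'data': 2, 'science': 1}
```

Because each call to `map_phase` touches one line and each call to `reduce_phase` touches one key, both steps could in principle be spread across many workers, which is the property the framework relies on.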
Added by Dan Kellett on March 21, 2016 at 9:11am
© 2021 TechTarget, Inc.