
Top 50 data science / big data techniques, described in fewer than 40 words, for decision makers. Please help us: any definition that you contribute will have your name attached to it. Send your definition, or a new term and its definition, to [email protected]. See also The Data Science Alphabet.

  1. Adjusted R^2 (R-Square): A criterion statisticians often use for determining which variables to include in a model. It is a modified version of R^2 which penalizes each new variable on the basis of how many have already been admitted. By construction, R^2 never decreases as you add new variables, which results in models that over-fit the data and have poor predictive ability. Adjusted R^2 results in more parsimonious models that admit new variables only if the improvement in fit is larger than the penalty, which serves the ultimate goal of out-of-sample prediction. (Submitted by Santiago Perez)
  2. Bayesian Networks
  3. Boosted Models
  4. Cluster Analysis: Methods to assign a set of objects into groups. These groups are called clusters and objects in a cluster are more similar to each other than to those in other clusters. Well known algorithms are hierarchical clustering, k-means, fuzzy clustering, supervised clustering. (submitted by Markus Schmidberger)
  5. Cross Validation
  6. Decision Trees: A tree of questions to guide an end user to a conclusion based on values from a single vector of data. The classic example is a medical diagnosis based on a set of symptoms for a particular patient. A common problem in data science is to automatically or semi-automatically generate decision trees based on large sets of data coupled to known conclusions. Example algorithms are CART and ID3. (Submitted by Michael Malak)
  7. Design of Experiments
  8. EM Algorithm
  9. Ensemble Methods
  10. Factor Analysis: used as a variable reduction technique to identify groups of clustered variables. (submitted by Vincent Granville)
  11. Feature Selection
  12. General Linear Model
  13. Goodness of Fit: The degree to which the predicted values created by a model minimize errors in cross-validation tests. However, over-fitting the data can be dangerous, as it results in a model that will have little predictive power on fresh data. True Goodness of Fit is determined by how well the model fits new data, i.e. its predictive ability. (submitted by Santiago Perez)
  14. Hadoop: Hadoop is an Open Source framework that supports large scale data analysis by allowing one to decompose questions into discrete chunks that can be executed independently very close to slices of the data in question and ultimately reassembled into an answer to the question posed. (submitted by Philip Best)
  15. Hidden Decision Trees
  16. Hierarchical Bayesian Models
  17. K-Means: Popular clustering algorithm that, for a given (a priori) K, finds K clusters by iteratively moving each cluster center to its cluster's center of gravity and then reassigning points to the nearest center. (Submitted by Michael Malak)
  18. Kernel Density estimator
  19. Linear Discrimination
  20. Logistic Regression
  21. Mahout: Apache Mahout is an Apache project to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification, often leveraging, but not limited to, the Hadoop platform.
  22. MapReduce: Model for processing large amounts of data efficiently. Original problem is "mapped" to smaller problems (which may themselves become "original" problems). Smaller problems are processed in parallel. Results of smaller problems are combined, or "reduced", into solution to original problem. (submitted by Melanie Jutras)
  23. Maximum Likelihood
  24. MCMC
  25. Mixture Models
  26. Model Fitting
  27. Monte-Carlo Simulations: Computing expectations and probabilities in models of random phenomena using many randomly sampled values. Akin to computing the probability of winning a given roulette bet (say black) by repeatedly placing it and counting the success ratio. Useful in complex models characterized by uncertainty. (submitted by Renato Vitolo)
  28. No SQL: "Not only SQL" is a group of database management systems. Data is not stored in tables like a relational database and is not based on the mathematical relationship between tables. It is a way of storing and retrieving unstructured data quickly. (submitted by Markus Schmidberger)
  29. Multidimensional Scaling: Reduces the dimension of the space by projecting an N*N (N = number of observations) similarity matrix into a 2-dimensional visual representation. The classical example is producing a geographic map with cities when the only data available is the travel time between each pair of cities. (submitted by Vincent Granville)
  30. Naive Bayes
  31. Non-parametric Statistics
  32. Pig: Pig is a scripting interface to Hadoop, meaning a lack of MapReduce programming experience won't hold you back. It's also known for being able to process a large variety of different data types.
  33. Principal Component Analysis
  34. Sensitivity Analysis
  35. Six Sigma
  36. Stepwise Regression: Variable selection process for multivariate regression. In forward stepwise selection, a seed variable is selected and each additional variable is entered into the model, but kept only if it significantly improves goodness of fit (as measured by the increase in R^2). Backward selection starts with all variables and removes them one by one until removing an additional one decreases R^2 by a non-trivial amount. Two deficiencies of this method are that the seed chosen disproportionately impacts which variables are kept, and that the decision is made using R^2, not Adjusted R^2. (submitted by Santiago Perez)
  37. Supervised Clustering
  38. Support Vector Machines
  39. Time Series: A set of (t, x) values where x is usually a scalar (though could be a vector) and the t values are usually sampled at regular intervals (though some time series are irregularly sampled). In the case of regularly sampled time series, the t is usually dropped from the actual data, replaced with just a t0 (start time) and delta-t that apply to the whole series. (Submitted by Michael Malak)
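
To make the K-Means entry (item 17) more concrete, here is a minimal sketch in Python on one-dimensional data. The function name `k_means`, the sample data, and the starting centers are assumptions made for illustration; production implementations handle multi-dimensional data and choose starting centers more carefully.

```python
def k_means(points, centers, max_iter=100):
    """Lloyd-style K-means on 1-D data; `centers` holds the K initial guesses."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean (center of gravity) of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]  # an empty cluster keeps its old center
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # nothing moved, so the algorithm has converged
            break
        centers = new_centers
    return centers, clusters


# Hypothetical data: two visibly separate groups, K = 2.
data = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, clusters = k_means(data, centers=[0.0, 5.0])
```

With these inputs the two centers settle on the means of the low group and the high group. K is fixed in advance by the length of the `centers` argument, matching the "a priori K" in the definition.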

Reference: Introduction to Machine Learning, Ethem Alpaydin, The MIT Press (2004)

Comment by Gee Tee on July 1, 2015 at 6:22pm

Depending on the level of exposure and experience of the decision makers and other interested users of this Data Science dictionary, it may be beneficial to provide practical illustrations to a selection of the definitions.....that is, if the 40 word restriction would permit it....!

Comment by Donald Krapohl on June 11, 2013 at 10:57am

Principal Component Analysis (PCA) - technique used to analyze and weight predictors to predetermine optimal model discrimination
Sensitivity Analysis - process used to determine the sensitivity of a predictive model to noise, missing data, outliers, and other anomalies in the model predictors

Comment by Tom Miller on May 6, 2013 at 1:24pm

22. Maximum Likelihood - see end results of "34. Stepwise Regression".  (probably not either precise enough or unpacked enough as a definition).

Comment by Tom Miller on May 6, 2013 at 1:12pm

33. "Six Sigma is a set of tools and strategies for process improvement originally developed by Motorola in 1985....Six Sigma seeks to improve the quality of process outputs by identifying and removing the causes of defects (errors) and minimizing variability in manufacturing and business processes" excerpted from: 

I have been studying Six Sigma through a couple different courses.  This is a good short summary.  The details are more extensive of course.  So the question really is which version of "Six Sigma" are we looking for here :)

Comment by Vincent Granville on April 20, 2013 at 12:27pm

A few new terms (Source:

1. Hadoop: System for processing very large data sets
2. HDFS or Hadoop Distributed File System: For storage of large volumes of data (key elements: DataNodes and NameNode; the TaskTracker belongs to MapReduce rather than HDFS)
3. MapReduce: Think of it as Assembly level language for distributed computing. Used for computation in Hadoop
4. Pig: Developed by Yahoo. It is a higher level language than MapReduce
5. Hive: Higher level language developed by Facebook with SQL like syntax
6. Apache HBase: For real-time access to Hadoop data
7. Accumulo: Improved HBase with new features like cell level security
8. AVRO: New data serialization format (protocol buffers etc.)
9. Apache ZooKeeper: Distributed co-ordination system
10. HCatalog: For combining meta store of Hive and merging with what Pig does
11. Oozie: Scheduling system developed by Yahoo
12. Flume: Log aggregation system
13. Whirr: For automating Hadoop cluster processing
14. Sqoop: For transferring structured data to Hadoop
15. Mahout: Machine learning on top of MapReduce
16. Bigtop: Integrate multiple Hadoop sub-systems into one that works as a whole
17. Crunch: Runs on top of MapReduce, Java API for tedious tasks like joining and data aggregation.
18. Giraph: Used for large scale distributed graph processing
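
As a rough illustration of the MapReduce model that most of the tools above build on, here is a word count written as plain single-process Python. It does not use Hadoop or any Hadoop API; the function names `map_phase`, `shuffle` and `reduce_phase` and the sample documents are assumptions made for this sketch.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit one (word, 1) pair per word in a single document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


documents = ["big data big ideas", "data science"]
# On a cluster each document could be mapped on a different machine; here we just loop.
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

The map calls are independent of each other, and so are the reduce calls, which is what lets a framework spread them across machines.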

Also, embedded below is an excellent TechTalk by Jakob Homan of LinkedIn on the subject explaining these tech terms.

Comment by Janet Dobbins on November 29, 2012 at 2:48pm

5. Cross-Validation: Cross-validation is a general computer-intensive approach used in estimating the accuracy of statistical models. The idea of cross-validation is to split the data into N subsets, to put one subset aside, to estimate parameters of the model from the remaining N-1 subsets, and to use the retained subset to estimate the error of the model. This process is repeated N times, with each of the N subsets being used as the validation set in turn. The error values obtained in these N steps are then combined to provide the final estimate of the model error.

Cross-validation is used in various classification and prediction procedures, such as regression analysis, discriminant analysis, neural networks and classification and regression trees (CART).
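
The procedure can be sketched in a few lines of Python. The "model" here is deliberately trivial (predict the mean of the training data) and the error measure is mean squared error; both are assumptions chosen to keep the example short, as are the function name `k_fold_cv` and the sample data.

```python
def k_fold_cv(values, n_folds):
    """Cross-validated squared error of a 'predict the training mean' model."""
    # Fold k takes every n_folds-th observation, starting at index k.
    folds = [values[k::n_folds] for k in range(n_folds)]
    fold_errors = []
    for k, held_out in enumerate(folds):
        # Estimate the model from the other N-1 subsets.
        training = [v for j, fold in enumerate(folds) if j != k for v in fold]
        prediction = sum(training) / len(training)
        # Estimate the error on the subset that was set aside.
        mse = sum((v - prediction) ** 2 for v in held_out) / len(held_out)
        fold_errors.append(mse)
    # Combine the N per-fold errors into the final estimate.
    return sum(fold_errors) / len(fold_errors)


observations = [2.0, 4.0, 6.0, 8.0]
cv_error = k_fold_cv(observations, n_folds=2)
```

Averaging the per-fold errors is one common way to combine them; other summaries are possible.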

7. Design of Experiments:  Design of experiments is concerned with optimization of the plan of experimental studies. The goal is to improve the quality of the decision that is made from the outcome of the study on the basis of statistical methods, and to ensure that maximum information is obtained from scarce experimental data.

If the decision making process is based on statistical hypothesis testing (e.g. on analysis of variance), then the goal is to increase the power of the statistical test. If the decision making process is based on estimation of the parameters of interest (e.g. using regression analysis), then the goal is to increase the precision of the estimates of the parameters derived from the outcome of the experiment.

The term "design" is also used to refer to the specific plan of the experiment that has been obtained in the course of the designing procedure. See, for example, crossover design, parallel design, self-controlled design, complete block design.

See also: variables in the design of experiments, general linear model for a latin square.

12. General Linear Model: General linear models (GLM), in contrast to simple linear models, allow you to describe both additive and non-additive relationships between a dependent variable and N independent variables. The independent variables in a GLM may be continuous as well as discrete. (The dependent variable is often called the "response", and the independent variables "factors" or "covariates", depending on whether they are controlled or not.)

Consider a clinical trial investigating the effect of two drugs on survival time. Each drug is tested at three levels - "not used", "low dose", "high dose", and all the 9 (=3x3) combinations of the three levels of the two drugs are tested. The following general linear model might have been used:

Yij = A + B X + Ci + Dj + Rij + N; i,j = 1,2,3;

where Y is survival time (response), i and j correspond to the three levels of drug I and drug II respectively, X is age, Ci are additive effects (called "main effects") of each level of drug I, Dj are main effects of drug II, Rij are non-additive effects (called interaction effects or simply "interactions") of drugs I and II, N is random deviation.

We have here three independent variables: two discrete factors - "drug I" and "drug II" with three levels each, and a continuous covariate "age".

In this particular case, because each of the two factors (drugs) has a zero level i,j=1 ("not used"), the main effects C1, D1 and the interactions R1j, j=1,2,3; Ri1, i=1,2,3 are zeros. The remaining unknown coefficients - A, B, Ci, Dj, Rij - are estimated from the data. The main effects Ci, Dj of the two drugs and their interaction effects Rij are of primary interest. For example, their positive values would indicate a positive effect - longer survival time due to use of the drug(s).
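
To show how the pieces of the model combine, here is a small Python sketch that evaluates the expected survival time for a given age and pair of drug levels. Every coefficient value below is hypothetical and chosen only for illustration (in practice they are estimated from the trial data), and the sketch assumes the random deviation N has mean zero.

```python
# Hypothetical coefficients; real values would be estimated from the data.
A = 10.0   # intercept
B = -0.1   # effect of age, the continuous covariate X

# Level 1 ("not used") is the zero level, so its effects are fixed at 0.
C = {1: 0.0, 2: 1.5, 3: 2.0}   # main effects of drug I
D = {1: 0.0, 2: 0.5, 3: 1.0}   # main effects of drug II

# Interactions: zero whenever either drug is at level 1.
R = {(i, j): 0.0 for i in (1, 2, 3) for j in (1, 2, 3)}
R.update({(2, 2): 0.2, (2, 3): 0.4, (3, 2): 0.3, (3, 3): -0.5})


def expected_survival(age, i, j):
    """Expected Y for drug I at level i and drug II at level j (N assumed to average to zero)."""
    return A + B * age + C[i] + D[j] + R[(i, j)]


baseline = expected_survival(50, 1, 1)   # neither drug used
both_low = expected_survival(50, 2, 2)   # both drugs at low dose
```

The difference between `both_low` and `baseline` is the sum of the two main effects and their interaction, which is the quantity of primary interest in the trial.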

20. Logistic Regression:  Logistic regression is used with binary data when you want to model the probability that a specified outcome will occur. Specifically, it is aimed at estimating parameters a and b in the following model:

Li = log( pi / (1 - pi) ) = a + b xi,

where pi is the probability of a success for given value xi of the explanatory variable X.

Use of the log of the odds p/(1-p) (the logit) guarantees that the predicted value of p will always be between 0 and 1.
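
A short Python sketch of this property: the logit maps a probability to the whole real line, and inverting it always lands back between 0 and 1. The coefficient values for a and b are hypothetical; in practice they are estimated from the data, which this sketch does not attempt.

```python
import math


def logit(p):
    """Log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))


def predicted_probability(a, b, x):
    """Invert the logit: p = 1 / (1 + exp(-(a + b*x)))."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


# Hypothetical coefficients, chosen only for illustration.
a, b = -3.0, 0.5
probabilities = [predicted_probability(a, b, x) for x in (-10, 0, 6, 10)]
```

Whatever value a + b x takes, the predicted p stays strictly between 0 and 1, and applying `logit` to it recovers a + b x.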

See also: Regression analysis.

29. Naive Bayes Classification: The Naive Bayes method is a method of classification applicable to categorical data, based on Bayes' theorem. For a record to be classified, the categories of the predictor variables are noted and the record is classified according to the most frequent class among the same values of those predictor variables in the training set. A rigorous application of Bayes' theorem would require availability of all possible combinations of the values of the predictor variables. When the number of variables is large enough, this requires a training set of unrealistically large size (and, indeed, even a huge training set is unlikely to cover all possible combinations). The naive Bayes method overcomes this practical limitation of the rigorous Bayes approach to classification.

The central idea of naive Bayes is to assume that the predictor variables are independent of one another within each class. This assumption makes it possible to compute the probabilities required by the Bayes formula from a relatively small training set.
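
The following Python sketch shows the idea on a tiny, made-up training set: the record being classified is a combination of values that never appears in the training data, yet a score can still be computed one variable at a time. The function names, the data, and the absence of any smoothing for unseen values are assumptions made to keep the example short.

```python
from collections import Counter, defaultdict


def train(records, labels):
    """Count class frequencies and, per class, the frequency of each (variable, value)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # class -> Counter of (variable index, value)
    for record, label in zip(records, labels):
        for index, value in enumerate(record):
            value_counts[label][(index, value)] += 1
    return class_counts, value_counts


def classify(record, class_counts, value_counts):
    """Pick the class maximizing P(class) times the product of P(value | class)."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total
        for index, value in enumerate(record):
            # Independence assumption: multiply single-variable frequencies.
            score *= value_counts[label][(index, value)] / count
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical training set: (weather, wind) -> decision.
records = [("sunny", "windy"), ("rainy", "calm"), ("sunny", "windy"),
           ("rainy", "windy"), ("rainy", "windy")]
labels = ["play", "play", "play", "stay", "stay"]

model = train(records, labels)
# ("sunny", "calm") never occurs as a combination in the training set.
prediction, scores = classify(("sunny", "calm"), *model)
```

A value that never occurs with a class drives that class's score to zero here; practical implementations usually add a smoothing term to avoid this.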


Principal components analysis:  The purpose of principal component analysis is to derive a small number of linear combinations (principal components) of a set of variables that retain as much of the information in the original variables as possible. This technique is often used when there are large numbers of variables, and you wish to reduce them to a smaller number of variable combinations by combining similar variables (ones that contain much the same information).

Principal components are linear combinations of variables that retain maximal amount of information about the variables. The term "maximal amount of information" here means the best least-square fit, or, in other words, maximal ability to explain variance of the original data.

In technical terms, a principal component for a given set of N-dimensional data, is a linear combination of the original variables with coefficients equal to the components of an eigenvector of the correlation or covariance matrix. Principal components are usually sorted by descending order of the eigenvalues - i.e. the first principal component corresponds to the eigenvector with the maximal eigenvalue.
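
For two variables the eigenvalues and eigenvectors can be written in closed form, which allows a dependency-free Python sketch of the first principal component. It uses the covariance matrix (not the correlation matrix); the function name and the sample data are assumptions for illustration, and real analyses with more variables would use a linear algebra library.

```python
import math


def first_principal_component(xs, ys):
    """Largest eigenvalue and unit eigenvector of the 2x2 sample covariance matrix."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    c = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    largest, smallest = half_trace + spread, half_trace - spread
    # Eigenvector belonging to the largest eigenvalue.
    if b != 0:
        vx, vy = b, largest - a
    elif a >= c:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    # Share of total variance explained (assumes the data are not all constant).
    explained = largest / (largest + smallest)
    return largest, (vx / norm, vy / norm), explained


# Hypothetical data in which the second variable is exactly twice the first.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
largest, direction, explained = first_principal_component(xs, ys)
```

Because the two variables carry the same information, a single component along the direction (1, 2) explains essentially all of the variance.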

33. Six-Sigma: Six sigma means literally six standard deviations. The phrase refers to the limits drawn on statistical process control charts used to plot statistics from samples taken regularly from a production process. Consider the process mean. A process is deemed to be "in control" at any given point in time if the mean of the sample at that time is within six standard deviations of the overall process mean to that point. In this case, "standard deviation" means the standard deviation of the sample mean. Six sigmas (= six standard deviations) is a very broad range, and the use of six-sigmas, rather than 3-sigmas, was popularized by Motorola. It poses substantial demands on the manufacturing process to limit variability of output so that a six-sigma-wide band lies within the limits of an acceptable process.

Comment by Carl Wimmer on November 18, 2012 at 8:44am

Correlation - emulating human behaviour, correlation allows a user to traverse the complete knowledge payload of any corpus, producing  exhaustive sets of pathways in response to N-Dimensional Queries.
