A Data Science Central Community
Top 50 data science / big data techniques, described in fewer than 40 words, for decision makers. Please help us: any definition that you contribute will have your name attached to it. Send your definition, or a new term and its definition, to [email protected]. See also The Data Science Alphabet.
Reference: Introduction to Machine Learning, Ethem Alpaydin, The MIT Press (2004)
Comment
Depending on the level of exposure and experience of the decision makers and other interested users of this Data Science dictionary, it may be beneficial to provide practical illustrations for a selection of the definitions, if the 40-word restriction permits it.
Principal Component Analysis (PCA): technique used to analyze and weight predictors to predetermine optimal model discrimination
Sensitivity Analysis: process used to determine the sensitivity of a predictive model to noise, missing data, outliers, and other anomalies in the model predictors
22. Maximum Likelihood: see end results of "34. Stepwise Regression" (probably neither precise enough nor unpacked enough as a definition).
33. "Six Sigma is a set of tools and strategies for process improvement originally developed by Motorola in 1985....Six Sigma seeks to improve the quality of process outputs by identifying and removing the causes of defects (errors) and minimizing variability in manufacturing and business processes" excerpted from: http://en.wikipedia.org/wiki/Six_sigma
I have been studying Six Sigma through a couple different courses. This is a good short summary. The details are more extensive of course. So the question really is which version of "Six Sigma" are we looking for here :)
A few new terms (Source: http://hkotadia.com/archives/5427)
1. Hadoop: System for processing very large data sets
2. HDFS or Hadoop Distributed File System: For storage of large volumes of data (key elements: DataNodes and NameNode; the TaskTracker is a MapReduce component)
3. MapReduce: Think of it as Assembly level language for distributed computing. Used for computation in Hadoop
4. Pig: Developed by Yahoo. It is a higher level language than MapReduce
5. Hive: Higher level language developed by Facebook with SQL like syntax
6. Apache HBase: For realtime access to Hadoop data
7. Accumulo: Improved HBase with new features like cell level security
8. AVRO: New data serialization format (protocol buffers etc.)
9. Apache ZooKeeper: Distributed coordination system
10. HCatalog: For combining meta store of Hive and merging with what Pig does
11. Oozie: Scheduling system developed by Yahoo
12. Flume: Log aggregation system
13. Whirr: For automating Hadoop cluster processing
14. Sqoop: For transferring structured data to Hadoop
15. Mahout: Machine learning on top of MapReduce
16. Bigtop: Integrates multiple Hadoop subsystems into one that works as a whole
17. Crunch: Runs on top of MapReduce, Java API for tedious tasks like joining and data aggregation.
18. Giraph: Used for large scale distributed graph processing
Also, embedded below is an excellent TechTalk by Jakob Homan of LinkedIn on the subject explaining these tech terms.
5. Cross-Validation: Cross-validation is a general computer-intensive approach used in estimating the accuracy of statistical models. The idea of cross-validation is to split the data into N subsets, to put one subset aside, to estimate parameters of the model from the remaining N-1 subsets, and to use the retained subset to estimate the error of the model. This process is repeated N times, with each of the N subsets being used once as the validation set. The error values obtained in these N steps are then combined to provide the final estimate of the model error.
Cross-validation is used in various classification and prediction procedures, such as regression analysis, discriminant analysis, neural networks, and classification and regression trees (CART).
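The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names (k_fold_indices, cross_validate), the toy "predict the training mean" model, and the choice of mean squared error as the error measure are all assumptions made for the example.

```python
import random
from statistics import mean


def k_fold_indices(n_items, n_folds, seed=0):
    """Shuffle the indices 0..n_items-1 and deal them into n_folds subsets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(xs, ys, fit, predict, n_folds=5, seed=0):
    """Average the held-out mean squared error over the n_folds subsets."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), n_folds, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Estimate the model from the remaining N-1 subsets only.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        # Estimate the error on the subset that was put aside.
        errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in held_out]
        fold_errors.append(mean(errors))
    return mean(fold_errors)


# Hypothetical toy model: always predict the mean of the training responses.
def fit_mean(train_x, train_y):
    return mean(train_y)


def predict_mean(model, x):
    return model


xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
cv_error = cross_validate(xs, ys, fit_mean, predict_mean, n_folds=5)
print(f"Cross-validated mean squared error: {cv_error:.2f}")
```

Any model can be plugged in through the fit and predict arguments; the point of the sketch is only that each record is used for validation exactly once and never to fit the model that is scored on it.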
7. Design of Experiments: Design of experiments is concerned with optimization of the plan of experimental studies. The goal is to improve the quality of the decision that is made from the outcome of the study on the basis of statistical methods, and to ensure that maximum information is obtained from scarce experimental data.
If the decision-making process is based on statistical hypothesis testing (e.g. on analysis of variance), then the goal is to increase the power of the statistical test. If the decision-making process is based on estimation of the parameters of interest (e.g. using regression analysis), then the goal is to increase the precision of the estimates of the parameters derived from the outcome of the experiment.
The term "design" is also used to refer to the specific plan of the experiment that has been obtained in the course of the designing procedure. See, for example, crossover design, parallel design, self-controlled design, complete block design.
See also: variables in the design of experiments, general linear model for a Latin square.
12. General Linear Model: General (or generalized) linear models (GLM), in contrast to linear models, allow you to describe both additive and non-additive relationships between a dependent variable and N independent variables. The independent variables in GLM may be continuous as well as discrete. (The dependent variable is often named "response"; independent variables are named "factors" or "covariates", depending on whether they are controlled or not.)
Consider a clinical trial investigating the effect of two drugs on survival time. Each drug is tested at three levels ("not used", "low dose", "high dose"), and all 9 (= 3 x 3) combinations of the three levels of the two drugs are tested. The following general linear model might have been used:
Y = A + B X + C_{i} + D_{j} + R_{ij} + N
where Y is survival time (response), i and j correspond to the three levels of drug I and drug II respectively, X is age, C_{i} are additive effects (called "main effects") of each level of drug I, D_{j} are main effects of drug II, R_{ij} are non-additive effects (called interaction effects or simply "interactions") of drugs I and II, and N is random deviation.
We have here three independent variables: two discrete factors, "drug I" and "drug II", with three levels each, and a continuous covariate, "age".
In this particular case, because each of the two factors (drugs) has a zero level i,j=1 ("not used"), main effects C_{1}, D_{1}, and interactions R_{1j}, j=1,2,3; R_{i1}, i=1,2,3 are zeros. The remaining unknown coefficients (A, B, C_{i}, D_{j}, R_{ij}) are estimated from the data. The main effects C_{i}, D_{j} of the two drugs and their interaction effects R_{ij} are of primary interest. For example, their positive values would indicate a positive effect: longer survival time due to use of the drug(s).
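To make the role of each coefficient concrete, here is a small Python sketch. It assumes the additive form Y = A + B*X + C_i + D_j + R_ij + N that the coefficient list above implies, and every numeric value in it is hypothetical, chosen only to show the structure (in practice the coefficients would be estimated from trial data).

```python
# Hypothetical coefficient values, for illustration only.
A = 10.0                       # intercept
B = -0.05                      # slope for the covariate X (age)
C = {1: 0.0, 2: 1.5, 3: 2.5}   # main effects of drug I; level 1 = "not used"
D = {1: 0.0, 2: 0.8, 3: 1.2}   # main effects of drug II; level 1 = "not used"

# Interactions: zero whenever either drug is at its "not used" level.
R = {(i, j): 0.0 for i in (1, 2, 3) for j in (1, 2, 3)}
R.update({(2, 2): 0.3, (2, 3): 0.4, (3, 2): 0.5, (3, 3): -0.6})


def expected_survival(i, j, age):
    """Systematic part of the model: A + B*X + C_i + D_j + R_ij (no random term N)."""
    return A + B * age + C[i] + D[j] + R[(i, j)]


# Expected survival time for each of the 9 drug combinations at age 60.
for i in (1, 2, 3):
    for j in (1, 2, 3):
        print(f"drug I level {i}, drug II level {j}: "
              f"{expected_survival(i, j, age=60):.2f}")
```

With these made-up numbers the combination (3, 3) has a negative interaction, so the two high doses together do less than the sum of their separate main effects; that is what "non-additive" means here.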
20. Logistic Regression: Logistic regression is used with binary data when you want to model the probability that a specified outcome will occur. Specifically, it is aimed at estimating parameters a and b in the following model:
log(p_{i} / (1 - p_{i})) = a + b x_{i}
where p_{i} is the probability of a success for given value x_{i} of the explanatory variable X.
Use of the log of the odds p/(1-p) (the logit) guarantees that the predicted value of p will always be between 0 and 1.
See also: Regression analysis.
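A short Python sketch shows why the logit keeps predictions inside (0, 1): solving log(p/(1-p)) = a + b*x for p gives p = 1/(1 + exp(-(a + b*x))). The parameter values a and b below are hypothetical; estimating them from data (usually by maximum likelihood) is not shown.

```python
import math


def logit(p):
    """Log of the odds p / (1 - p); defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def predicted_probability(a, b, x):
    """Invert the logit: p = 1 / (1 + exp(-(a + b*x)))."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


a, b = -3.0, 0.5  # hypothetical parameter values, for illustration only
for x in (-20, 0, 6, 40):
    p = predicted_probability(a, b, x)
    print(f"x = {x:>3}: p = {p:.6f}")
```

However large or small a + b*x becomes, the predicted p approaches 0 or 1 without crossing them, whereas a straight-line model for p itself could predict impossible probabilities.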
29. Naive Bayes Classification: The Naive Bayes method is a method of classification applicable to categorical data, based on Bayes' theorem. For a record to be classified, the categories of the predictor variables are noted and the record is assigned to the most frequent class among the training records that share those predictor values. A rigorous application of Bayes' theorem would require all possible combinations of the values of the predictor variables to be available. When the number of variables is large enough, this requires a training set of unrealistically large size (and, indeed, even a huge training set is unlikely to cover all possible combinations). The naive Bayes method overcomes this practical limitation of the rigorous Bayes approach to classification.
The major idea of naive Bayes is to assume that the predictor variables are independent random variables within each class. This assumption makes it possible to compute the probabilities required by the Bayes formula from a relatively small training set.
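As a rough illustration of how the independence assumption is used, the Python sketch below scores each class as P(class) multiplied by the per-variable frequencies P(value | class), so it can classify a combination of values that never appears in the training set. The function names and the tiny weather-style data set are invented for this example.

```python
from collections import Counter, defaultdict


def train_naive_bayes(records, labels):
    """Count class frequencies and per-variable value frequencies within each class."""
    class_counts = Counter(labels)
    # value_counts[(variable_index, class_label)][value] -> count
    value_counts = defaultdict(Counter)
    for record, label in zip(records, labels):
        for k, value in enumerate(record):
            value_counts[(k, label)][value] += 1
    return class_counts, value_counts


def classify(record, class_counts, value_counts):
    """Return the class maximizing P(class) * product over k of P(value_k | class)."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total
        for k, value in enumerate(record):
            score *= value_counts[(k, label)][value] / n_label
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical training set: (outlook, temperature) -> decision.
records = [("sunny", "hot"), ("sunny", "mild"), ("sunny", "hot"), ("cloudy", "mild"),
           ("rainy", "cool"), ("rainy", "mild"), ("cloudy", "cool"), ("rainy", "cool")]
labels = ["play", "play", "play", "play", "stay", "stay", "stay", "stay"]

class_counts, value_counts = train_naive_bayes(records, labels)
# ("cloudy", "hot") never occurs as a combination in the training set.
label, scores = classify(("cloudy", "hot"), class_counts, value_counts)
print(label, scores)
```

Note that a value never seen within a class drives that class's score to zero; practical implementations usually smooth the counts to avoid this, which the sketch leaves out for brevity.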
31. Principal Components Analysis: The purpose of principal component analysis is to derive a small number of linear combinations (principal components) of a set of variables that retain as much of the information in the original variables as possible. This technique is often used when there are large numbers of variables, and you wish to reduce them to a smaller number of variable combinations by combining similar variables (ones that contain much the same information).
Principal components are linear combinations of variables that retain the maximal amount of information about the variables. The term "maximal amount of information" here means the best least-squares fit, or, in other words, maximal ability to explain the variance of the original data.
In technical terms, a principal component for a given set of N-dimensional data is a linear combination of the original variables with coefficients equal to the components of an eigenvector of the correlation or covariance matrix. Principal components are usually sorted in descending order of the eigenvalues, i.e. the first principal component corresponds to the eigenvector with the maximal eigenvalue.
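The eigenvector description can be tried out on a toy data set. The sketch below builds the covariance matrix in plain Python and approximates its leading eigenvector by power iteration; power iteration is just one convenient way to get the first component (a linear algebra library would normally be used), and the five data points are made up so that the second variable is roughly twice the first.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def first_principal_component(cov, iterations=500):
    """Power iteration: approximates the eigenvector with the largest eigenvalue."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' * cov * v gives the matching eigenvalue (v has unit length).
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    return eigenvalue, v


# Hypothetical data: the second variable is roughly twice the first.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
cov = covariance_matrix(rows)
eigenvalue, direction = first_principal_component(cov)
total_variance = sum(cov[k][k] for k in range(len(cov)))

print("first principal component:", [round(x, 3) for x in direction])
print("share of variance explained:", round(eigenvalue / total_variance, 4))
```

Because the two variables carry almost the same information, a single linear combination accounts for nearly all of the variance, which is the situation in which reducing the number of variables loses little.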
33. Six Sigma: Six sigma means literally six standard deviations. The phrase refers to the limits drawn on statistical process control charts used to plot statistics from samples taken regularly from a production process. Consider the process mean. A process is deemed to be "in control" at any given point in time if the mean of the sample at that time is within six standard deviations of the overall process mean to that point. In this case, "standard deviation" means the standard deviation of the sample mean. Six sigmas (= six standard deviations) is a very broad range, and the use of six sigmas, rather than 3 sigmas, was popularized by Motorola. It poses substantial demands on the manufacturing process to limit variability of output so that a six-sigma-wide band lies within the limits of an acceptable process.
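The control-limit arithmetic described above can be sketched as follows, taking the standard deviation of the sample mean to be the process standard deviation divided by the square root of the sample size. The process figures and the sample are hypothetical, and the function names are invented for this illustration.

```python
import math
from statistics import mean


def control_limits(process_mean, process_sd, sample_size, n_sigmas):
    """Limits for a sample mean: process_mean +/- n_sigmas * process_sd / sqrt(sample_size)."""
    sd_of_sample_mean = process_sd / math.sqrt(sample_size)
    return (process_mean - n_sigmas * sd_of_sample_mean,
            process_mean + n_sigmas * sd_of_sample_mean)


def in_control(sample, process_mean, process_sd, n_sigmas):
    """True if the sample mean falls inside the n-sigma control limits."""
    lower, upper = control_limits(process_mean, process_sd, len(sample), n_sigmas)
    return lower <= mean(sample) <= upper


# Hypothetical process: overall mean 50.0, standard deviation 2.0, samples of size 4.
sample = [54.0, 55.0, 53.5, 55.5]
for n_sigmas in (3, 6):
    lower, upper = control_limits(50.0, 2.0, len(sample), n_sigmas)
    status = "in control" if in_control(sample, 50.0, 2.0, n_sigmas) else "out of control"
    print(f"{n_sigmas}-sigma limits: ({lower:.1f}, {upper:.1f}) -> {status}")
```

The same sample is flagged under 3-sigma limits but accepted under 6-sigma limits, which shows how much wider the six-sigma band is and why fitting it inside the acceptable range demands low process variability.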
Here is our Glossary: http://www.statistics.com/resources/glossary/
Correlation: emulating human behaviour, correlation allows a user to traverse the complete knowledge payload of any corpus, producing exhaustive sets of pathways in response to N-dimensional queries.
© 2018 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC