Comments - Data Science Dictionary - AnalyticBridge2018-05-27T15:31:25Zhttps://www.analyticbridge.datasciencecentral.com/profiles/comment/feed?attachedTo=2004291%3ABlogPost%3A223153&xn_auth=noDepending on the level of exp…tag:www.analyticbridge.datasciencecentral.com,2015-07-02:2004291:Comment:3283372015-07-02T00:22:41.732ZGee Teehttps://www.analyticbridge.datasciencecentral.com/profile/GeeTee
<p>Depending on the level of exposure and experience of the decision makers and other interested users of this Data Science dictionary, it may be beneficial to provide practical illustrations to a selection of the definitions.....that is, if the 40 word restriction would permit it....!</p> Principal Component Analysis…tag:www.analyticbridge.datasciencecentral.com,2013-06-11:2004291:Comment:2506782013-06-11T16:57:00.095ZDonald Krapohlhttps://www.analyticbridge.datasciencecentral.com/profile/DonaldKrapohl
<p><strong>Principal Component Analysis (PCA)</strong> - technique used to analyze and weight predictors to predetermine optimal model discrimination<br/> <strong>Sensitivity Analysis</strong> - process used to determine the sensitivity of a predictive model to noise, missing data, outliers, and other anomalies in the model predictors</p> 22. Maximum Likelihood - see…tag:www.analyticbridge.datasciencecentral.com,2013-05-06:2004291:Comment:2445892013-05-06T19:24:40.581ZTom Millerhttps://www.analyticbridge.datasciencecentral.com/profile/TomMiller493
<p>22. <span>Maximum Likelihood</span> - see end results of "34. Stepwise Regression". (probably not either precise enough or unpacked enough as a definition).</p> 33. "Six Sigma is a set of to…tag:www.analyticbridge.datasciencecentral.com,2013-05-06:2004291:Comment:2445872013-05-06T19:12:34.498ZTom Millerhttps://www.analyticbridge.datasciencecentral.com/profile/TomMiller493
<p>33. "<b>Six Sigma</b><span> is a set of tools and strategies for process improvement originally developed by </span><a href="http://en.wikipedia.org/wiki/Motorola" title="Motorola">Motorola</a><span> in 1985....<span>Six Sigma seeks to improve the quality of process outputs by identifying and removing the causes of defects (errors) and minimizing </span><a href="http://en.wikipedia.org/wiki/Statistical_dispersion" title="Statistical dispersion">variability</a><span> in </span><a href="http://en.wikipedia.org/wiki/Manufacturing" title="Manufacturing">manufacturing</a><span> and </span><a href="http://en.wikipedia.org/wiki/Business_process" title="Business process">business processes</a>" excerpted from: <a href="http://en.wikipedia.org/wiki/Six_sigma">http://en.wikipedia.org/wiki/Six_sigma</a> </span></p>
<p><span>I have been studying Six Sigma through a couple different courses. This is a good short summary. The details are more extensive of course. So the question really is which version of "Six Sigma" are we looking for here :)</span></p>
<p></p> A few new terms (Source: http…tag:www.analyticbridge.datasciencecentral.com,2013-04-20:2004291:Comment:2424972013-04-20T18:27:45.388ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>A few new terms (Source: <a href="http://hkotadia.com/archives/5427">http://hkotadia.com/archives/5427</a>)</p>
<p><span>1. Hadoop: System for processing very large data sets</span><br/>2. HDFS or Hadoop Distributed File System: For storage of large volumes of data (key elements – Datanodes, Namenode and Tasktracker)<br/>3. MapReduce: Think of it as assembly-level language for distributed computing. Used for computation in Hadoop<br/>4. Pig: Developed by Yahoo. It is a higher-level language than MapReduce<br/>5. Hive: Higher-level language developed by Facebook with SQL-like syntax<br/>6. Apache HBase: For real-time access to Hadoop data<br/>7. Accumulo: Improved HBase with new features like cell-level security<br/>8. AVRO: New data serialization format (protocol buffers etc.)<br/>9. Apache ZooKeeper: Distributed coordination system<br/>10. HCatalog: For combining meta store of Hive and merging with what Pig does<br/>11. Oozie: Scheduling system developed by Yahoo<br/>12. Flume: Log aggregation system<br/>13. Whirr: For automating Hadoop cluster processing<br/>14. Sqoop: For transferring structured data to Hadoop<br/>15. Mahout: Machine learning on top of MapReduce<br/>16. Bigtop: Integrates multiple Hadoop sub-systems into one that works as a whole<br/>17. Crunch: Runs on top of MapReduce, Java API for tedious tasks like joining and data aggregation.<br/>18. Giraph: Used for large-scale distributed graph processing</p>
<p>Also, embedded below is an excellent TechTalk by Jakob Homan of LinkedIn on the subject explaining these tech terms.</p> 5. Cross-Validation: Cross-…tag:www.analyticbridge.datasciencecentral.com,2012-11-29:2004291:Comment:2240042012-11-29T21:48:40.478ZJanet Dobbinshttps://www.analyticbridge.datasciencecentral.com/profile/JanetDobbins657
<p>5. <font face="verdana" size="-1"><font face="verdana" size="-1"><b>Cross-Validation: </b></font></font> Cross-validation is a general computer-intensive approach used in estimating the accuracy of statistical models. The idea of cross-validation is to split the data into N subsets, to put one subset aside, to estimate parameters of the model from the remaining N-1 subsets, and to use the retained subset to estimate the error of the model. Such a process is repeated N times - with each of the N subsets being used as the <a href="http://www.statistics.com/index.php?page=glossary&term_id=873">validation set</a> . Then the values of the errors obtained in such N steps are combined to provide the final estimate of the model error.</p>
<p>Cross-validation is used in various classification and prediction procedures, such as <a href="http://www.statistics.com/index.php?page=glossary&term_id=482">regression analysis</a>, <a href="http://www.statistics.com/index.php?page=glossary&term_id=508">discriminant analysis</a>, <a href="http://www.statistics.com/index.php?page=glossary&term_id=266">neural networks</a> and <a href="http://www.statistics.com/index.php?page=glossary&term_id=721">classification and regression trees (CART)</a>.</p>
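<p>The split, fit, and hold-out loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the model is a one-predictor least-squares line, the error measure is mean squared error, and the names <code>fit_line</code> and <code>cross_validate</code> as well as the toy data are made up for the example.</p>

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def cross_validate(xs, ys, n_folds=5, seed=0):
    """Average held-out mean squared error over n_folds subsets."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    # Split the shuffled indices into n_folds disjoint subsets.
    folds = [idx[k::n_folds] for k in range(n_folds)]
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Estimate parameters from the remaining N-1 subsets only.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Estimate the error on the subset that was put aside.
        mse = sum((ys[i] - (a + b * xs[i])) ** 2 for i in held_out) / len(held_out)
        fold_errors.append(mse)
    # Combine the N error values into the final estimate.
    return sum(fold_errors) / n_folds

# Toy data (illustrative): a noisy straight line.
rng = random.Random(1)
xs = [float(i) for i in range(40)]
ys = [2.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]
cv_error = cross_validate(xs, ys)
```

<p>Here every observation is used for validation exactly once, and the model never sees the held-out subset while its parameters are being estimated.</p>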
<p></p>
<p>7. <font face="verdana" size="-1"><font face="verdana" size="-1"><b>Design of Experiments: </b></font></font> Design of experiments is concerned with optimization of the plan of experimental studies. The goal is to improve the quality of the decision that is made from the outcome of the study on the basis of statistical methods, and to ensure that maximum information is obtained from scarce experimental data.</p>
<p>If the decision making process is based on statistical <a href="http://www.statistics.com/index.php?page=glossary&term_id=634">hypothesis testing</a> (e.g. on <a href="http://www.statistics.com/index.php?page=glossary&term_id=609">analysis of variance</a> ), then the goal is to increase the <a href="http://www.statistics.com/index.php?page=glossary&term_id=662">power</a> of the <a href="http://www.statistics.com/index.php?page=glossary&term_id=670">statistical test</a> . If the decision making process is based on <a href="http://www.statistics.com/index.php?page=glossary&term_id=382">estimation</a> of the <a href="http://www.statistics.com/index.php?page=glossary&term_id=402">parameters</a> of interest (e.g. using <a href="http://www.statistics.com/index.php?page=glossary&term_id=482">regression analysis</a> ), then the goal is to increase the precision of the estimates of the parameters derived from the outcome of the experiment.</p>
<p>The term "design" is also used to refer to the specific plan of the experiment that has been obtained in the course of the designing procedure. See, for example, <a href="http://www.statistics.com/index.php?page=glossary&term_id=424">crossover design</a>, <a href="http://www.statistics.com/index.php?page=glossary&term_id=439">parallel design</a>, <a href="http://www.statistics.com/index.php?page=glossary&term_id=443">self-controlled design</a>, <a href="http://www.statistics.com/index.php?page=glossary&term_id=722">complete block design</a>.</p>
<p>See also: <a href="http://www.statistics.com/index.php?page=glossary&term_id=448">variables in the design of experiments</a> , <a href="http://www.statistics.com/index.php?page=glossary&term_id=769">general linear model for a latin square</a> .</p>
<p></p>
<p>12. <b>General Linear Model: </b> General (or generalized) linear models (GLM), in contrast to <a href="http://www.statistics.com/index.php?page=glossary&term_id=469">linear model</a>s, allow you to describe both additive and non-additive relationships between a dependent variable and N independent variables. The independent variables in GLM may be continuous as well as discrete. (The dependent variable is often named "response", independent variables - "factors" and "covariates", depending on whether they are controlled or not).</p>
<p>Consider a clinical trial investigating the effect of two drugs on survival time. Each drug is tested at three levels - "not used", "low dose", "high dose", and all the 9 (=3x3) combinations of the three levels of the two drugs are tested. The following general linear model might have been used:</p>
<table border="0" width="100%">
<tbody><tr><td><table align="center">
<tbody><tr><td align="center" nowrap="nowrap">Y<sub>ij</sub> = A + B X + C<sub>i</sub> + D<sub>j</sub> + R<sub>ij</sub> + N; i,j = 1,2,3;</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p>where Y is survival time (response), i and j correspond to the three levels of drug I and drug II respectively, X is age, C<sub>i</sub> are additive effects (called "main effects") of each level of drug I, D<sub>j</sub> are main effects of drug II, R<sub>ij</sub> are non-additive effects (called <a href="http://www.statistics.com/index.php?page=glossary&term_id=464">interaction effect</a>s or simply "interactions") of drugs I and II, N is random deviation.</p>
<p>We have here three independent variables: two discrete factors - "drug I" and "drug II" with three levels each, and a continuous covariate "age".</p>
<p><font face="verdana" size="-1">In this particular case, because each of the two factors (drugs) has a zero level i,j=1 ("not used"), main effects C<sub>1</sub>, D<sub>1</sub>, and interactions R<sub>1j</sub>, j=1,2,3; R<sub>i1</sub>, i=1,2,3 are zeros. The remaining unknown coefficients - A, B, C<sub>i</sub>, D<sub>j</sub>, R<sub>ij</sub> - are estimated from the data. The main effects C<sub>i</sub>, D<sub>j</sub> of the two drugs and their interaction effects R<sub>ij</sub> are of primary interest. For example, their positive values would indicate a positive effect - longer survival time due to use of the drug(s).</font></p>
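<p>One way to see which coefficients are actually free in the drug example is to write out the indicator-coded row that multiplies them. The sketch below assumes the usual indicator (dummy) coding with level 1 ("not used") as the reference level, so every term involving level 1 is fixed at zero; the function names <code>design_row</code> and <code>expected_survival</code> are hypothetical, and no coefficient values are implied.</p>

```python
def design_row(age, i, j):
    """Row of regressors for Y_ij = A + B*X + C_i + D_j + R_ij.

    i and j are the levels (1, 2, 3) of drug I and drug II; level 1 means
    "not used", so its main effects and interactions are fixed at zero and
    get no column. Column order: A, B, C_2, C_3, D_2, D_3, R_22, R_23, R_32, R_33.
    """
    row = [1.0, float(age)]                               # intercept A, age slope B
    row += [1.0 if i == k else 0.0 for k in (2, 3)]       # main effects of drug I
    row += [1.0 if j == k else 0.0 for k in (2, 3)]       # main effects of drug II
    row += [1.0 if (i, j) == (k, l) else 0.0
            for k in (2, 3) for l in (2, 3)]              # interactions
    return row

def expected_survival(coefs, age, i, j):
    """Model value (without the random deviation N) for given coefficients."""
    return sum(c * x for c, x in zip(coefs, design_row(age, i, j)))
```

<p>Under this coding there are ten coefficients to estimate: A, B, two main effects per drug, and four interactions. For instance, <code>design_row(60, 3, 2)</code> switches on the columns for C<sub>3</sub>, D<sub>2</sub> and R<sub>32</sub> only.</p>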
<p><font face="verdana" size="-1">20.</font> <b>Logistic Regression: </b> Logistic regression is used with binary data when you want to model the probability that a specified outcome will occur. Specifically, it is aimed at estimating parameters a and b in the following model:</p>
<table border="0" width="100%">
<tbody><tr><td><table align="center">
<tbody><tr><td align="center" nowrap="nowrap">L<sub>i</sub> = log</td>
<td align="center" nowrap="nowrap"> p<sub>i</sub><div class="hrcomp"><hr noshade="noshade" size="1"/></div>
1<font face="symbol">-</font>p<sub>i</sub></td>
<td align="center" nowrap="nowrap">= a + b x<sub>i</sub>,</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p>where p<sub>i</sub> is the probability of a success for given value x<sub>i</sub> of the explanatory variable X.</p>
<p>Use of the log of the odds p/(1-p) (the logit) guarantees that the predicted value of p will always be between 0 and 1.</p>
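<p>To make the last point concrete, the logit can be inverted to recover p from a + b x. The sketch below uses hypothetical helper names (<code>logit</code>, <code>predicted_probability</code>) and arbitrary illustrative values of a and b; it does not estimate the parameters from data.</p>

```python
import math

def logit(p):
    """Log of the odds p/(1-p), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def predicted_probability(a, b, x):
    """Invert L = a + b*x to get p = 1 / (1 + exp(-L))."""
    L = a + b * x
    # Two algebraically equal forms, chosen to avoid overflow in exp().
    if L >= 0:
        return 1.0 / (1.0 + math.exp(-L))
    e = math.exp(L)
    return e / (1.0 + e)

# Illustrative coefficients only (not fitted to anything).
a, b = -1.0, 0.8
probabilities = [predicted_probability(a, b, x) for x in range(-5, 6)]
```

<p>However large or small a + b x becomes, the inverse logit maps it back into the interval from 0 to 1, which is why the model is written on the log-odds scale.</p>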
<p>See also: <a href="http://www.statistics.com/index.php?page=glossary&term_id=482">Regression analysis</a>.</p>
<p><font face="verdana" size="-1">2<font size="-1">9. <font face="verdana" size="-1"><font face="verdana" size="-1"><b>Naive Bayes Classification: </b></font></font></font></font> The Naive Bayes method is a method of classification applicable to categorical data, based on <a href="http://www.statistics.com/index.php?page=glossary&term_id=211">Bayes theorem</a> . For a record to be classified, the categories of the predictor variables are noted and the record is classified according to the most frequent class among the same values of those predictor variables in the training set. A rigorous application of the Bayes theorem would require availability of all possible combinations of the values of the predictor variables. When the number of variables is large enough, this requires a <a href="http://www.statistics.com/index.php?page=glossary&term_id=864">training set</a> of unrealistically large size (and, indeed, even a huge training set is unlikely to cover all possible combinations). The naive Bayes method overcomes this practical limitation of the rigorous Bayes approach to classification.</p>
<p>The major idea of the naive Bayes is to use the assumption that predictor variables are <a href="http://www.statistics.com/index.php?page=glossary&term_id=574">independent random variables</a> . This assumption makes it possible to compute probabilities required by the Bayes formula from a relatively small training set.</p>
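<p>Under the independence assumption, the score for each class is just the class frequency multiplied by one per-variable frequency for each predictor. A minimal sketch for categorical data follows. The add-one smoothing is an assumption added here so that an unseen value does not force a score of zero; the function names and the toy records are invented for the example.</p>

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    """Count class frequencies and per-class value frequencies for each predictor."""
    class_counts = Counter(labels)
    # value_counts[(j, c)][v]: class-c training records whose j-th predictor equals v
    value_counts = defaultdict(Counter)
    levels = defaultdict(set)  # distinct values seen for each predictor
    for rec, c in zip(records, labels):
        for j, v in enumerate(rec):
            value_counts[(j, c)][v] += 1
            levels[j].add(v)
    return class_counts, value_counts, levels

def classify(model, record):
    """Return the class with the largest P(class) * product of P(value | class)."""
    class_counts, value_counts, levels = model
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total
        for j, v in enumerate(record):
            # Add-one smoothing (an assumption of this sketch).
            score *= (value_counts[(j, c)][v] + 1) / (n_c + len(levels[j]))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy training set (illustrative): (outlook, temperature) -> label.
records = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
           ("rainy", "cool"), ("sunny", "cool"), ("rainy", "hot")]
labels = ["no", "no", "yes", "yes", "yes", "no"]
model = train_naive_bayes(records, labels)
```

<p>Because each predictor is counted separately, the training set only needs to cover the values of each variable, not every combination of values.</p>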
<p>31. <b>Principal components analysis: </b> The purpose of principal component analysis is to derive a small number of linear combinations (principal components) of a set of variables that retain as much of the information in the original variables as possible. This technique is often used when there are large numbers of variables, and you wish to reduce them to a smaller number of variable combinations by combining similar variables (ones that contain much the same information).</p>
<p>Principal components are linear combinations of variables that retain the maximal amount of information about the variables. The term "maximal amount of information" here means the best least-squares fit, or, in other words, maximal ability to explain the variance of the original data.</p>
<p><font face="verdana" size="-1">In technical terms, a principal component for a given set of N-dimensional data, is a linear combination of the original variables with coefficients equal to the components of an eigenvector of the correlation or covariance matrix. Principal components are usually sorted by descending order of the eigenvalues - i.e. the first principal component corresponds to the eigenvector with the maximal eigenvalue.</font></p>
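<p>The eigenvector description can be illustrated with a small sketch that finds only the first principal component of the covariance matrix. It uses power iteration, which is a choice made for this example to stay within the standard library, not something the definition prescribes; the function names and toy data are likewise illustrative. Power iteration assumes the starting vector is not orthogonal to the leading eigenvector.</p>

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(rows, iterations=200):
    """Return (eigenvalue, unit eigenvector) for the largest eigenvalue."""
    cov = covariance_matrix(rows)
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # arbitrary unit starting vector
    for _ in range(iterations):
        # Multiply by the covariance matrix and renormalize.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' C v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigenvalue, v

# Toy data (illustrative): two variables that carry much the same information.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
eigenvalue, direction = first_principal_component(rows)
cov = covariance_matrix(rows)
explained = eigenvalue / (cov[0][0] + cov[1][1])  # share of total variance
```

<p>The coefficients in <code>direction</code> are the weights of the linear combination, and <code>explained</code> is the share of the total variance that this single component accounts for.</p>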
<p>33. <b>Six-Sigma:</b> <font face="verdana" size="-1">Six sigma means literally six <a href="http://www.statistics.com/index.php?page=glossary&term_id=357">standard deviations</a>. The phrase refers to the limits drawn on statistical process control charts used to plot statistics from samples taken regularly from a production process. Consider the process mean. A process is deemed to be "in control" at any given point in time if the mean of the sample at that time is within six standard deviations of the overall process mean to that point. In this case, "standard deviation" means the standard deviation of the sample mean. Six sigmas (= six standard deviations) is a very broad range, and the use of six-sigmas, rather than 3-sigmas, was popularized by Motorola. It poses substantial demands on the manufacturing process to limit variability of output so that a six-sigma-wide band lies within the limits of an acceptable process.</font></p>
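<p>The "in control" rule described above can be written out as a short check. This sketch assumes the process mean and the standard deviation of individual measurements are already known and that measurements are independent, so the standard deviation of a sample mean is the process standard deviation divided by the square root of the sample size. The function names and numbers are illustrative.</p>

```python
import math

def control_limits(process_mean, process_sd, sample_size, n_sigmas=6.0):
    """Lower and upper limits for the mean of a sample of the given size."""
    # Standard deviation of the sample mean (assumes independent measurements).
    se = process_sd / math.sqrt(sample_size)
    return process_mean - n_sigmas * se, process_mean + n_sigmas * se

def in_control(sample, process_mean, process_sd, n_sigmas=6.0):
    """True if the sample mean falls within the control limits."""
    lower, upper = control_limits(process_mean, process_sd, len(sample), n_sigmas)
    sample_mean = sum(sample) / len(sample)
    return lower <= sample_mean <= upper

# Illustrative numbers: process mean 10, standard deviation 2, samples of 4.
limits_six = control_limits(10.0, 2.0, 4)
limits_three = control_limits(10.0, 2.0, 4, n_sigmas=3.0)
```

<p>Passing <code>n_sigmas=3.0</code> gives the narrower 3-sigma band, which makes the contrast with the six-sigma band easy to see on the same data.</p>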
<p>Here is our Glossary: <a href="http://www.statistics.com/resources/glossary/" target="_blank">http://www.statistics.com/resources/glossary/</a></p>
<p></p> Correlation - emulating human…tag:www.analyticbridge.datasciencecentral.com,2012-11-18:2004291:Comment:2231622012-11-18T15:44:58.087ZCarl Wimmerhttps://www.analyticbridge.datasciencecentral.com/profile/CarlWimmer
<p>Correlation - emulating human behaviour, correlation allows a user to traverse the complete knowledge payload of any corpus, producing exhaustive sets of pathways in response to N-Dimensional Queries.</p>