Comments - Is E Miner better than Base SAS? - AnalyticBridge<br />
Feed updated 2020-04-05T05:18:01Z: https://www.analyticbridge.datasciencecentral.com/profiles/comment/feed?attachedTo=2004291%3ABlogPost%3A39174&xn_auth=no<br />
<br />
Comment by Oleg Danilchenko, 2010-01-26T07:42:14.646Z (https://www.analyticbridge.datasciencecentral.com/profile/OlegDanilchenko)<br />
Comparison between SAS Enterprise Miner and SAS/STAT<br />
<br />
The first major difference between EM and STAT is that EM is GUI-driven and needs no programming effort. This enhances productivity by saving the time otherwise spent writing and debugging code.<br />
More specifically, the following functionalities, among others, are unique to EM:<br />
- Decision Trees for exploratory data analysis and predictive modeling,<br />
- Neural Network models<br />
- Identifying optimal Power Transformations for predictive modeling<br />
- Kohonen Networks or Self Organizing Maps.<br />
- Expectation Maximisation Clustering<br />
- Market Basket Analysis (Association Analysis)<br />
- Memory Based Reasoning<br />
- Variable selection based on chi-square or R-squared<br />
- Text Mining<br />
- Link Analysis<br />
- Time Series Analysis<br />
- Reporter, providing an audit trail of your data mining project<br />
- Missing value replacement<br />
- Ensemble modeling including bagging, boosting and combined models<br />
- Profitability modeling for categorical targets<br />
- Multiple model assessment<br />
- Multiple target modeling<br />
- Multistage modeling (combining response models with e.g. profit or revenue models)<br />
- Automatic scoring code generation (in BASE SAS or C)<br />
- Model (regression, neural network, tree) tuning using validation data, which results in a large improvement in efficiency<br />
<br />
Credit Scoring Solution:<br />
- Interactive Grouping<br />
- Scorecard Scaling<br />
- Reject Inference node<br />
- Scorecard Analysis<br />
<br />
<br />
SAS/STAT is primarily targeted to quantitative analysts who possess solid statistical skills and are comfortable working in a programming environment. EM is targeted to both the quantitative and business analyst with a major goal of increasing the productivity of the less experienced analyst through the guided SEMMA based process flow GUI.<br />
The STAT user typically needs to choose the appropriate options from a large host of options. EM also provides numerous options for the quantitative analyst to fine tune the modeling process, but the tools are preset with intelligent defaults to enable the business analyst to obtain at least fair to good results.<br />
<br />
EM has been designed to help bridge the gap between statistics and business intelligence. For example, the EM modeling algorithms automatically monitor performance on the validation ‘hold out’ data to prevent overfitting the training data, which is especially helpful for the less seasoned modeler. With SAS/STAT, the user is fully responsible for steering the analysis in the right direction: the STAT user must ensure that the model generalizes well to new hold-out data, and he or she often relies on personal experience together with traditional statistical diagnostics for model selection.<br />
<br />
EM is meant to be a productivity tool for analyzing large volumes of data. EM process flows can be shared, modified, applied to new data, and are self documenting. Sampling, data transformations, and filtering are typically done using base SAS code for subsequent analysis in STAT. These tasks are automated in EM through the Sampling, Data Partition, Transformation, and Filtering nodes. The STAT STDIZE procedure does support data imputation but it does not provide the decision tree imputation method of the EM Data Replacement node.<br />
<br />
EM nodes can be configured in numerous ways to support flexible, interactive pattern discovery in high-dimensional data sets.<br />
<br />
Another EM productivity gain worth highlighting separately is that EM automatically captures the complete scoring code for all stages of model development in the SAS, C, and Java languages. Scoring is the end result of the mining process, and EM makes it easy for the analyst to insert the code into operational scoring systems without having to be concerned about the errors that can arise from manual conversion. Very few STAT procedures have scoring options, and there is no translation facility for STAT to provide score code in C or Java. EM Credit Scoring nodes can further transform regression output into scaled, points-based scorecards for easy implementation in operational systems.<br />
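The scorecard scaling step mentioned above can be sketched as follows. This is only a minimal illustration of the common "points to double the odds" convention, not the code that the EM Credit Scoring nodes generate; the function name `scale_score` and the default values (600 points at odds of 50:1, 20 points to double the odds) are assumptions chosen for the example.

```python
import math

def scale_score(log_odds, base_score=600.0, base_odds=50.0, pdo=20.0):
    """Map a model's log-odds of being 'good' to scorecard points.

    score = offset + factor * ln(odds), with
    factor = pdo / ln(2) and offset = base_score - factor * ln(base_odds).
    All default values are illustrative assumptions.
    """
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    return offset + factor * log_odds

# At the base odds the score equals the base score,
# and doubling the odds adds `pdo` points.
at_base = scale_score(math.log(50.0))
at_double = scale_score(math.log(100.0))
```

Because the mapping is linear in the log-odds, the scaled score ranks applicants exactly as the regression does; only the units change.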
<br />
EM enables the analyst to generate numerous models and compare the modeling results simultaneously in one single easy to interpret graphical framework. Assessment charts include Profit, Receiver Operating Characteristic (ROC) curve, Diagnostic Classification, Threshold-based Charts, Predicted Plots, Gains Charts, and Lorenz Curves. The ability to do comparative assessment of many different models in order to pick the best one is not provided in STAT.<br />
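As a rough illustration of comparative assessment, the sketch below computes the area under the ROC curve for two candidate models on the same hold-out data and keeps the better one. It is a hand-rolled example under stated assumptions, not what EM runs internally: the labels, the scores, and the names `roc_auc`, `model_a`, and `model_b` are all hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    labels: 1 for the event, 0 for the non-event.
    scores: model scores, higher meaning 'more likely an event'.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    # Count pairs where the event outscores the non-event; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical hold-out labels and scores from two candidate models.
labels = [1, 1, 0, 1, 0, 0]
model_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
model_b = [0.6, 0.4, 0.5, 0.3, 0.7, 0.2]

auc_a = roc_auc(labels, model_a)
auc_b = roc_auc(labels, model_b)
best = "model_a" if auc_a >= auc_b else "model_b"
```

The same loop extends to any number of models, which is the kind of side-by-side comparison a STAT user would have to code by hand.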
<br />
The EM client/server architecture allows users to employ large UNIX processors or Mainframes to access and analyze enormous data sources.<br />
<br />
The DMTOOL class allows you to write your own EM node and add it to the tool palette.<br />
<br />
Metadata attributes for the variables are defined automatically, which is especially helpful with large data sets.<br />
<br />
Integration with SAS Warehouse Administrator dramatically reduces data preparation time (data preparation estimated to be 80% of the effort in data analysis).<br />
<br />
In terms of credit scorecard development, EM is a risk reduction tool for companies who bring credit scorecard development in-house. Training replacement staff to develop scorecards with EM would take about 1-3 months compared to 8-12 months with programming intensive tools. This is a significant time saver, allowing staff to become productive faster. EM’s time savings also mean that staff can focus on strategy development and analysis rather than programming. This in turn delivers better strategies and lowers losses.<br />
<br />
• Standardised and structured approach facilitates interaction between scorecard developers
• Graphical representation supports interactive modelling process
• Enables exchangeability of modellers and modelling teams
• Reduced training needs for model developers (statistical knowledge given)
• Benchmarking of linear against non-linear techniques
• Automated classing reduces time
• Monitoring documentation reports are standardised and automated
• Time-saving due to automatically generated tables and graphs
• Scorecard development process seems to be much faster
• Good overview, structured and conveniently arranged approach
• No programming - principally - needed
• Automated classification of variables
• Usability of existing macros

Comment by Oleg Danilchenko, 2010-01-26T07:38:47.498Z (https://www.analyticbridge.datasciencecentral.com/profile/OlegDanilchenko)<br />
There are many definitions of data mining: the discovery of spurious relationships in the data; automated data<br />
analysis; examination of large data sets; exploratory data analysis. Early methods of data mining included stepwise<br />
regression, cluster analysis, and discriminant analysis. Data mining should be thought of as a process that includes<br />
data cleaning, investigation of the data using models, and validation of potential results. In particular, practical<br />
significance should be emphasized over statistical significance. Also, decision making and the cost of<br />
misclassification are important. SAS/Stat contains methods that can be used to investigate data using a data mining<br />
process. These methods can complement those developed specifically for Enterprise Miner, and can be used in<br />
conjunction with Enterprise Miner. This paper will examine data mining in SAS/Stat, contrasting it with Enterprise<br />
Miner.<br />
<br />
<br />
The term “data mining” has now come into public use with little understanding of what it is or does. In the past,<br />
statisticians have thought little of data mining because data were examined without the final step of model validation.<br />
Data mining differs from standard statistical practice in that the process:<br />
- Assumes large data samples<br />
- Assumes large numbers of variables<br />
- Has validation as a routine, automatic part of the process<br />
- Examines the cost of misclassification<br />
- Emphasizes decision-making over inference<br />
And yet there remains overlap between statistics and data mining in both technique and practice. Since many of the<br />
techniques, such as cluster analysis, overlap both statistics and data mining, it is the purpose of this paper to<br />
examine some of the differences and similarities in the use of these overlapping techniques from a statistical<br />
perspective versus a data mining perspective.<br />
To demonstrate the different techniques, a dataset consisting of responses to a survey concerning student<br />
expectations in mathematics courses was used throughout. A list of questions asked in the survey is given in the<br />
appendix. The data were coded ordinally and variable names were coded to represent the questions. The data<br />
analysis was performed to investigate the relationship between variables and responses in the dataset. A total of 192<br />
responses were collected from this survey.<br />
<br />
<br />
CLUSTER ANALYSIS<br />
As a data mining process, clustering is considered to be unsupervised learning, meaning that there is no specific<br />
outcome variable. Clustering involves the following process:<br />
- Group observations or group variables<br />
- Choice of type: hierarchical or non-hierarchical<br />
- Number of clusters<br />
- Cluster identity<br />
- Validation<br />
<br />
Hierarchical clustering involves the grouping of observations based upon the distance between observations.<br />
Distance can be defined using different criteria available in PROC CLUSTER. The clusters are built based on a<br />
hierarchical tree structure. The closest observations are grouped together initially followed by the next closest match.<br />
The other method of clustering is based upon the selection of random seed values as the centers of spheres<br />
containing all observations closest to that center (PROC FASTCLUS). Then the centers are re-defined based upon<br />
the outcomes. In Enterprise Miner, PROC FASTCLUS is used to perform clustering. This is the same procedure<br />
available in SAS/Stat. SAS/Stat has the additional hierarchical clustering techniques available. The variables in the<br />
dataset dealing with preferences for mathematics subject were first clustered in SAS/Stat using the hierarchical<br />
procedure in PROC CLUSTER. Several distance criteria were used for comparison purposes. In addition, course<br />
level (200, 300, 400, 500 and above) was included.<br />
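The seed-based method described above (assign each observation to its nearest center, then re-define the centers) can be sketched as below. This is a plain k-means loop written for illustration; it is not PROC FASTCLUS itself, which has its own seed selection and options. The function name `kmeans`, the fixed seeds, and the toy points are assumptions.

```python
def kmeans(points, seeds, n_iter=10):
    """Assign points to the nearest center, then recompute the centers."""
    centers = [tuple(s) for s in seeds]
    assignment = []
    for _ in range(n_iter):
        # Assignment step: index of the closest center (squared Euclidean).
        assignment = [
            min(range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center
        if new_centers == centers:
            break  # centers stopped moving
        centers = new_centers
    return centers, assignment

# Toy data with two well-separated groups and two hypothetical seeds.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, assignment = kmeans(points, seeds=[(0.0, 0.0), (10.0, 10.0)])
```

Unlike the hierarchical approach, the number of clusters must be chosen up front, here through the number of seeds.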
<br />
<br />
<br />
CLASSIFICATION AND PREDICTIVE MODELING<br />
Classification is considered to be supervised data mining since it is possible to compare a predicted value to an<br />
actual value. In SAS/Stat, discriminant analysis and logistic regression are the primary means of classification. In Enterprise<br />
Miner, neural networks, decision trees, and logistic regression are used. However, the two components of SAS<br />
approach classification with different perspectives. In Enterprise Miner, the dataset is large enough to partition it so<br />
that the classification model can be validated through the use of a holdout sample. Misclassification is one of the<br />
means of determining the strength of the model. Another is to define a profit/loss function to determine the cost (or<br />
benefit) of a correct or incorrect classification.<br />
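A profit/loss function of this kind can be sketched as follows. The profit matrix values, the "mail"/"skip" decisions, and the function names are hypothetical and chosen only to show the mechanics of picking the decision with the highest expected profit.

```python
# Hypothetical profit matrix: PROFIT[actual][decision], in arbitrary units.
PROFIT = {
    "responder":     {"mail": 10.0, "skip": 0.0},
    "non_responder": {"mail": -1.0, "skip": 0.0},
}

def expected_profit(p_responder, decision):
    """Expected profit of a decision given the predicted response probability."""
    return (p_responder * PROFIT["responder"][decision]
            + (1.0 - p_responder) * PROFIT["non_responder"][decision])

def best_decision(p_responder):
    """Pick the decision with the highest expected profit."""
    return max(("mail", "skip"),
               key=lambda d: expected_profit(p_responder, d))

likely_enough = best_decision(0.20)   # expected profit of mailing is positive
too_unlikely = best_decision(0.05)    # expected profit of mailing is negative
```

With these numbers the break-even probability is 1/11, so a case with a 20% predicted response is worth acting on even though a pure misclassification criterion would label it a non-responder.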
In SAS/Stat, datasets are often small so that partitioning is not possible. The strength of a logistic regression is<br />
measured by the odds ratios, the receiver operating characteristic (ROC) curve, and the p-value. Similarly, the strength of discriminant<br />
analysis is measured by the proportion of correct classifications without the use of a holdout sample, although there<br />
is an option for cross validation that is not available for logistic regression.<br />
<br />
<br />
Neural networks act like “black boxes” in that the model is not presented in the nice, concise format that<br />
regression provides. A network's accuracy is examined in a way similar to the diagnostics of the regression curve. The simplest neural<br />
network contains a single input (independent variable) and a single target (dependent variable) with a single output<br />
unit. It increases in complexity with the addition of hidden layers and additional input variables. With no hidden<br />
layers, the results of a neural network analysis will resemble those of regression. Each input variable is<br />
connected to each unit in the hidden layer, and each hidden unit is connected to each outcome<br />
variable. The hidden units combine inputs and apply a function to predict outputs. The functions applied in hidden layers are often<br />
nonlinear.<br />
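The structure described above can be sketched as a forward pass. This is a minimal illustration, assuming a tanh function in the hidden units and a logistic output unit; those choices, the function names, and the weights are assumptions, not Enterprise Miner's defaults.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_weights, hidden_biases, out_weights, out_bias):
    """Forward pass of a network with one hidden layer and one output unit."""
    # Each hidden unit combines all inputs, then applies a nonlinear function.
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
        for ws, b in zip(hidden_weights, hidden_biases)
    ]
    # The output unit combines all hidden units.
    return logistic(sum(w * h for w, h in zip(out_weights, hidden)) + out_bias)

def forward_no_hidden(x, weights, bias):
    """With no hidden layer the network reduces to logistic regression."""
    return logistic(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical weights for two inputs and two hidden units.
y_hidden = forward([1.0, 2.0],
                   hidden_weights=[[0.5, -0.3], [0.1, 0.8]],
                   hidden_biases=[0.0, -0.5],
                   out_weights=[1.2, -0.7],
                   out_bias=0.1)
y_plain = forward_no_hidden([1.0, 2.0], weights=[0.5, -0.3], bias=0.0)
```

The "black box" character comes from the hidden layer: the weights in `forward` have no direct interpretation, whereas the weights in `forward_no_hidden` are ordinary regression coefficients.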
Decision trees provide a completely different approach to the problem of classification. The decision tree develops a<br />
series of if…then rules. Each rule assigns an observation to one segment of the tree, at which point there is another<br />
if…then rule applied. The initial segment, containing the entire dataset, is the root node for the decision tree. The<br />
final nodes are called leaves. An intermediate node plus all its successors forms a branch of the tree. The<br />
final leaf containing an observation determines its predicted value. Unlike neural networks and regression, decision trees will<br />
not work with interval data. They will work with nominal outcomes that have more than two possible results. Decision<br />
trees will also work with ordinal outcome variables.<br />
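The series of if…then rules can be sketched as a small nested structure that is walked from the root to a leaf. The tree below is hand-written for illustration and is not fitted to the survey data; the variable names `course_level` and `study_hours`, the thresholds, and the leaf labels are all hypothetical.

```python
# A hypothetical tree: internal nodes hold a rule, leaves hold a prediction.
TREE = {
    "rule": ("course_level", 400),       # if course_level < 400 ...
    "left": {
        "rule": ("study_hours", 5),      # ... then if study_hours < 5
        "left": {"leaf": "low"},
        "right": {"leaf": "medium"},
    },
    "right": {"leaf": "high"},
}

def predict_tree(node, observation):
    """Follow if...then rules from the root node until a leaf is reached."""
    while "leaf" not in node:
        variable, threshold = node["rule"]
        if observation[variable] < threshold:
            node = node["left"]
        else:
            node = node["right"]
    return node["leaf"]

prediction = predict_tree(TREE, {"course_level": 200, "study_hours": 8})
```

Each observation ends in exactly one leaf, and the leaf's label (here one of three ordinal levels) is the prediction.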
<br />
<br />
Interestingly, logistic regression has the lowest misclassification rate on the initial training set but the highest<br />
misclassification rate on the testing set. The results suggest that logistic regression tends to inflate results. Similarly,<br />
the receiver operating characteristic (ROC) curve is given in Figure 7 for three models. Note also that the data sample is partitioned into<br />
three sets instead of two. Predictive modeling in Enterprise Miner is iterative. The initial result is compared to the<br />
validation set and adjustments are made to the model. Once all adjustments are completed, the final testing set is used<br />
to validate the results.<br />
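The three-way partition can be sketched as below: a training set to fit the model, a validation set to adjust it, and a testing set held back for the final check. The 40/30/30 proportions, the seed, and the function name `partition` are illustrative assumptions; only the count of 192 responses comes from the text.

```python
import random

def partition(rows, train=0.4, validate=0.3, seed=12345):
    """Shuffle rows and split them into training, validation, and test sets.

    The 40/30/30 split and the seed are illustrative defaults.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * validate)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

# 192 row indices, matching the number of survey responses.
train_set, valid_set, test_set = partition(range(192))
```

The testing set plays no part in fitting or tuning, which is why a model that looks best on the training set can still turn out worst on the testing set.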
<br />
<br />
DATA VISUALIZATION<br />
With large datasets, data visualization becomes an important part of exploring in data mining. Version 4.3 of<br />
Enterprise Miner included a node for SAS/Insight, and included all graphics within Insight. Version 5.1 removed the<br />
SAS/Insight Node, adding a StatExplore Node.<br />
However, neither yet provides the point-and-click graphics that are readily available in SAS Enterprise<br />
Guide.<br />
There is, however, one set of graphics available in SAS/Stat that are not available in any other component of SAS.<br />
That procedure, PROC KDE, allows the investigator to overlay smoothed histograms to examine data. It is available<br />
for interval data only. In addition to questions concerning preferences for mathematics, students were asked to<br />
estimate the number of hours per week they expected to study for their mathematics courses. In Enterprise Miner, it<br />
is possible to use histograms to examine the data. Figure 8 was provided in Enterprise Miner using the StatExplore<br />
Node. Figure 9 shows the corresponding smoothed histogram as provided by SAS/Stat.<br />
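A smoothed histogram of this kind is a kernel density estimate. The sketch below shows the idea with a Gaussian kernel and a hand-picked bandwidth; it is not PROC KDE, and the study-hours values, the bandwidth, and the function name `gaussian_kde` are hypothetical.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function estimating the density of `data` at a point x."""
    n = len(data)
    norm = n * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        # Sum one Gaussian bump per observation, centered on that observation.
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                   for xi in data) / norm

    return density

# Hypothetical weekly study hours, not the survey data.
hours = [2, 3, 3, 4, 5, 5, 6, 8, 10, 12]
density = gaussian_kde(hours, bandwidth=1.5)

# Riemann sum over a wide grid; a density should integrate to about 1.
grid = [i * 0.1 for i in range(-100, 301)]
area = sum(density(x) for x in grid) * 0.1
```

Evaluating `density` over a grid for two groups and plotting both curves on the same axes gives the overlay that the text describes.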
<br />
<br />
CONCLUSION<br />
There are many similar procedures in SAS/Stat and Enterprise Miner that can be used for similar needs. However,<br />
the process generally differs because Enterprise Miner routinely partitions data to validate models and to optimize the choice of<br />
model. Although SAS/Stat contains many similar models, it does not have the partitioning process readily available,<br />
although it can be coded into the process. There are some exploratory techniques built into SAS/Stat, such as<br />
PROC KDE, that can complement Enterprise Miner. In addition, partitioning and imputation steps can be performed<br />
in Enterprise Miner, and the data then analyzed using SAS/Stat techniques.<br />
<br />
Comment by John A Morrison, 2009-04-09T08:56:19.213Z (https://www.analyticbridge.datasciencecentral.com/profile/JohnAMorrison)<br />
brilliantly put Vincent!<br />
<br />
Comment by Vincent Granville, 2009-03-06T08:07:27.494Z (https://www.analyticbridge.datasciencecentral.com/profile/VincentGranville)<br />
You can write decision trees in SAS Base with SAS macros. Not sure I'd want to do it though. But SAS EM is expensive - more expensive than hiring a PhD statistician with 10 years of experience. I believe the best option for decision trees is to write your own code in a language such as C/C++ or Perl. It's pretty easy, and it offers tremendous flexibility, such as producing 200 decision trees each with 12 nodes (good), rather than one tree with 2,400 nodes (bad).
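The "many small trees instead of one big tree" idea can be sketched as below. This is not Vincent's code: it is written in Python rather than C/C++ or Perl, it uses the smallest possible tree (a single split, or stump) in place of a 12-node tree, and it combines the trees by bootstrap sampling and majority vote, which is one assumption about how such an ensemble might be built. All names and data are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Find the threshold on a single input that minimizes training errors."""
    best = None
    for t in sorted(set(xs)):
        for above in (0, 1):  # class predicted when x >= t
            errors = sum((above if x >= t else 1 - above) != y
                         for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, above)
    _, t, above = best
    return lambda x: above if x >= t else 1 - above

def fit_ensemble(xs, ys, n_trees=25, seed=0):
    """Fit each small tree on a bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict_vote(trees, x):
    """Majority vote across the small trees."""
    votes = sum(tree(x) for tree in trees)
    return 1 if 2 * votes > len(trees) else 0

# Toy data: class 1 for larger x.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
trees = fit_ensemble(xs, ys)
low = predict_vote(trees, 1)
high = predict_vote(trees, 10)
```

Each individual tree stays simple enough to inspect, while the vote smooths out the quirks of any single bootstrap sample.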