There are many definitions of data mining: the discovery of spurious relationships in the data; automated data
analysis; examination of large data sets; exploratory data analysis. Early methods of data mining included stepwise
regression, cluster analysis, and discriminant analysis. Data mining should be thought of as a process that includes
data cleaning, investigation of the data using models, and validation of potential results. In particular, practical
significance should be emphasized over statistical significance. Decision making and the cost of
misclassification are also important. SAS/Stat contains methods that can be used to investigate data using a data mining
process. These methods can complement those developed specifically for Enterprise Miner, and can be used in
conjunction with Enterprise Miner. This paper will examine data mining in SAS/Stat, contrasting it with Enterprise Miner.
The term “data mining” has now come into public use with little understanding of what it is or does. In the past,
statisticians have thought little of data mining because data were examined without the final step of model validation.
Data mining differs from standard statistical practice in that the process:
- Assumes large data samples
- Assumes large numbers of variables
- Has validation as a routine, automatic part of the process
- Examines the cost of misclassification
- Emphasizes decision-making over inference
And yet there remains overlap between statistics and data mining in both technique and practice. Since many of the
techniques, such as cluster analysis, overlap both statistics and data mining, it is the purpose of this paper to
examine some of the differences and similarities in the use of these overlapping techniques from a statistical
perspective versus a data mining perspective.
To demonstrate the different techniques, a dataset consisting of responses to a survey concerning student
expectations in mathematics courses was used throughout. A list of questions asked in the survey is given in the
appendix. The data were coded ordinally and variable names were coded to represent the questions. The data
analysis was performed to investigate the relationship between variables and responses in the dataset. A total of 192
responses were collected from this survey.
As a data mining process, clustering is considered to be unsupervised learning, meaning that there is no specific
outcome variable. Clustering involves the following choices:
- Whether to group observations or group variables
- The type of clustering: hierarchical or non-hierarchical
- The number of clusters
Hierarchical clustering involves the grouping of observations based upon the distance between observations.
Distance can be defined using different criteria available in PROC CLUSTER. The clusters are built based on a
hierarchical tree structure. The closest observations are grouped together initially followed by the next closest match.
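As a rough illustration of the hierarchical approach just described, the following Python sketch repeatedly merges the two closest clusters until a requested number remain. It is not PROC CLUSTER itself: single linkage with Euclidean distance is assumed here as only one of several possible distance criteria, and the function name and sample points are hypothetical.

```python
import math
from itertools import combinations


def single_linkage(points, n_clusters):
    """Agglomerative clustering sketch: merge the two closest clusters
    until only n_clusters remain (single linkage, Euclidean distance)."""
    clusters = [[i] for i in range(len(points))]  # each observation starts alone
    merges = []  # history of merges, i.e. the tree from the bottom up
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Single linkage: distance between the closest pair of members.
            d = min(math.dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters, merges


# Hypothetical two-dimensional observations.
points = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (5.0, 6.0), (9.0, 1.0)]
clusters, merges = single_linkage(points, 3)
```

The `merges` list records which groups were joined at each step and at what distance, which is the information a tree diagram of the clustering would display.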
The other method of clustering is based upon the selection of random seed values as the centers of spheres
containing all observations closest to that center (PROC FASTCLUS). The centers are then redefined based upon
the resulting assignments. In Enterprise Miner, PROC FASTCLUS is used to perform clustering; this is the same procedure
available in SAS/Stat. SAS/Stat additionally offers hierarchical clustering techniques. The variables in the
dataset dealing with preferences for mathematics subjects were first clustered in SAS/Stat using the hierarchical
procedure in PROC CLUSTER. Several distance criteria were used for comparison purposes. In addition, course
level (200, 300, 400, and 500 and above) was included.
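The seed-based method described above can be sketched in Python as follows. This is a generic k-means-style loop written for illustration, not the PROC FASTCLUS implementation; the function names, the fixed random seed, and the sample points are all assumptions.

```python
import math
import random


def nearest(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def seed_clustering(points, k, max_iter=100, rng=None):
    """Pick k random seeds, assign each point to its nearest center,
    then redefine the centers from the assignments until they stop moving."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster happens to be empty.
            new_centers.append(mean(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    return labels, centers


# Hypothetical observations forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
labels, centers = seed_clustering(points, 2)
```

Unlike the hierarchical method, the number of clusters must be chosen before the loop starts, and the result can depend on which seeds are drawn.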
CLASSIFICATION AND PREDICTIVE MODELING
Classification is considered to be supervised data mining since it is possible to compare a predicted value to an
actual value. In SAS/Stat, discriminant analysis and logistic regression are the primary means of classification. In Enterprise
Miner, neural networks, decision trees, and logistic regression are used. However, the two components of SAS
approach classification with different perspectives. In Enterprise Miner, the dataset is typically large enough to be partitioned so
that the classification model can be validated through the use of a holdout sample. The misclassification rate is one
means of determining the strength of the model. Another is to define a profit/loss function to determine the cost (or
benefit) of a correct or incorrect classification.
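The two measures mentioned above can be illustrated with a short Python sketch. It shows a random holdout partition, a misclassification rate, and a profit/loss calculation. The 30% holdout fraction, the profit values, and the example labels are hypothetical and are not taken from the survey data or from Enterprise Miner.

```python
import random


def partition(rows, holdout_fraction=0.3, seed=42):
    """Randomly split rows into a training set and a holdout set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def misclassification_rate(actual, predicted):
    """Proportion of observations whose predicted class is wrong."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


# Hypothetical profit matrix keyed by (actual, predicted).
PROFIT = {
    (1, 1): 10.0,  # correctly identified responder
    (1, 0): 0.0,   # missed responder
    (0, 1): -2.0,  # cost of acting on a non-responder
    (0, 0): 0.0,   # correctly ignored non-responder
}


def total_profit(actual, predicted, profit=PROFIT):
    """Sum the profit or loss attached to each classification outcome."""
    return sum(profit[(a, p)] for a, p in zip(actual, predicted))


# Hypothetical holdout labels and model predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
rate = misclassification_rate(actual, predicted)
profit = total_profit(actual, predicted)
```

Two models with the same misclassification rate can differ in total profit when the two kinds of error carry different costs, which is why the profit/loss view supports decision making.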
In SAS/Stat, datasets are often small so that partitioning is not possible. The strength of a logistic regression is
measured by the odds ratios, the receiver operating characteristic (ROC) curve, and the p-values. Similarly, the strength of discriminant
analysis is measured by the proportion of correct classifications without the use of a holdout sample, although there
is an option for cross validation that is not available for logistic regression.
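When a dataset is too small to partition, cross validation can stand in for a holdout sample. The Python sketch below shows the leave-one-out form of the idea: each observation is classified by a model built from all the others. A nearest-centroid rule is used here purely as a simple stand-in classifier (it is not the discriminant analysis procedure in SAS/Stat), and the data are hypothetical.

```python
import math


def centroid(rows):
    """Coordinate-wise mean of a list of feature tuples."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


def nearest_centroid_predict(train, x):
    """Assign x to the class whose training centroid is closest."""
    by_class = {}
    for features, label in train:
        by_class.setdefault(label, []).append(features)
    centroids = {label: centroid(rows) for label, rows in by_class.items()}
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))


def leave_one_out_error(data):
    """Misclassification rate when each observation is predicted
    from a model built on all the other observations."""
    errors = 0
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        if nearest_centroid_predict(train, features) != label:
            errors += 1
    return errors / len(data)


# Hypothetical data: the last "a" observation sits inside the "b" group.
data = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((4.0, 4.0), "b"), ((4.2, 3.9), "b"), ((3.8, 4.1), "b"),
    ((3.9, 4.0), "a"),
]
loo_error = leave_one_out_error(data)
```

Because every observation is predicted by a model that never saw it, the resulting error rate is less optimistic than one computed on the same data used for fitting.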
Neural networks act like “black boxes” in that the model is not presented in the nice, concise format provided by
regression. Their accuracy is examined in a way similar to the diagnostics of the regression curve. The simplest neural
network contains a single input (independent variable) and a single target (dependent variable) with a single output
unit. It increases in complexity with the addition of hidden layers and additional input variables. With no hidden
layers, the results of a neural network analysis will resemble those of regression. Each input variable is
connected to each variable in the hidden layer, and each hidden variable is connected to each outcome
variable. The hidden units combine inputs, and apply a function to predict outputs. Hidden layers are often
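The way hidden units combine inputs and apply a function can be sketched in a few lines of Python. The tanh activation for hidden units, the logistic output function, and all weights below are assumptions chosen for illustration; they are not taken from Enterprise Miner.

```python
import math


def logistic(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer: every input feeds every hidden unit,
    and every hidden unit feeds the single output unit."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(weights, inputs)) + b)
        for weights, b in zip(hidden_weights, hidden_biases)
    ]
    z = sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    return logistic(z)


def forward_no_hidden(inputs, weights, bias):
    """With no hidden layer the network is a weighted sum passed through
    the logistic function, the same form as a logistic regression."""
    return logistic(sum(w * x for w, x in zip(weights, inputs)) + bias)


# Hypothetical weights for two inputs and two hidden units.
p_hidden = forward([0.0, 0.0], [[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0], [1.0, 1.0], 0.0)
p_direct = forward_no_hidden([1.0, 2.0], [0.5, -0.25], 0.0)
```

The second function makes concrete the remark that a network with no hidden layer resembles regression: the only thing the hidden layer adds is the intermediate nonlinear combination of the inputs.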
Decision trees provide a completely different approach to the problem of classification. The decision tree develops a
series of if…then rules. Each rule assigns an observation to one segment of the tree, at which point there is another
if…then rule applied. The initial segment, containing the entire dataset, is the root node for the decision tree. The
final nodes are called leaves. An intermediate node plus all its successors forms a branch of the tree. The
final leaf containing an observation gives its predicted value. Unlike neural networks and regression, decision trees will
not work with interval data. They will work with nominal outcomes that have more than two possible results. Decision
trees will also work with ordinal outcome variables.
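A decision tree's chain of if…then rules can be represented directly as a small data structure. The Python sketch below walks an observation from the root node to a leaf; the variable names, cut points, and the ordinal outcome labels are hypothetical and were not fitted to the survey data.

```python
# Each internal node holds a rule (variable, cut point); each leaf holds a prediction.
tree = {
    "rule": ("course_level", 400),  # root node: the entire dataset starts here
    "left": {"leaf": "low"},        # applies when course_level < 400
    "right": {                      # branch: a node plus all its successors
        "rule": ("study_hours", 10),
        "left": {"leaf": "medium"},
        "right": {"leaf": "high"},
    },
}


def predict(node, obs):
    """Apply one if...then rule per node until a leaf is reached."""
    while "leaf" not in node:
        variable, cut = node["rule"]
        node = node["left"] if obs[variable] < cut else node["right"]
    return node["leaf"]


prediction = predict(tree, {"course_level": 500, "study_hours": 12})
```

Reading the structure from the top down gives the rules in plain language, for example "if course level is at least 400 and study hours are at least 10, then predict high", which is what makes a tree easier to interpret than a neural network.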
Interestingly, logistic regression has the lowest misclassification rate on the initial training set but the highest
misclassification rate on the testing set. The results suggest that logistic regression tends to overstate its accuracy
on the training data. The receiver operating characteristic (ROC) curve is given in Figure 7 for three models. Note also that the data sample is partitioned into
three sets instead of two. Predictive modeling in Enterprise Miner is iterative. The initial result is compared to the
validation set and adjustments are made to the model. Once all adjustments are completed, the final testing set is used
to validate the results.
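The three-set workflow described above can be sketched in Python as follows. The 40/30/30 split, the fixed random seed, and the two toy candidate models are assumptions made for illustration. The point is only that candidates are compared on the validation set while the testing set is held back for a final check.

```python
import random


def three_way_split(rows, fractions=(0.4, 0.3, 0.3), seed=1):
    """Shuffle rows and cut them into training, validation, and testing sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_valid = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test


def error_rate(model, rows):
    """Misclassification rate of a model on (input, outcome) pairs."""
    return sum(model(x) != y for x, y in rows) / len(rows)


def select_model(candidates, valid):
    """Return the name of the candidate with the lowest validation error."""
    return min(candidates, key=lambda name: error_rate(candidates[name], valid))


# 192 is the number of survey responses reported in the paper.
train, valid, test = three_way_split(list(range(192)))

# Hypothetical candidate models and validation pairs.
candidates = {"cut_at_5": lambda x: int(x >= 5), "always_zero": lambda x: 0}
validation_pairs = [(x, int(x >= 5)) for x in range(10)]
chosen = select_model(candidates, validation_pairs)
```

Because the validation set influences which model is chosen, its error rate is no longer an unbiased estimate; the untouched testing set supplies that estimate at the end.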
With large datasets, data visualization becomes an important part of exploration in data mining. Version 4.3 of
Enterprise Miner included a node for SAS/Insight, and included all graphics within Insight. Version 5.1 removed the
SAS/Insight Node, adding a StatExplore Node. However, neither yet provides the point-and-click graphics that are
readily available in SAS Enterprise
There is, however, one set of graphics available in SAS/Stat that is not available in any other component of SAS.
That procedure, PROC KDE, allows the investigator to overlay smoothed histograms to examine data. It is available
for interval data only. In addition to questions concerning preferences for mathematics, students were asked to
estimate the number of hours per week they expected to study for their mathematics courses. In Enterprise Miner, it
is possible to use histograms to examine the data. Figure 8 was produced in Enterprise Miner using the StatExplore
Node. Figure 9 shows the corresponding smoothed histogram as produced by SAS/Stat.
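A smoothed histogram of this kind is a kernel density estimate. The Python sketch below shows the basic calculation, assuming a Gaussian kernel and a hand-picked bandwidth; the study-hours values are hypothetical rather than the survey responses, and the code is not intended to reproduce PROC KDE output.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function estimating the density at x by averaging
    Gaussian bumps centered on each observation."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


# Hypothetical expected study hours per week.
hours = [2, 3, 3, 4, 5, 5, 6, 8, 10, 12]
density = gaussian_kde(hours, bandwidth=1.5)

# Evaluate on a grid, as one would before plotting or overlaying curves.
step = 0.5
grid = [i * step for i in range(-20, 61)]
curve = [density(x) for x in grid]
area = sum(curve) * step  # should be close to 1 for a density
```

Computing one such curve per subgroup and plotting them on the same axes gives the overlay the paper describes; a smaller bandwidth follows the data more closely, while a larger one gives a smoother curve.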
There are many similar procedures in SAS/Stat and Enterprise Miner that can be used for similar needs. However,
the process generally differs because Enterprise Miner routinely partitions data to validate models and to optimize the choice of
model. SAS/Stat contains many similar models but does not have the partitioning process readily available,
although it can be coded into the process. There are some exploratory techniques built into SAS/Stat, such as
PROC KDE, that can complement Enterprise Miner. In addition, partitioning and imputation steps can be performed
in Enterprise Miner, and the data analyzed using SAS/Stat techniques.
Comparison between SAS Enterprise Miner and SAS/STAT
The first major difference between EM and STAT is that EM is GUI-based and requires no programming effort. This enhances productivity by saving the time otherwise spent writing and debugging code.
More specifically, among others, the following functionalities are unique to EM:
- Decision Trees for exploratory data analysis and predictive modeling,
- Neural Network models
- Identifying optimal Power Transformations for predictive modeling
- Kohonen Networks or Self Organizing Maps.
- Expectation Maximisation Clustering
- Market Basket Analysis (Association Analysis)
- Memory Based Reasoning
- Variable selection based on chi-square or R-square
- Text Mining
- Link Analysis
- Time Series Analysis
- Reporter, providing an audit trail of your data mining project
- Missing value replacement
- Ensemble modeling including bagging, boosting and combined models
- Profitability modeling for categorical targets
- Multiple model assessment
- Multiple target modeling
- Multistage modeling (combining response models with e.g. profit or revenue models)
- Automatic scoring code generation (in BASE SAS or C)
- Model (regression, neural network, tree) tuning using validation data, which results in a large improvement in efficiency
SAS/STAT is primarily targeted to quantitative analysts who possess solid statistical skills and are comfortable working in a programming environment. EM is targeted to both the quantitative analyst and the business analyst, with a major goal of increasing the productivity of the less experienced analyst through the guided, SEMMA-based process flow GUI.
The STAT user typically needs to choose the appropriate settings from a large host of options. EM also provides numerous options for the quantitative analyst to fine-tune the modeling process, but the tools are preset with intelligent defaults that enable the business analyst to obtain at least fair to good results.
EM has been designed to help bridge the gap between statistics and business intelligence. With SAS/STAT, the user is fully responsible for steering the analysis in the right direction. For example, the EM modeling algorithms automatically monitor performance on the validation ‘hold out’ data to prevent overfitting the training data, which is especially helpful for the less seasoned modeler. The STAT user must ensure that the model generalizes well to new holdout data, and often relies on personal experience together with traditional statistical diagnostics for model selection.
EM is meant to be a productivity tool for analyzing large volumes of data. EM process flows can be shared, modified, and applied to new data, and they are self-documenting. For subsequent analysis in STAT, sampling, data transformations, and filtering are typically done using Base SAS code. These tasks are automated in EM through the Sampling, Data Partition, Transformation, and Filtering nodes. The STAT STDIZE procedure does support data imputation, but it does not provide the decision tree imputation method of the EM Data Replacement node.
EM nodes can be configured in numerous ways to support flexible, interactive pattern discovery in high-dimensional data sets.
Another EM productivity gain that is worth highlighting separately is that EM automatically captures the complete scoring code for all stages of model development in the SAS, C, and Java languages. Scoring is the end result of the mining process, and EM makes it easy for the analyst to insert the code into operational scoring systems without having to be concerned with the potential errors that can arise from manual conversions. Very few STAT procedures have scoring options, and there is no translation facility for STAT to provide score code in C or Java. EM Credit Scoring nodes can further transform regression output into scaled, points-based scorecards for easy implementation into operational systems.
EM enables the analyst to generate numerous models and compare the modeling results simultaneously in a single, easy-to-interpret graphical framework. Assessment charts include Profit, Receiver Operating Characteristic (ROC) curve, Diagnostic Classification, Threshold-based Charts, Predicted Plots, Gains Charts, and Lorenz Curves. The ability to do comparative assessment of many different models in order to pick the best one is not provided in STAT.
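As a minimal illustration of comparative model assessment, the Python sketch below ranks two sets of model scores by the area under the ROC curve. It computes the area as the proportion of (event, non-event) pairs that are ordered correctly, which is only one of several equivalent ways to obtain it. The model names, scores, and outcomes are hypothetical, and the code does not reproduce EM's assessment charts.

```python
def auc(actual, scores):
    """Area under the ROC curve: the share of (event, non-event) pairs
    in which the event receives the higher score (ties count one half)."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes and predicted probabilities from two models.
actual = [1, 1, 0, 1, 0, 0]
model_scores = {
    "model_a": [0.9, 0.8, 0.7, 0.6, 0.3, 0.1],
    "model_b": [0.6, 0.4, 0.5, 0.3, 0.7, 0.2],
}

# Compare every model on the same data and keep the best one.
results = {name: auc(actual, scores) for name, scores in model_scores.items()}
best = max(results, key=results.get)
```

In a programming environment this kind of side-by-side comparison has to be assembled by hand for each candidate model, which is the effort the assessment facility in EM is meant to remove.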
The EM client/server architecture allows users to employ large UNIX processors or Mainframes to access and analyze enormous data sources.
The DMTOOL class allows you to write your own EM node and add it to the tool palette.
Metadata attributes for the variables are defined automatically, which is especially helpful with large data sets.
Integration with SAS Warehouse Administrator dramatically reduces data preparation time (data preparation estimated to be 80% of the effort in data analysis).
In terms of credit scorecard development, EM is a risk reduction tool for companies that bring credit scorecard development in-house. Training replacement staff to develop scorecards with EM would take about 1-3 months, compared to 8-12 months with programming-intensive tools. This is a significant time saver, allowing staff to become productive faster. EM’s time savings also mean that staff can focus on strategy development and analysis rather than programming. This in turn delivers better strategies and lowers losses.
• Standardised and structured approach facilitates interaction between scorecard developers
• Graphical representation supports interactive modelling process
• Enables exchangeability of modellers and modelling teams
• Reduced training needs for model developers (statistical knowledge given)
• Benchmarking of linear against non-linear techniques
• Automated classing reduces time
• Monitoring documentation reports are standardised and automated
• Time-saving due to automatically generated tables and graphs
• Scorecard development process seems to be much faster
• Good overview, structured and conveniently arranged approach
• No programming - principally - needed
• Automated classification of variables
• Usability of existing macros