This book is a compilation of essays that offer succinct, specific methods for solving the most commonly experienced problems in database analysis. The common theme among these essays is to pair each methodology with the specific type of problem it solves. To better ground the reader, I spend considerable time discussing the basic methodologies of database analysis and modeling. While this type of overview has been attempted before, my approach offers a truly nitty-gritty, step-by-step treatment that both tyros and experts in the field can enjoy. The job of the data analyst is overwhelmingly to predict and explain the result of a target variable, such as RESPONSE or PROFIT. Within that task, the target variable is either binary (RESPONSE is one such example) or continuous (of which PROFIT is a good example). The scope of this book is purposely limited to dependency models, in which the target variable is often referred to as the “left-hand” side of an equation, and the variables that predict and/or explain the target variable are the “right-hand” side. This is in contrast to interdependency models, which have no left- or right-hand side and are not covered in this book. Since interdependency models constitute a minimal proportion of the analyst’s workload, I humbly suggest that the book’s focus will prove utilitarian.
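For concreteness, the left-hand/right-hand distinction can be sketched in a few lines of Python. This is a minimal illustration of my own making, not a model from the book; the variable names (RECENCY, FREQUENCY) and coefficient values are purely hypothetical:

```python
# A toy dependency model: the target (PROFIT) sits on the left-hand
# side, and the predictor variables sit on the right-hand side.
# All names and coefficient values here are illustrative only.

def score_profit(recency, frequency, b0=10.0, b1=-0.5, b2=2.0):
    """Right-hand side of a toy linear dependency model for PROFIT."""
    return b0 + b1 * recency + b2 * frequency

# PROFIT (left-hand side) = 10.0 - 0.5*4.0 + 2.0*3.0
print(score_profit(recency=4.0, frequency=3.0))  # 14.0
```

An interdependency model, by contrast, would describe the mutual structure among all the variables at once, with no single variable singled out as the target.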
Accordingly, these essays have been organized in the following fashion. To provide a springboard into more esoteric methodologies, Chapter 2 covers the correlation coefficient. While reviewing the correlation coefficient, I bring to light several issues that many analysts are unfamiliar with, and I introduce two useful methods for variable assessment. In Chapter 3, I deal with logistic regression, a classification technique familiar to everyone, which in this book serves as the underlying rationale for a case study in building a response model for an investment product; in doing so, I introduce a variety of new data mining techniques. The continuous counterpart to this target variable is covered in Chapter 4. Chapters 5 and 6 focus on the regression coefficient and examine several common misinterpretations of the concept that point to weaknesses in the method. Thus, in Chapter 7 I offer an alternative measure, the predictive contribution coefficient, which offers greater utility than the standardized coefficient.
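As a refresher ahead of Chapter 2, the correlation coefficient itself takes only a few lines of Python. The sketch below is my own illustration, not code from the book; it also previews a point the chapter treatment dwells on, namely that r captures only the straight-line portion of a relationship:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A perfectly linear relationship yields r = 1.0 ...
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
# ... while a strong but nonlinear (quadratic) one falls short of 1.0,
# even though y is an exact function of x.
print(pearson_r([1, 2, 3, 4], [1, 4, 9, 16]))  # about 0.984
```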
Up to this juncture, I have dealt solely with the variables in a model. Beginning with Chapter 8, I demonstrate how to increase a model’s predictive power beyond that provided by its component variables. This is accomplished by creating an interaction variable, which is the product of two or more component variables. To test the significance of an interaction variable, I make what I feel is a compelling case for a rather unconventional use of CHAID. Creative use of well-known techniques is carried further in Chapter 9, where I solve the problem of market segment classification modeling using not only logistic regression but CHAID as well. In Chapter 10, CHAID is yet again utilized in a somewhat unconventional manner, as a method for filling in missing values in one’s data. To bring an interesting real-life problem into the picture, Chapter 11 describes profiling techniques for the database marketer who wants a method for identifying his or her best customers. The benefits of the predictive profiling approach are demonstrated, and the approach is then expanded into a discussion of look-alike profiling.
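The mechanics of constructing an interaction variable are exactly as stated: multiply the components. The short sketch below is illustrative only (the field names X1, X2, and X1X2 are placeholders, not names from the book):

```python
# Hedged sketch: derive an interaction variable as the product of two
# component variables. Field names are illustrative placeholders.

records = [
    {"X1": 1.0, "X2": 3.0},
    {"X1": 2.0, "X2": 0.5},
]

for r in records:
    r["X1X2"] = r["X1"] * r["X2"]  # new right-hand-side variable

print([r["X1X2"] for r in records])  # [3.0, 1.0]
```

Whether such a product actually earns its place in the model is the significance question the chapter answers with CHAID.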
I take a detour in Chapter 12 to discuss how database marketers assess the accuracy of their models. Three concepts of model assessment are discussed: the traditional decile analysis, along with two additional concepts I introduce, precision and separability. Continuing in this mode, Chapter 13 points to weaknesses in the way decile analysis is used and instead offers a new approach, the bootstrap, for measuring the efficiency of database models. Chapter 14 offers a pair of graphics, or visual displays, that have value beyond the commonly used exploratory phase of analysis. In this chapter, I demonstrate the hitherto untapped potential of visual displays to describe the functionality of the final model once it has been implemented for prediction.
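For readers new to the traditional decile analysis, the basic recipe is: score every record, rank by score from best to worst, cut the file into ten equal groups, and report the response rate within each group. The sketch below is my own generic illustration on synthetic data, not the book's worked example:

```python
import random

def decile_analysis(scored):
    """scored: list of (predicted_score, actual_response 0/1) pairs.
    Returns the response rate per decile, top decile first."""
    ranked = sorted(scored, key=lambda t: t[0], reverse=True)
    size = len(ranked) // 10
    rates = []
    for d in range(10):
        group = ranked[d * size:(d + 1) * size]
        rates.append(sum(resp for _, resp in group) / size)
    return rates

# 100 synthetic records whose actual response tracks the score,
# as it should under a well-calibrated model.
random.seed(0)
data = [(s, 1 if random.random() < s else 0)
        for s in (i / 100 for i in range(100))]
rates = decile_analysis(data)
print(rates[0], rates[-1])  # top decile out-responds the bottom decile
```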
With the discussions described above behind us, we are ready to venture onto new ground. In Chapter 1, I discussed statistics and machine learning, and I defined statistical learning as the ability to solve statistical problems using non-statistical machine learning. GenIQ is now presented in Chapter 15 as such a non-statistical machine-learning model. Moreover, in Chapter 16 GenIQ serves as an effective method for finding the best possible subset of variables for a model. Since GenIQ has no coefficients, and coefficients are the paradigm for prediction, Chapter 17 presents a method for calculating a quasi-regression coefficient, thereby providing a reliable, assumption-free alternative to the regression coefficient. Such an alternative provides a frame of reference for evaluating and using coefficient-free models, thus allowing the data analyst a comfort level for exploring new ideas, such as GenIQ.
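To convey the flavor of a coefficient-like quantity for a coefficient-free model, one generic device is a finite-difference sensitivity: bump one input, hold the rest fixed, and measure the change in the prediction per unit change. The sketch below is illustrative only and is not necessarily the exact quasi-regression-coefficient procedure developed in Chapter 17:

```python
# Illustrative only: a finite-difference sensitivity of a black-box
# model, in the spirit of a quasi-coefficient for a model that has no
# explicit coefficients. Names and the toy model are hypothetical.

def quasi_coefficient(model, record, var, delta=1.0):
    """Change in the model's prediction per unit change in one input."""
    bumped = dict(record)
    bumped[var] += delta
    return (model(bumped) - model(record)) / delta

# A coefficient-free model (a nonlinear rule standing in for GenIQ).
model = lambda r: r["X1"] * r["X2"] + max(r["X1"], 2.0)

print(quasi_coefficient(model, {"X1": 3.0, "X2": 2.0}, "X1"))  # 3.0
```

Unlike a regression coefficient, this quantity is local: it can change depending on where in the data the bump is taken, which is precisely why a frame of reference for interpreting it is needed.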