<p>Hi, I am trying to find a good tool (one that requires minimal additional effort) to help me generate a probability of a sale for a list of 300,000 products. I have attached a sample of the data, with 20,000 records.</p>
<p>Basically, I have a table of historical sales data (about 300,000 records) that contains around 8 continuous independent variables along with a dependent variable that has a yes/no (i.e., binary outcome) value indicating whether each product in the list has had a sale in the past 12 months.</p>
<p>The historical data essentially looks like this.</p>
<p>Product 1, 2, 3, etc.<br/>Variable 1<br/>Variable 2<br/>Variable 3<br/>Variable 4<br/>Variable 5<br/>Variable 6<br/>Variable 7<br/>Variable 8<br/><b>Sold in past 12 months</b> (Yes or No)</p>
<p>The last variable in the list is of course the dependent variable.</p>
<p>All I want is the best or easiest tool for assigning a probability to each product in the list, so that I can condense the list to the products most likely to generate a sale and list those instead of the ones with a lower probability.</p>
<p>Ideally, the tool could do a quick logistic regression, or some other probability calculation based on the available variables, and thereby give me a (RVU-like) number (perhaps a probability ranging from 0 to 1) for each product, allowing me to quickly select the top 50,000 products to list on a website, since they have the higher probability of generating a sale according to the available variables.</p>
<p>I am of course assuming that the variables are somehow correlated to the outcome, but perhaps the tool will help me determine that.</p>
<p>Does anyone have any suggestions of a good tool to accomplish this? I would presume that there is a simple way to set this up in Microsoft Excel, but if not, then a piece of software that does this would of course be great too.</p>
<p>Or, feel free to review the actual sample data set, to help me understand how best to approach analyzing the data, and whether I should eliminate certain variables from the results. </p>
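<p>As a rough illustration of the workflow described above (fit a logistic regression on the historical yes/no outcome, score every product, keep the top of the ranking), here is a minimal sketch in plain Python. The data is synthetic and the names <code>fit_logistic</code>, <code>score</code> and <code>top_k</code> are hypothetical helpers, not part of any particular tool; with the real 300,000-row table a statistics package would normally do the fitting.</p>

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit weights (intercept first) by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

def score(w, X):
    """Predicted probability of a sale for each row of X."""
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]

def top_k(ids, probs, k):
    """Ids of the k rows with the highest predicted probability."""
    ranked = sorted(zip(ids, probs), key=lambda t: t[1], reverse=True)
    return [pid for pid, _ in ranked[:k]]

# Synthetic stand-in for the real table: 2 variables instead of 8 (assumption).
random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(400)]
y = [1 if random.random() < sigmoid(2.0 * a - 1.0 * b) else 0 for a, b in X]

w = fit_logistic(X, y)
probs = score(w, X)
best = top_k(list(range(len(X))), probs, 50)
```

<p>The same ranking step applies whatever tool produces the probabilities: sort by predicted probability and keep the first 50,000 rows.</p>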
Queries in modeling
<p>Hi all,</p>
<p>I come from an engineering background and would appreciate your help with a few modeling concepts.</p>
<p>My questions are as follows:</p>
<ol>
<li>If a variable which is important from business standpoint has a p-value of 0.5, then should it be considered in the model? If Yes, then wouldn't it make the model coefficients unstable?</li>
<li>Should I standardize the variables before building a logistic regression model? If Yes, is there a commonly followed approach?</li>
<li>I am planning to develop a logistic regression model to rate employees as good or bad. The model includes variables such as the employee's innovation score, number of papers published, salary, training cost, etc. The first two are assets to the company and the last two are liabilities. Should I make this explicit to the model by coding the liabilities as negative values?</li>
<li>I have two independent variables in my LR model. Var1 has levels 'A' and 'B'. Var2 has levels 'X' and 'Y'. Of the entire dataset, 30% of observations have Var1 as 'A' and Var2 as 'X', 35% have Var1 as 'A' and Var2 as 'Y', 30% have Var1 as 'B' and Var2 as 'X', and 5% have Var1 as 'B' and Var2 as 'Y'. The number of observations with Var1 as 'B' and Var2 as 'Y' is far smaller than for the other combinations. Will this imbalance affect my results? If so, how should I correct for it?</li>
</ol>
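<p>On the standardization question, one commonly used approach is the z-score: subtract the column mean and divide by the sample standard deviation. The sketch below is only an illustration; <code>standardize</code> and the salary figures are made up for the example. For an unpenalized logistic regression this rescales the coefficients without changing the fitted probabilities, so it mostly matters for comparing coefficient sizes, for regularized fits, and for numerical convergence.</p>

```python
import math

def standardize(column):
    """Return z-scores plus the mean and sample standard deviation used."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / (len(column) - 1)
    sd = math.sqrt(var)
    if sd == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0] * len(column), mean, sd
    return [(v - mean) / sd for v in column], mean, sd

# Hypothetical salary values, for illustration only.
salary = [30000, 45000, 60000]
z, salary_mean, salary_sd = standardize(salary)
```

<p>The mean and standard deviation should be computed on the training data and reused unchanged when scoring new records.</p>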
Random Forests vs MARS vs Linear regression
<p>Hi all, I would like to get the group's view on the advantages and disadvantages of Random Forests and MARS modelling vs Linear regression. It would be interesting to compare them both at a statistical principles level, but also in their usefulness to econometrics.</p>
Techniques to address very low event rate for Logistic Regression Model
<p> Hi Folks,</p>
<p>I am looking at data from a telecom company and developing a model to predict an event (i.e., churn).</p>
<p>I am planning to develop a GLM with a logit link function.</p>
<p>The real problem I am facing in the data is the very low proportion of churners (1.6%).</p>
<p>So I am seeking advice on the following:</p>
<p>- What are the possible (bad) outcomes if I take a randomised training sample containing just 1.6% churners?</p>
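<p>One option that is often discussed for rare events, offered here only as a sketch, is to keep every churner, keep a random fraction of the non-churners, fit the logit model on that smaller sample, and then shift the predicted odds back to the population event rate. The function names below are hypothetical, and the correction assumes that only the intercept is distorted by sampling on the outcome.</p>

```python
import random

def downsample_non_events(rows, labels, keep_fraction, seed=0):
    """Keep all events (label 1) and a random fraction of non-events (label 0)."""
    rng = random.Random(seed)
    kept = [(r, l) for r, l in zip(rows, labels)
            if l == 1 or rng.random() < keep_fraction]
    return [r for r, _ in kept], [l for _, l in kept]

def correct_probability(p_sample, sample_rate, population_rate):
    """Map a probability from the resampled model back to the population scale.

    The odds are multiplied by the ratio of population odds to sample odds,
    which is equivalent to shifting the model intercept.
    """
    odds = p_sample / (1.0 - p_sample)
    factor = (population_rate / (1.0 - population_rate)) * \
             ((1.0 - sample_rate) / sample_rate)
    corrected = odds * factor
    return corrected / (1.0 + corrected)

# Toy labels for illustration: 100 events among 300 rows.
labels = [1, 0, 0, 0, 1, 0] * 50
rows = list(range(len(labels)))
rows_ds, labels_ds = downsample_non_events(rows, labels, keep_fraction=0.2)
sample_rate = sum(labels_ds) / len(labels_ds)

# A score equal to the base rate of a 50/50 sample maps back to 1.6%.
p_population = correct_probability(0.5, 0.5, 0.016)
```

<p>The ranking of customers is unchanged by this correction; only the level of the probabilities moves.</p>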
Understanding the Kalman Filter Application in Economic Time Series Data
<div class="discussion"><div class="description"><div class="xg_user_generated"><p>The Kalman filter has been used extensively in science for a wide range of applications, from tracking missile targets to almost any changing system whose state can be estimated from noisy measurements.</p>
<p>I'm trying to understand how the Kalman filter can be applied to time series data with exogenous variables; in a nutshell, I am trying to replicate PROC UCM in Excel.</p>
<p>State-space equation :</p>
<p><img alt="Kalman - equation 1" border="p" height="23" src="http://bilgin.esme.org/Portals/0/images/kalman/equation1.gif" width="215"></img></p>
<p><img alt="Kalman - equation 2" border="0" height="30" src="http://bilgin.esme.org/Portals/0/images/kalman/equation2.gif" width="115"></img></p>
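<p>To make the recursion concrete, here is a minimal scalar sketch. It assumes the usual linear state-space form, x_k = a*x_(k-1) + b*u_k + w with w ~ N(0, q), and z_k = h*x_k + v with v ~ N(0, r), where u_k stands in for an exogenous variable. The parameter names and default values are assumptions chosen for illustration, and this is not a reproduction of what PROC UCM does internally.</p>

```python
def kalman_filter(z, u, a=1.0, b=0.0, h=1.0, q=1e-3, r=1.0, x0=0.0, p0=1.0):
    """Scalar Kalman filter; returns the filtered state estimate at each step.

    z: observations, u: exogenous inputs (same length as z),
    q: process noise variance, r: measurement noise variance,
    x0, p0: initial state estimate and its variance.
    """
    x, p = x0, p0
    estimates = []
    for zk, uk in zip(z, u):
        # Predict: push the state and its variance through the transition.
        x_pred = a * x + b * uk
        p_pred = a * p * a + q
        # Update: blend prediction and observation using the Kalman gain.
        gain = p_pred * h / (h * p_pred * h + r)
        x = x_pred + gain * (zk - h * x_pred)
        p = (1.0 - gain * h) * p_pred
        estimates.append(x)
    return estimates

# Noise-free constant signal: the estimate should climb from x0 toward 5.
observations = [5.0] * 200
inputs = [0.0] * 200
filtered = kalman_filter(observations, inputs)
```

<p>Each step uses only the previous estimate, its variance and the new observation, so the predict and update lines translate directly into one spreadsheet row per time period.</p>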
Does R:NR ratio matter in deciding what technique we use for modeling?
I came across some speculation that the R:NR (responder to non-responder) ratio should decide which technique is employed. I haven't found any documentation or proof as yet, so I thought I'd ask for feedback and comments.<br></br><br></br>Consider 3 modeling scenarios:<br></br>We have 3 populations of 100K customers, targeted by 3 different programs<br></br><br></br>Situation A - 5% have responded to a program of ours.<br></br>Situation B - Nearly 50% have responded.<br></br>Situation C - Greater than 70-80% have…
Assessing robustness of a Logistic Model

Hi,

I've built a logistic model for a particular response/non-response event.

Its statistics don't look like those of a robust model. I'm sharing them below for clarification.

No. of variables - around 5-8
c = 0.9
concordance = 0.93
H-L chi-square (goodness of fit) = 700 (P << 0.0001; rejects the null hypothesis of adequate fit, which indicates poor calibration)

Also, the univariate distribution of P(Y=1|X1..Xn) shows that 95% of the predicted probabilities fall below 0.4, which suggests to me that the model does worse than random assignment!
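
For reference, here is a minimal sketch of how the concordance figures quoted above are usually computed: every (responder, non-responder) pair is compared on predicted probability, and c counts ties as half. The function name and the toy data are hypothetical. Because these statistics depend only on the rank order of the predictions, a high c can coexist with probabilities that all sit below 0.4; calibration is what the H-L test looks at.

```python
def concordance(probs, labels):
    """Pairwise concordance statistics for a binary response model."""
    events = [p for p, l in zip(probs, labels) if l == 1]
    non_events = [p for p, l in zip(probs, labels) if l == 0]
    conc = disc = tied = 0
    for pe in events:
        for pn in non_events:
            if pe > pn:
                conc += 1      # responder scored higher: concordant pair
            elif pe < pn:
                disc += 1      # responder scored lower: discordant pair
            else:
                tied += 1
    total = conc + disc + tied
    return {
        "concordant": conc / total,
        "discordant": disc / total,
        "tied": tied / total,
        "c": (conc + 0.5 * tied) / total,
    }

# Toy example: one responder, three non-responders, one tie.
stats = concordance([0.1, 0.2, 0.3, 0.3], [0, 0, 1, 0])
```

The double loop is quadratic in the number of observations, so on large datasets a rank-based calculation is normally used instead.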

What are the ways to improve my model? I have read about one or two methods recently, but I have no hands-on experience with them. I would like to hear any advice on this!

Thanks in advance.

Arun
Discriminant Analysis on Categorical Variables
I have a set of independent variables, both categorical and continuous. The variable to be predicted has classes C1 to Cn. The aim is to predict the category membership!<br />
<br />
I'm facing two issues. First, discriminant procedures require continuous variables for prediction. Second, logistic regression, which can be used here, produces probabilities of category membership; it does not describe the inter-class variance with distance measures the way a Canonical Discriminant Analysis does using the %plotit macro.<br />
<br />
Hence, I've got two questions.<br />
1. If I've got mixed variables, both continuous and categorical, can I still predict category membership for the variable being predicted? If yes, how?<br />
2. If the answer to the above is to use Logistic Regression or Genmod/Catmod, can I still obtain a distance-measure plot of the observations by category, so that I can see the between-category variance/distance and visually understand how the categories are separated?<br />
<br />
