# AnalyticBridge

A Data Science Central Community

Hi,

Could anybody please clarify one of my doubts? Suppose my dependent variable is a dummy variable, so I have to use a probit or logit model. How do I decide whether to use the logit or the probit model?

Regards.
Arijit

### Replies to This Discussion

Arijit

I'm not sure what you mean by "my dependent variable is dummy". Could you add some clarification?

In my experience, the logit and probit models tend to produce extremely similar results and you usually need a lot of data in the tails to notice a difference in fit (if you superimpose the response curves from the two models you will see that they are almost identical). The difference in application of the two approaches is mostly down to which has historically been used in the particular area of research and how you want to use the results. For example, if you want to quote odds ratios, then the logit approach makes more sense.
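The point about odds ratios can be illustrated with a minimal sketch. The intercept and slope below are hypothetical values, not estimates from any real data; the sketch only checks that, in a logit model, a one-unit increase in the predictor multiplies the odds by `exp(slope)` wherever you start on the curve.

```python
import math

def logistic(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

# Hypothetical fitted logit model: log-odds = intercept + slope * x
intercept, slope = -1.5, 0.8

def odds_ratio_per_unit(x):
    """Ratio of the odds at x + 1 to the odds at x."""
    p_after = logistic(intercept + slope * (x + 1))
    p_before = logistic(intercept + slope * x)
    return odds(p_after) / odds(p_before)

# The ratio is the same at every starting point and equals exp(slope).
for x in (-2.0, 0.0, 3.0):
    print(x, round(odds_ratio_per_unit(x), 6), round(math.exp(slope), 6))
```

A probit coefficient has no such constant-ratio interpretation, which is one practical reason to prefer logit when odds ratios need to be reported.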

If you are dealing with multivariate data, here is a paper you may find interesting: http://home.gwu.edu/~soyer/mv1h.pdf

Matt

Hi Matt,

Thank you for your help. Let me give you the details of the model I was trying to build. I work as a web analyst, and for one of my clients I wanted to know which factors affect "higher" sales of a product. So, first I split the products sold into two segments based on the \$ Index value calculated by Google Analytics. A higher-than-average \$ Index value is termed a success and marked as 1, and a lower-than-average value is termed a failure and marked as 0. So when I said my dependent variable is a dummy, I meant that it takes only the values 1 and 0. While I was working on this regression model, the question I originally posted came to mind, as I couldn't find any reference on it. I think you helped to clarify my question to some extent (the paper in particular was quite interesting).
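The coding of the dependent variable described above could be sketched as follows. The product names and \$ Index values are made up for illustration, and treating a value exactly equal to the average as a failure is an assumption, since the description does not cover ties.

```python
from statistics import mean

# Hypothetical per-product $ Index values (made-up numbers for illustration).
dollar_index = {
    "product_a": 0.40,
    "product_b": 2.10,
    "product_c": 1.25,
    "product_d": 0.25,
}

average = mean(list(dollar_index.values()))

# 1 = "success" (above average), 0 = "failure" (at or below average; assumption for ties).
success = {name: int(value > average) for name, value in dollar_index.items()}

print(average, success)
```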
There is just one more point on which I'd like a bit more clarification. What do you mean by "you usually need a lot of data in the tails to notice a difference in fit"? How can I tell from looking at the data whether that statement holds? Do I have to check by running both the logit and probit regressions, or do I just check the distribution curve?

Regards.
Arijit

Arijit

In their raw form, all of your observations are either 0 or 1, which are discrete groups, so my statement concerning "data in the tails" may not make immediate sense. However, what the probit/logit models actually do is to model a continuous probability of group membership, using one of those two sigmoid curves. Hence, for an individual observation, the model will return a value somewhere between 0 and 1, which lies somewhere on that curve.

By "tails" I mean the part of the probability curve closest to the extremes and it is only in this region that you can really see a difference between the two methods. To illustrate further, think of a simple example where you have just one predictor variable, which is an ordinal categorical variable with multiple levels, where the higher the level, the greater the probability of a positive response. At each one of these levels you can calculate the proportion of observations with positive responses and plot these proportions against the predictor. Hopefully, you will see an s-shaped curve with lower and upper asymptotes at 0 and 1, respectively. Your probit/logit analysis will fit a curve through these proportions and you would need a lot of groups with probabilities close to 0 or 1 to be able to detect a difference in fit between the two models. It is not so easy to visualise the problem with other predictor types (continuous, dichotomous, non-ordinal categorical) but the principle is the same. In practice, the only way you can tell whether one model is better than the other is by fitting both and examining the results to see if one gives a better fit to the data but I would be surprised if you could detect such a difference in most cases. Instead, I would usually base the choice on what you want from the analysis and which is easiest to explain/justify to your clients.
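To make the "tails" point concrete, here is a minimal sketch that superimposes the two sigmoid curves numerically. It is not a fit to any data: it compares the standard normal CDF (probit) with a logistic curve whose argument is rescaled by 1.702, a commonly used constant for aligning the two. The grid and the chosen evaluation points are arbitrary assumptions.

```python
import math

def logistic_cdf(z):
    """Logit response curve."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    """Probit response curve (standard normal CDF)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

SCALE = 1.702  # common rescaling that aligns the logistic curve with the normal CDF

# Largest vertical gap between the two curves over a wide grid.
grid = [i / 100.0 for i in range(-600, 601)]
max_abs_diff = max(abs(normal_cdf(z) - logistic_cdf(SCALE * z)) for z in grid)
print("largest absolute gap:", round(max_abs_diff, 4))

# In the middle the curves agree closely; in the tail the ratio of the two
# probabilities grows, even though both values are tiny.
for z in (0.0, -1.0, -2.0, -3.0, -4.0):
    p_probit = normal_cdf(z)
    p_logit = logistic_cdf(SCALE * z)
    print(z, p_probit, p_logit, p_logit / p_probit)
```

The absolute gap stays under 0.01 everywhere, so proportions near the middle of the curve cannot separate the models. Only observations with predicted probabilities very close to 0 or 1, where the logistic curve has the heavier tail, carry information about which curve fits better.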

Hope this helps

Matt

Hi Matt,

Thanks for your reply. It took me a while to respond because I was trying to understand it. I won't say I've understood everything, but it helped me clear up a lot of points. Thank you for all your help.

Regards.
Arijit

Hi -

Probit assumes that the error term of the underlying latent variable follows a normal distribution, while logit assumes it follows a logistic distribution. The results may look similar because the sample size you use is "large"; try the same thing with a smaller dataset and you may see a more noticeable difference.

Hi,

I am currently studying a binary economic dependent variable using logit regression. My dataset is large (N = 340,000), but the "yes" cases make up only about 1% of it, so the goodness of fit of my model is very low, mainly for that reason. Could you please help me understand whether there are other binary regression models I should use to obtain better results? Or do you think I should transform my data, or do you have any other ideas for this type of situation? :)

Thank you!

Teresa

The standard approach is to have at least a 2% response rate in your data. You can boost the response rate here, e.g. by oversampling: repeating the rows of responders enough times to raise the response rate to a considerable level.

Alternatively, you can use decision tree techniques to identify a segment of customers that consists mostly of non-responders and leave that segment out of the analysis, which will also increase your response rate.
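The oversampling idea can be sketched as below. The data are synthetic (10,000 rows with exactly 1% responders), and the field name `responded` and the choice of 5 copies are assumptions made for illustration only.

```python
# Hypothetical data: 10,000 rows, every 100th row is a responder (1% response rate).
rows = [{"id": i, "responded": int(i % 100 == 0)} for i in range(10_000)]

responders = [r for r in rows if r["responded"] == 1]
non_responders = [r for r in rows if r["responded"] == 0]

def oversample(responders, non_responders, copies):
    """Repeat each responder row `copies` times and keep every non-responder once."""
    return non_responders + responders * copies

def response_rate(data):
    """Share of rows with a positive response."""
    return sum(r["responded"] for r in data) / len(data)

oversampled = oversample(responders, non_responders, copies=5)

print("original response rate:   ", response_rate(rows))
print("oversampled response rate:", round(response_rate(oversampled), 4))
```

Keep in mind that a model fitted to oversampled data sees an inflated base rate, so its predicted probabilities are on that inflated scale rather than the scale of the original population.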