# AnalyticBridge

A Data Science Central Community

# sample size

Hi friends,

I am a research scholar from Pondicherry University and I have a doubt regarding sample size. I have chosen a sample size of 30 for my paper, but my friends are telling me that the sample size should always be more than 50. However, in one of the workshops I attended, the resource person said that 30 samples can be enough. So I am very confused. Could you please clear my doubt?


### Replies to This Discussion

Hi,

There are at least two considerations:
1. The sample size depends heavily on the experimental design: the more independent variables you have, the more your degrees of freedom get partitioned. For an example of this, look into Analysis of Variance designs and F-tables.
2. The size of the "true" differences you are trying to measure and the quality of your measurements: the smaller the differences and the poorer your measurement quality, the larger your sample size will need to be to achieve statistically significant differences. In some experiments (particle physics, say) sample sizes are HUGE, >10^10.
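To make point 2 concrete, here is a sketch (my own illustration, not from the original reply) of the standard two-sample approximation n ≈ 2·(z_α/2 + z_β)²·σ²/δ² per group, assuming a two-sided test at α = 0.05 and 80% power:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sample z-test.

    delta: smallest "true" difference you want to detect
    sigma: standard deviation of the measurements
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

# Halving the difference you want to detect quadruples the required n:
print(n_per_group(delta=0.5, sigma=1.0))   # 63 per group
print(n_per_group(delta=0.25, sigma=1.0))  # 252 per group
```

This shows the trade-off directly: smaller true differences (or noisier measurements) blow up the required sample size quadratically.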

If you use multivariate statistics for data analysis, say with 15 variables, then the sample size should be greater than 3 times the number of variables, i.e., more than 45. For single-variable empirical data analysis, 30 samples is generally found to be okay.

I am also trying to find a "rule of thumb" for determining optimal sample size and the density within the sample. I have a feeling there isn't a hard-and-fast rule for this, but I am interested in hearing additional opinions.

Ideally, from my point of view, you should see a minimum density of around 10% for your target (predicted) value across your entire sample. Ultimately I would guess a 30-50% density is the ideal situation. Post-modeling, it seems you should balance your lift results against the density values to understand the relationship between the two.

Can you please share your best successes with predictive modeling: how big a sample set, along with what density or penetration of positive targets? I have so far experienced relatively low success with modeling marketing data, in three different scenarios:

#1, N=~9k, 200 positive events

#2, N=~500, 350 positive events

#3, N= ~700, 525 positive events

In my opinion all of these seem like poor samples to run regression models or decision trees on, and they don't lend themselves to statistically significant representation for predictive modeling. I welcome any and all feedback, as these initial sets were pre-determined and I want to avoid designing future analytics cases with such poor conditions.
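For reference, the positive-target density in the three scenarios above can be computed directly; the snippet below (my own restatement) just uses the counts given in the post:

```python
# Positive-event density for the three scenarios described above
scenarios = {
    "#1": (9000, 200),  # (total N, positive events)
    "#2": (500, 350),
    "#3": (700, 525),
}

for name, (n, positives) in scenarios.items():
    density = 100 * positives / n
    print(f"{name}: N={n}, positives={positives}, density={density:.1f}%")
```

Scenario #1 sits far below the ~10% minimum density suggested earlier, while #2 and #3 are heavily skewed in the other direction, which matches the concern that none of the three is well balanced.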

What are the minimum requirements you would consider for sample size and number of positive events?

The following formula allows you to calculate the size of a sample while taking into account the proportion in the target population:

n >= N*p*(1-p) / (p*(1-p) + l²*(N-1)/z²)

With:

N = Size of the population

n = Size of the sample

p = Proportion to be estimated

l = Chosen margin of error

z = z-score for the chosen level of confidence (e.g., 1.96 for 95% confidence)

So for

N = 2000

p = 0.91

l = 0.1

z = 1.96

we have n ≈ 30.99

Thus by adjusting the parameters carefully we can reach a sample size of 30.
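The calculation above can be reproduced with a short sketch (my own translation of the formula into code, using z = 1.96, the usual z-score for 95% confidence):

```python
def sample_size(N, p, l, z):
    """Minimum sample size for estimating a proportion p in a
    finite population of size N, with margin of error l and
    z-score z for the chosen confidence level."""
    return N * p * (1 - p) / (p * (1 - p) + l**2 * (N - 1) / z**2)

n = sample_size(N=2000, p=0.91, l=0.1, z=1.96)
print(round(n, 2))  # 30.99
```

Note how strongly the result depends on p: the closer the proportion is to 0.5, the larger the numerator p*(1-p) and the larger the required sample.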