A Data Science Central Community
If you know the population distribution, you can check whether the sample distribution is similar to it (with a nonparametric test), but this is tautological. In any case, it is the correct application of the sampling scheme that, in general, guarantees representativeness. For example, the Central Limit Theorem and the Glivenko-Cantelli theorem underlie the definition of the most common probabilistic sampling schemes, including the rules for calculating the sample size.
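To make the nonparametric check concrete, here is a minimal sketch that computes the Kolmogorov-Smirnov statistic (the largest gap between two empirical CDFs) using only the Python standard library. The choice of KS is my assumption, since the post does not name a specific test, and the population here is synthetic. The function names `ecdf` and `ks_statistic` are made up for illustration.

```python
import random
from bisect import bisect_right

def ecdf(sorted_values, x):
    """Fraction of values <= x (sorted_values must be sorted ascending)."""
    return bisect_right(sorted_values, x) / len(sorted_values)

def ks_statistic(sample, population):
    """Largest vertical gap between the two empirical CDFs."""
    s, p = sorted(sample), sorted(population)
    # Both ECDFs are step functions, so the gap only changes at observed values.
    return max(abs(ecdf(s, x) - ecdf(p, x)) for x in s + p)

rng = random.Random(0)
population = [rng.gauss(50, 10) for _ in range(5000)]  # synthetic data, assumed

good = rng.sample(population, 500)      # simple random sample
biased = sorted(population)[-500:]      # only the 500 largest values

d_good = ks_statistic(good, population)
d_biased = ks_statistic(biased, population)
print(f"random sample D = {d_good:.3f}, biased sample D = {d_biased:.3f}")
```

A small D says the sample's distribution tracks the population's; a D near 1 says the sample covers only part of it. Turning D into a p-value needs extra care here because the sample is drawn from the same finite population it is compared against.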
People have discussed doing away with sampling altogether, given the advancements in technology: namely, the reduction in storage cost and the evolution of high-performance analytics. The argument is that time-to-insight is no longer a barrier.
What do you think are the examples where sampling is still the preferred approach?
Social research, where the number of measurements is not massive in certain studies, such as jury bias. Pharmaceutical research, where the cost of data collection can be astronomical. Both are self-selected (voting registration, lifestyle or genetics (we can debate that one later)), which is a problem for generalizing. Nonetheless, if we restrict the domain of generalization, some of the inferences may hold water.
Yes, a simple random sample has all the desired characteristics (mean and variance of statistics, etc.) of a representative sample of the population. But in practice, it is almost always impossible to come up with a sample that is exactly a simple random sample.
The answer is: no. No sample is guaranteed to be representative of the entire population, although the risk of a non-representative sample falls as the sample's share of the population (sample size / total N) grows. The smaller the sample, the higher the risk.
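A quick simulation illustrates the point about sample size and risk. This is only a sketch under assumed conditions: the "incomes" are synthetic draws from an exponential distribution, and `mean_abs_error` is a helper name I made up. It measures how far the sample mean lands from the population mean, on average, over repeated simple random samples.

```python
import random
import statistics

def mean_abs_error(population, n, repeats, rng):
    """Average |sample mean - population mean| over repeated simple random samples."""
    true_mean = statistics.fmean(population)
    errors = []
    for _ in range(repeats):
        sample = rng.sample(population, n)  # sampling without replacement
        errors.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(errors)

rng = random.Random(42)
# Synthetic, skewed "incomes" with a mean of roughly 40,000 (assumed data).
population = [rng.expovariate(1 / 40000) for _ in range(20000)]

results = {n: mean_abs_error(population, n, 200, rng) for n in (20, 200, 2000)}
for n, err in results.items():
    print(f"n={n:>5}: mean absolute error of the sample mean = {err:,.0f}")
```

The error shrinks as n grows but never reaches zero, which matches the claim that no sample is guaranteed to be representative.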
To ensure a representative sample, I usually generate base statistics for the entire population (or get them from a trusted source like the Census Bureau). Then, I use key attributes that are general predictors of response, or cluster membership, or customer value as my short list of profiling attributes.
I want to make sure that a population sample is:
- Geographically representative
- Demographically representative (age, gender, family position, occupation)
- Representative in household characteristics (wealth, net worth, income, RANGE of income, RANGE of home value, RANGE of net worth, home value, home ownership rate)
- If customers, contains the same % of new vs. ongoing customers, has customers from all value groups, single-product vs. multi-product buyers
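The checklist above can be sketched as a small profiling routine: compute the share of each category for every profiling attribute, once for the population and once for the sample, then look at the differences. The attribute names (`region`, `customer_type`), the toy records, and the function names are all hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter

def profile(records, attributes):
    """Share of records in each category, for each profiling attribute."""
    result = {}
    for attr in attributes:
        counts = Counter(r[attr] for r in records)
        total = sum(counts.values())
        result[attr] = {cat: c / total for cat, c in counts.items()}
    return result

def compare_profiles(pop_profile, sample_profile):
    """Per attribute and category: sample share minus population share."""
    diffs = {}
    for attr, pop_shares in pop_profile.items():
        samp_shares = sample_profile.get(attr, {})
        cats = set(pop_shares) | set(samp_shares)
        diffs[attr] = {c: samp_shares.get(c, 0.0) - pop_shares.get(c, 0.0) for c in cats}
    return diffs

# Toy population of 10 customers (hypothetical data).
population = (
    [{"region": "North", "customer_type": "new"}] * 3
    + [{"region": "North", "customer_type": "ongoing"}] * 3
    + [{"region": "South", "customer_type": "new"}] * 2
    + [{"region": "South", "customer_type": "ongoing"}] * 2
)
sample = population[::2]  # every second record, just for the demo

attrs = ["region", "customer_type"]
diffs = compare_profiles(profile(population, attrs), profile(sample, attrs))
for attr, by_cat in diffs.items():
    for cat, d in sorted(by_cat.items()):
        print(f"{attr:>14} {cat:<8} {d:+.2f}")
```

In this toy case the sample matches the population on region but over-represents new customers by 10 points, which is the kind of gap the checklist is meant to catch.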
Outliers: Understand who your outliers are. I generally try to avoid grabbing ANY of my outliers in a sampling situation. Outliers generate bias: they will throw off all your sample means and generally mess up your conclusions.
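One common way to see who the outliers are is Tukey's 1.5 x IQR rule. That rule is my assumption here, since the post does not say how outliers are identified, and the values below are made up. `iqr_outliers` is a hypothetical helper name.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical order values with one extreme customer.
order_values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 250]
outliers = iqr_outliers(order_values)
print(outliers)
```

Whether to exclude the flagged records from a sample is a judgment call that depends on the analysis; the sketch only identifies them.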
You can calculate the total deviation of your sample's distributions from the population distributions fairly easily in a spreadsheet. A good sample doesn't deviate substantially.
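The same calculation is easy outside a spreadsheet too. As a sketch, one way to put a single number on the gap is the total variation distance: half the sum of absolute differences between sample and population shares. Picking this measure is my assumption; the post does not say which formula it uses, and the age shares below are invented.

```python
def total_variation_distance(pop_shares, sample_shares):
    """Half the sum of absolute share differences: 0 = identical, 1 = no overlap."""
    cats = set(pop_shares) | set(sample_shares)
    return 0.5 * sum(
        abs(sample_shares.get(c, 0.0) - pop_shares.get(c, 0.0)) for c in cats
    )

# Hypothetical age-band shares.
population_age = {"18-34": 0.30, "35-54": 0.40, "55+": 0.30}
close_sample = {"18-34": 0.28, "35-54": 0.41, "55+": 0.31}
skewed_sample = {"18-34": 0.55, "35-54": 0.35, "55+": 0.10}

tvd_close = total_variation_distance(population_age, close_sample)
tvd_skewed = total_variation_distance(population_age, skewed_sample)
print(f"close sample: {tvd_close:.2f}, skewed sample: {tvd_skewed:.2f}")
```

What counts as "substantial" is up to you; a threshold would have to be set per attribute and per use case.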
If I analyze a population on a regular basis, I set up a program that generates a base profile periodically, then run the same program against any sample and compare the results. A close match means the conclusions I draw from the sample should be the same as the conclusions I'd draw if I analyzed the entire base.