How To Determine If A Sample Is Representative - AnalyticBridge
2020-07-04T22:03:00Z
https://www.analyticbridge.datasciencecentral.com/forum/topics/how-to-determine-if-a-sample-is-representative?commentId=2004291%3AComment%3A203877&feed=yes&xn_auth=no

Varun Bhargava, 2017-05-24:
Or even comparing the mean of the sample with the mean of the population could work, I guess.

Varun Bhargava, 2017-05-24:
Can't we just draw a scatter plot of the population and the sample and visually confirm that the sample selected for training is close to the population?<br />
And is it necessary to get rid of the outliers?

Lynne Mysliwiec, 2012-07-24:
<p>The answer is: No. No sample is guaranteed to be representative of the entire population, although the risk of a non-representative sample falls as the sampling fraction (sample size / total N) gets larger. The larger the share of the total population you sample, the lower the risk; the smaller the sample, the higher the risk.</p>
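<p>A rough sketch of the size effect, assuming simple random sampling without replacement and the textbook standard error of the sample mean with the finite population correction. The function name and the numbers are made up for illustration; sigma is taken to be the population standard deviation.</p>

```python
import math

def standard_error_of_mean(sigma, n, population_size):
    """Standard error of the sample mean for a simple random sample drawn
    without replacement, including the finite population correction."""
    if not 1 <= n <= population_size:
        raise ValueError("n must be between 1 and population_size")
    if population_size == 1:
        return 0.0
    # The correction shrinks to zero as the sample approaches the whole population.
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return sigma / math.sqrt(n) * fpc

# Hypothetical population: 10,000 records, standard deviation 15.
for n in (100, 1_000, 5_000, 10_000):
    print(n, round(standard_error_of_mean(15.0, n, 10_000), 4))
```

<p>The error shrinks as n grows and reaches zero when the sample is the entire population, which is the limiting case of the point above.</p>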
<p>To ensure a representative sample, I usually generate base statistics for the entire population (or get them from a trusted source like the Census Bureau). Then, I use key attributes that are general predictors of response, or cluster membership, or customer value as my short list of profiling attributes. </p>
<p>I want to make sure that a population sample is:<br/>- Geographically representative<br/>- Demographically representative (age, gender, family position, occupation)<br/>- Has the same household characteristics (wealth, net worth, income, RANGE of income, RANGE of home value, RANGE of net worth, home value, home ownership rate)<br/>- If customers, contains the same % of new vs. ongoing customers, has customers from all value groups, single-product vs. multi-product buyers</p>
<p>Outliers: Understand who your outliers are. I generally try to avoid grabbing ANY of my outliers in a sampling situation. Outliers generate bias: they will throw off all your sample means and will generally mess up your conclusions.</p>
<p>You can calculate how far your sample's distributions deviate from the population's distributions fairly easily in a spreadsheet -- a good sample doesn't deviate substantially.</p>
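<p>The same comparison can be scripted instead of done in a spreadsheet. This is only a sketch: it uses total variation distance as the deviation measure, which is my choice for illustration and not necessarily the calculation meant above, and the attribute and values are hypothetical.</p>

```python
from collections import Counter

def proportions(values):
    """Share of each category in a non-empty list of values."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    return {category: count / len(values) for category, count in counts.items()}

def total_variation_distance(population_values, sample_values):
    """Half the sum of absolute differences between category shares.
    0.0 means identical profiles; 1.0 means no overlap at all."""
    pop = proportions(population_values)
    samp = proportions(sample_values)
    categories = set(pop) | set(samp)
    return 0.5 * sum(abs(pop.get(c, 0.0) - samp.get(c, 0.0)) for c in categories)

# Hypothetical profiling attribute: customer region.
population_region = ["north"] * 500 + ["south"] * 300 + ["west"] * 200
good_sample = ["north"] * 50 + ["south"] * 30 + ["west"] * 20
skewed_sample = ["north"] * 80 + ["south"] * 15 + ["west"] * 5

print(total_variation_distance(population_region, good_sample))
print(total_variation_distance(population_region, skewed_sample))
```

<p>Running it once per profiling attribute (geography, demographics, value group, and so on) gives one deviation number per attribute; what counts as "substantial" is a judgment call.</p>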
<p>If you analyze a population on a regular basis, set up a program that generates a base profile periodically, then run the same program against any sample and compare the results. That helps ensure that the conclusions you draw from samples match the conclusions you'd draw if you analyzed the entire base.</p>

Sean Flanigan, 2012-07-20:
<p>Social research, where the number of measurements is not massive in certain studies (jury bias, etc.). Pharmaceutical research, where the cost of data collection can be astronomical. Both are self-selected (voter registration, lifestyle, or genetics; we can debate that one later), which is a problem for generalizing, but nonetheless, if we restrict the domain of generalization somewhat, the inferences may hold water.</p>

Matthew A. Riebel, 2012-06-04:
<p>Yes, a simple random sample has all the desired characteristics (mean and variance of statistics, etc.) of a representative sample of the population. But in practice, it is almost always impossible to draw a sample that is exactly a simple random sample.</p>
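<p>For reference, a minimal sketch of what a simple random sample looks like in code, assuming you already hold a complete list of the population (a sampling frame), which is exactly the part that is hard in practice. The IDs and the helper name are hypothetical.</p>

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units; every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seeded generator makes the draw reproducible
    return rng.sample(population, n)

population_ids = list(range(1, 1001))  # hypothetical customer IDs
sample_ids = simple_random_sample(population_ids, 50, seed=42)
print(len(sample_ids), len(set(sample_ids)))
```

<p>The draw itself is the easy part; non-response, missing records, and self-selection are what usually keep a real sample from being a true simple random sample.</p>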

Vincent Cotte, 2012-06-03:
<p>People have discussed the possibility of doing away with sampling given advances in technology, namely the falling cost of storage and the evolution of high-performance analytics, so that time-to-insight is no longer a barrier.</p>
<p>What do you think are examples where sampling is still the preferred approach?</p>

Antonio Irpino, 2012-06-01:
<p>If you know the population distribution, you can check whether the sample distribution is similar to that of the population (with a nonparametric test), but this is tautological. In any case, it is the correct application of the sampling scheme that in general guarantees representativeness. For example, the Central Limit Theorem and the Glivenko-Cantelli theorem underlie the definition of the most common probabilistic sampling schemes, including the rules for calculating the sample size.</p>
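<p>As one possible instance of such a nonparametric check, here is a sketch that computes the Kolmogorov-Smirnov statistic, the largest gap between the empirical distribution functions of the population and the sample. The test itself is not named above, so treat the choice as an assumption; the data are hypothetical, and turning the statistic into a p-value is left to a statistics library.</p>

```python
from bisect import bisect_right

def ks_statistic(population_values, sample_values):
    """Largest vertical gap between the two empirical CDFs."""
    pop = sorted(population_values)
    samp = sorted(sample_values)
    if not pop or not samp:
        raise ValueError("both inputs must be non-empty")
    gap = 0.0
    # Empirical CDFs only jump at observed values, so checking those is enough.
    for x in set(pop) | set(samp):
        f_pop = bisect_right(pop, x) / len(pop)
        f_samp = bisect_right(samp, x) / len(samp)
        gap = max(gap, abs(f_pop - f_samp))
    return gap

# Hypothetical numeric attribute, e.g. customer age rank from 1 to 100.
population = list(range(1, 101))
spread_sample = list(range(5, 101, 10))  # spread evenly across the range
low_sample = list(range(1, 11))          # drawn only from the bottom of the range

print(ks_statistic(population, spread_sample))
print(ks_statistic(population, low_sample))
```

<p>A small statistic says the sample tracks the population on that one attribute; it says nothing about attributes you did not check, which is why the sampling scheme still matters more than the after-the-fact test.</p>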