This chapter can be difficult to follow, but it contains some key concepts. Don’t get lost in the details or formulas; focus on the big picture. There are many books on sampling, and this chapter will not make you an expert in survey sampling and data collection, but it will get you started.
Focus on the following key concepts that will be covered:
Often a subset of data is collected from a larger group in order to learn about the
larger group. This is the basis of sampling. We take a subset of the data, called a sample, and
use this sample to learn about what is called the target population. To learn about the
target population, certain population quantities of interest are estimated from the
sample.
Terminology and Notation
This notation is often used without explanation.
Typical Population Quantities of Interest
where
There are various ways to take a sample and collect data, and some are better than others. Samples are often taken by people with little knowledge of sampling, which leads to various data analysis issues. Garbage in, garbage out (G.I.G.O.) is important to keep in mind before sampling. If a sample is taken properly, simple statistics can often yield great insight. If the sample is not taken properly, neither simple statistics nor advanced statistical techniques may yield insight, and they can actually yield misleading information. Unfortunately, the cost of sampling is easier to understand than its importance. Sampling can be very costly, and as a result sampling decisions are often driven by financial concerns.
Many samples are collected by convenience sampling, that is, from whichever units are easiest to reach. Convenience samples are not very costly, but the information obtained can be very unreliable. An extreme example of a convenience sample is collecting data on one's friends to learn about the target population, say people living in Bangkok. A more realistic example is collecting data from people at various malls in Bangkok on weekends to learn about people in Bangkok. Many statistical techniques you have learned and will learn do not apply to a convenience sample. It is easy to see that convenience sampling can yield more observations than more complicated sampling techniques. As with many things, in sampling quality is more important than quantity; again, think G.I.G.O. It is better to collect fewer observations through proper sampling techniques than many observations through convenience sampling.
Most statistical techniques do apply to samples taken by simple random sampling with replacement (SRSWR) and simple random sampling without replacement (SRSWOR) from a sufficiently large population. In SRSWR, it is possible to observe the same unit (e.g. person) more than once in the sample. In SRSWOR, once a unit is selected to be in the sample it is removed from the list of units for subsequent selections; thus it is not possible to sample the same unit more than once. SRSWR and SRSWOR are two sampling designs that involve probability sampling. A probability sample is a sample in which the units in the target population have a specified probability of being selected. In addition, the probability of the sample, s, being observed is independent of the y-values in the population, y; that is, P(s|y) = P(s). In SRSWR and SRSWOR the units are sampled with equal probability, so both are types of equal probability sampling. Unequal probability sampling is when the units are selected with different probabilities. There are various probability sampling designs.
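The difference between SRSWR and SRSWOR described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the population of ten unit labels, the sample size of five, and the function names `srswr` and `srswor` are assumptions made for the example and do not come from the chapter.

```python
import random


def srswr(units, n, rng):
    """Simple random sample WITH replacement: a unit may be drawn more than once."""
    return rng.choices(units, k=n)


def srswor(units, n, rng):
    """Simple random sample WITHOUT replacement: each unit is drawn at most once."""
    return rng.sample(units, n)


rng = random.Random(0)            # fixed seed, only so the sketch is reproducible
population = list(range(1, 11))   # hypothetical unit labels 1..10

with_repl = srswr(population, 5, rng)
without_repl = srswor(population, 5, rng)

print("SRSWR :", with_repl)       # repeated labels are possible here
print("SRSWOR:", without_repl)    # labels are always distinct here
```

In both functions every unit has the same chance of being drawn at each selection, and the draw never looks at the y-values of the units, which is the sense in which P(s|y) = P(s).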
This chapter will not go into depth on the various sampling designs mentioned above, but will
focus on the fundamental concepts of sampling. The major benefit of taking
a probability sample is that it is the most reliable basis for extrapolating results to the
target population of interest. A lesser benefit is that there often exists an
unbiased estimator, $\hat{\theta}$, for the population quantity of interest, $\theta$. Most commonly, the
population quantity of interest, $\theta$, is the population mean, $\mu$, or the population
total, $\tau$. An unbiased estimator is an estimator whose expected value equals
the parameter of interest. A (sampling) design unbiased estimator is such
that

$$E[\hat{\theta}] = \sum_{s \in \mathcal{S}} \hat{\theta}_s P(s) = \theta \tag{5.1}$$

where $\mathcal{S}$ denotes the collection of all possible samples and $\hat{\theta}_s$ is the value of the estimator computed from sample $s$. Note: in this chapter, and whenever
referring to sampling, expectation is taken over all possible samples.
Equation 5.1 can be viewed in a similar manner to equation ??, $E[X] = \sum_i p_i x_i = \mu_x$, where
$\hat{\theta}_s$ is analogous to $x_i$ and $P(s)$, the probability of observing a specific sample which yields
$\hat{\theta}_s$, is analogous to $p_i$, the probability of observing $x_i$. Equation ?? is the expectation
of a random variable, $X$, and equation 5.1 is the expectation of an estimator,
$\hat{\theta}$.
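To make the idea of an expectation taken over all possible samples concrete, the following Python sketch enumerates every SRSWOR sample of size 2 from a very small population and checks that the sample mean is design unbiased for the population mean. The four y-values and the sample size are hypothetical, chosen only so that the full list of samples is short; exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import combinations

y = [2, 4, 6, 9]   # hypothetical population y-values (N = 4)
n = 2              # hypothetical sample size

mu = Fraction(sum(y), len(y))                     # population mean

# Under SRSWOR every unordered sample of n distinct units is equally likely.
samples = list(combinations(range(len(y)), n))    # all possible samples, as unit indices
p_s = Fraction(1, len(samples))                   # P(s) for each sample


def sample_mean(s):
    """Value of the estimator for the sample s (a tuple of unit indices)."""
    return Fraction(sum(y[i] for i in s), len(s))


# Expectation over all possible samples: sum of (estimate for s) * P(s).
expected = sum(sample_mean(s) * p_s for s in samples)

print("population mean:", mu)
print("E[sample mean]: ", expected)
print("design unbiased:", expected == mu)
```

Each term in the sum pairs one possible estimate with the probability of the sample that produces it, in the same way that the expectation of a discrete random variable pairs each value with its probability.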
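For a biased estimator, the expectation of $\hat{\theta}$ does not equal $\theta$, i.e. $E[\hat{\theta}] \neq \theta$. The bias of
$\hat{\theta}$ is defined as the difference between the expectation of
$\hat{\theta}$ and the population quantity of
interest $\theta$,

$$\text{Bias}[\hat{\theta}] = E[\hat{\theta}] - \theta.$$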
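One factor in determining which sampling design and which estimator to use to
estimate $\theta$ is the mean square error, MSE, of the estimator under the sampling design. The
MSE of $\hat{\theta}$ equals the expected squared error,

$$\text{MSE}[\hat{\theta}] = E[(\hat{\theta} - \theta)^2] = \text{Var}[\hat{\theta}] + (\text{Bias}[\hat{\theta}])^2.$$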
The MSE for an unbiased estimator equals the variance since the bias of the estimator equals zero.
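The relationship between bias, variance, and MSE can be checked numerically by enumerating all possible samples. The sketch below is illustrative only: the population values, the sample size, and the choice of the sample maximum as a deliberately biased estimator of the population mean are all assumptions made for the example.

```python
from fractions import Fraction
from itertools import combinations

y = [2, 4, 6, 9]                        # hypothetical population y-values
n = 2                                   # hypothetical sample size
theta = Fraction(sum(y), len(y))        # quantity of interest: the population mean

samples = list(combinations(y, n))      # all SRSWOR samples, as tuples of y-values
p_s = Fraction(1, len(samples))         # each sample is equally likely


def design_summary(estimator):
    """Return (bias, variance, MSE) of an estimator over all possible samples."""
    values = [Fraction(estimator(s)) for s in samples]
    expectation = sum(v * p_s for v in values)
    bias = expectation - theta
    variance = sum((v - expectation) ** 2 * p_s for v in values)
    mse = sum((v - theta) ** 2 * p_s for v in values)
    return bias, variance, mse


def mean_estimator(s):
    """Sample mean: design unbiased for the population mean under SRSWOR."""
    return Fraction(sum(s), len(s))


def max_estimator(s):
    """Sample maximum: used here only as an example of a biased estimator."""
    return max(s)


mean_bias, mean_var, mean_mse = design_summary(mean_estimator)
max_bias, max_var, max_mse = design_summary(max_estimator)

print("sample mean: bias, variance, MSE =", mean_bias, mean_var, mean_mse)
print("sample max : bias, variance, MSE =", max_bias, max_var, max_mse)
```

For the sample mean the bias is zero, so its MSE and variance coincide. For the sample maximum the squared bias is added to the variance, so an estimator can have a large MSE even when its variance is modest.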