5.1 Sampling: The Basics

This chapter can be difficult to follow, but it contains some key concepts. Don't get lost in the details or formulas; focus on the big picture. There are many books on sampling, and this chapter will not make you an expert in survey sampling and data collection, but it will get you started.

Focus on the following key concepts that will be covered:

  1. Garbage In Garbage Out (G.I.G.O.)
  2. Sampling Error
  3. Non-Sampling Error
    • Selection Bias
    • Measurement Error
  4. Unbiased
  5. Central Limit Theorem

Often a subset of data is collected from a larger group in order to learn about the larger group. This is the basis of sampling: we take a subset of the data, called a sample, and use it to learn about what is called the target population. To learn about the target population, certain population quantities of interest are estimated from the sample.

Terminology and Notation

  1. Target population
  2. Sample
  3. Sampling frame
  4. Sampling design
  5. Sampling Error
  6. Non-Sampling Error

Typical Population Quantities of Interest

Population Mean:
\[ \mu = \frac{1}{N} \sum_{i=1}^{N} y_i \]

Population Total:
\[ \tau = \sum_{i=1}^{N} y_i = N \cdot \mu \]

Finite Population Variance:
\[ \sigma^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (y_i - \mu)^2 \]

Population Standard Deviation:
\[ \sigma = \sqrt{\sigma^2} \]

Population Proportion:
\[ \pi = \frac{1}{N} \sum_{i=1}^{N} z_i, \]

where

\[ z_i = \begin{cases} 1, & \text{if unit } i \text{ satisfies the specified condition of interest,} \\ 0, & \text{otherwise.} \end{cases} \]
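
As a minimal sketch, the population quantities above can be computed directly when every unit in the population is known. The five y-values and the condition "y greater than 3" are invented purely for illustration; only the formulas come from the definitions above.

```python
import math

# Hypothetical population of N = 5 units (values invented for illustration).
y = [2.0, 4.0, 4.0, 6.0, 9.0]
N = len(y)

mu = sum(y) / N                                      # population mean
tau = sum(y)                                         # population total, equals N * mu
sigma2 = sum((yi - mu) ** 2 for yi in y) / (N - 1)   # finite population variance, divisor N - 1
sigma = math.sqrt(sigma2)                            # population standard deviation

# Indicator z_i = 1 if unit i satisfies the condition (assumed here: y_i > 3), else 0.
z = [1 if yi > 3 else 0 for yi in y]
prop = sum(z) / N                                    # population proportion

print("mean:", mu, "total:", tau, "variance:", sigma2)
print("standard deviation:", round(sigma, 4), "proportion:", prop)
```

Note the divisor N - 1 in the variance, which follows the finite population definition used in this chapter rather than the divisor N that some texts use.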

There are various ways to take a sample and collect data, and some are better than others. Samples are often taken by people with little knowledge of sampling, which leads to various data analysis issues. Garbage in, garbage out (G.I.G.O.) is important to think about before sampling. If a sample is taken properly, simple statistics can often yield great insight. If it is not, neither simple statistics nor advanced statistical techniques may yield insight, and both can yield misleading information. Unfortunately, the cost of sampling is easy to understand, while its importance is not. Sampling can be very costly, and as a result it is often driven by financial concerns.

Many samples are collected by convenience sampling: the units sampled are those that are easiest to reach. Convenience samples are not very costly, but the information obtained can be very unreliable. An extreme example is collecting data on one's friends to learn about a target population such as people living in Bangkok. A more realistic example is collecting data from people at various malls in Bangkok on weekends to learn about people in Bangkok. Many statistical techniques you have learned and will learn do not apply to a convenience sample. It is easy to appreciate how many more observations convenience sampling can deliver than more complicated sampling techniques. But in sampling, as with many things, quality is more important than quantity; again, think G.I.G.O. It is better to collect fewer observations through proper sampling techniques than many through convenience sampling.
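
The quality-over-quantity point can be illustrated with a small simulation. This is only a sketch on synthetic data: the population size, the share of frequent mall visitors, and the spending distributions are all invented assumptions, chosen so that mall visitors differ from everyone else.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic population (all numbers invented): 20% of people are frequent
# mall visitors and tend to spend more than everyone else.
population = []
for _ in range(100_000):
    mall_goer = random.random() < 0.2
    spending = random.gauss(150, 20) if mall_goer else random.gauss(100, 20)
    population.append((mall_goer, spending))

mu = sum(s for _, s in population) / len(population)

# Convenience sample: 5,000 people, but only those found at the mall.
mall_only = [s for goer, s in population if goer]
convenience = random.sample(mall_only, 5_000)

# SRSWOR: only 200 people, drawn from the whole population.
srs = random.sample([s for _, s in population], 200)

conv_mean = sum(convenience) / len(convenience)
srs_mean = sum(srs) / len(srs)

print(f"population mean      : {mu:.1f}")
print(f"convenience (n=5000) : {conv_mean:.1f}")
print(f"SRSWOR      (n=200)  : {srs_mean:.1f}")
```

Under these assumptions the much larger convenience sample lands far from the population mean, because it only ever reaches one part of the population, while the small simple random sample lands close to it.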

Most statistical techniques do apply to samples taken by simple random sampling with replacement (SRSWR) and simple random sampling without replacement (SRSWOR) from a sufficiently large population. In SRSWR it is possible to observe the same unit (e.g. a person) more than once in the sample. In SRSWOR, once a unit is selected for the sample it is removed from the list of units available for subsequent selections, so the same unit cannot be sampled more than once. SRSWR and SRSWOR are two sampling designs that involve probability sampling. A probability sample is a sample in which the units in the target population have a specified probability of being selected. In addition, the probability of observing the sample, $s$, is independent of the y-values in the population, $y$; that is, $P(s \mid y) = P(s)$. In SRSWR and SRSWOR the units are sampled with equal probability, so both are types of equal probability sampling. In unequal probability sampling, the units are selected with different probabilities. SRSWR and SRSWOR are only two of the various probability sampling designs.
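
The difference between the two simple random sampling designs can be sketched with Python's standard `random` module. The sampling frame of ten unit labels and the sample size of five are hypothetical choices made for this illustration.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is repeatable

units = list(range(1, 11))   # hypothetical sampling frame: unit labels 1..10
n = 5

# SRSWR: each draw is made from the full list, so a unit can appear twice.
srswr = random.choices(units, k=n)

# SRSWOR: a selected unit is not available for later draws, so all units differ.
srswor = random.sample(units, n)

print("SRSWR :", srswr)
print("SRSWOR:", srswor)

# Under SRSWOR every set of n distinct units is equally likely,
# so each possible sample has probability 1 / C(N, n).
n_samples = math.comb(len(units), n)
print("possible SRSWOR samples:", n_samples)
```

`random.choices` draws with replacement and `random.sample` draws without replacement, which maps directly onto the two designs described above.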

This chapter will not go into depth on the various sampling designs, but will focus on the fundamental concepts of sampling. The major benefit of taking a probability sample is that it is the most reliable basis for extrapolating results to the target population of interest. A lesser benefit is that there often exists an unbiased estimator, $\hat{\theta}$, of the population quantity of interest, $\theta$. Most commonly, $\theta$ is the population mean, $\mu$, or the population total, $\tau$. An unbiased estimator is an estimator whose expected value equals the parameter of interest. A (sampling) design unbiased estimator is such that
\[ E[\hat{\theta}] = \sum_{s \in S} P(s)\,\hat{\theta} = \theta, \tag{5.1} \]

where $S$ denotes the collection of all possible samples. Note: in this chapter, and whenever sampling is discussed, expectations are taken over all possible samples. Equation (5.1) can be viewed in the same way as the expectation of a discrete random variable, $E[X] = \sum_i p_i x_i = \mu_X$, where $\hat{\theta}$ is analogous to $x_i$ and $P(s)$, the probability of observing a specific sample which yields $\hat{\theta}$, is analogous to $p_i$, the probability of observing $x_i$. The latter is the expectation of a random variable, $X$, while equation (5.1) is the expectation of an estimator, $\hat{\theta}$.
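
Design unbiasedness can be checked by brute force on a tiny population: list every possible sample, weight each estimate by the probability of its sample, and add them up. The sketch below does this for the sample mean under SRSWOR. The four y-values and the sample size of two are invented for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import combinations

y = [1, 3, 5, 11]            # hypothetical population, N = 4
N, n = len(y), 2
mu = Fraction(sum(y), N)     # population mean, the quantity theta being estimated

samples = list(combinations(y, n))   # S: every possible SRSWOR sample of size n
p_s = Fraction(1, len(samples))      # P(s) is the same for every s under SRSWOR

# E[theta_hat] = sum over s in S of P(s) * theta_hat(s),
# where theta_hat(s) is the mean of sample s.
expected = sum(p_s * Fraction(sum(s), n) for s in samples)

print("number of possible samples:", len(samples))
print("E[sample mean] equals mu:", expected == mu)
```

Individual samples give estimates that are too low or too high, but averaged over all possible samples the sample mean equals the population mean exactly, which is what design unbiasedness means.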

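For a biased estimator, the expectation of $\hat{\theta}$ does not equal $\theta$, i.e. $E[\hat{\theta}] \neq \theta$. The bias of $\hat{\theta}$ is defined as the difference between the expectation of $\hat{\theta}$ and the population quantity of interest $\theta$,

\[ \mathrm{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta. \]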

One factor in choosing which sampling design and which estimator to use to estimate $\theta$ is the mean square error (MSE) of the sampling design and the estimator. The MSE of $\hat{\theta}$ equals the expected squared error,

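\begin{align*}
\mathrm{MSE}(\hat{\theta}) &= E\left[(\hat{\theta} - \theta)^2\right] \\
&= E\left[\left(\hat{\theta} - E(\hat{\theta}) + E(\hat{\theta}) - \theta\right)^2\right] \\
&= E\left[\left(\hat{\theta} - E(\hat{\theta})\right)^2 + \left(E(\hat{\theta}) - \theta\right)^2 + 2\left(\hat{\theta} - E(\hat{\theta})\right)\left(E(\hat{\theta}) - \theta\right)\right] \\
&= E\left[\left(\hat{\theta} - E(\hat{\theta})\right)^2\right] + E\left[\left(E(\hat{\theta}) - \theta\right)^2\right] + 0 \\
&= \mathrm{Var}(\hat{\theta}) + \left[\mathrm{Bias}(\hat{\theta})\right]^2
\end{align*}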

The cross term vanishes because $E(\hat{\theta}) - \theta$ is a constant and $E[\hat{\theta} - E(\hat{\theta})] = 0$. For an unbiased estimator the bias is zero, so the MSE equals the variance.
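
The decomposition of the MSE into variance plus squared bias can be verified numerically by enumerating all samples. In the sketch below the population values and sample size are invented, and the sample median is used only as a convenient example of an estimator that is biased for the population mean; it is not an estimator proposed in this chapter.

```python
from fractions import Fraction
from itertools import combinations
from statistics import median

y = [1, 3, 5, 11, 20]        # hypothetical population, N = 5
N, n = len(y), 3
mu = Fraction(sum(y), N)     # theta: the population mean

samples = list(combinations(y, n))   # all possible SRSWOR samples
p_s = Fraction(1, len(samples))      # equal probability for each sample


def design_moments(estimator):
    """Return (bias, variance, mse) of an estimator over all possible samples."""
    values = [Fraction(estimator(s)) for s in samples]
    expectation = sum(p_s * v for v in values)
    bias = expectation - mu
    variance = sum(p_s * (v - expectation) ** 2 for v in values)
    mse = sum(p_s * (v - mu) ** 2 for v in values)
    return bias, variance, mse


def sample_mean(s):
    return Fraction(sum(s), len(s))


for name, estimator in [("sample mean", sample_mean), ("sample median", median)]:
    bias, variance, mse = design_moments(estimator)
    print(name, "| bias:", bias, "| variance:", variance, "| MSE:", mse)
    print("  MSE == Var + Bias^2:", mse == variance + bias ** 2)
```

For the sample mean the bias is zero and the MSE coincides with the variance. For the sample median the bias is nonzero in this skewed population, and the squared bias accounts for the gap between its MSE and its variance.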