What type of sampling is used when the population is dispersed over a wide geographic region and it is costly to gather a complete list of the members of the population?

Cluster sampling is a probability sampling technique where researchers divide the population into multiple groups (clusters) for research. Researchers then select random groups with a simple random or systematic random sampling technique for data collection and data analysis.

Example: A researcher wants to conduct a study to judge the performance of sophomores in business education across the U.S. It is impossible to conduct a research study that involves a student in every university. Instead, by using cluster sampling, the researcher can group the universities from each city into one cluster. Together, these clusters cover the entire sophomore student population in the U.S. Next, the researcher randomly picks clusters for the study, using either simple random sampling or systematic random sampling. Finally, using simple or systematic sampling again, the researcher chooses the sophomores from each selected cluster on whom to conduct the study.

In this sampling technique, researchers analyze a sample that consists of multiple sample parameters such as demographics, habits, background, or any other population attribute that may be the focus of the research. This method is usually used when a statistical population is made up of groups that are similar to one another yet internally diverse. Instead of surveying the entire population, cluster sampling allows researchers to collect data by dividing the population into smaller, more manageable groups.

Cluster sampling definition

Cluster sampling is defined as a sampling method where the researcher creates multiple clusters of people from a population. The clusters share similar (homogeneous) characteristics with one another, and each cluster has an equal chance of being a part of the sample.

Example: Consider a scenario where an organization is looking to survey the performance of smartphones across Germany. They can divide the entire country’s population into cities (clusters), then select the towns with the highest populations and filter for residents who use mobile devices.

Types of cluster sampling

There are two ways to classify this sampling technique. The first way is based on the number of stages followed to obtain the cluster sample, and the second way is the representation of the groups in the entire cluster. In most cases, sampling by clusters happens over multiple stages. A stage is considered to be a step taken to get to the desired sample. We can divide this technique into single-stage, two-stage, and multiple-stage sampling.

Single-stage cluster sampling: 

As the name suggests, sampling is done just once. An example of single-stage cluster sampling – An NGO wants to create a sample of girls across five neighboring towns to provide education. Using single-stage sampling, the NGO randomly selects towns (clusters) to form a sample and extend help to the girls deprived of education in those towns.

Two-stage cluster sampling: 

Here, instead of selecting all the elements of a cluster, only a handful of members are chosen from each group by implementing systematic or simple random sampling. An example of two-stage cluster sampling – A business owner wants to explore the performance of his/her plants that are spread across various parts of the U.S. The owner creates clusters of the plants. He/she then selects random samples from these clusters to conduct research.

Multiple-stage cluster sampling:

Multiple-stage cluster sampling takes a step or a few steps further than two-stage sampling.

For conducting effective research across multiple geographies, one needs to form complicated clusters that can be achieved only using the multiple-stage sampling technique. An example of multiple-stage sampling by clusters – An organization intends to conduct a survey to analyze the performance of smartphones across Germany. They can divide the entire country’s population into cities (clusters), select the cities with the highest populations, and then filter for residents who use mobile devices.

Steps to conduct cluster sampling

Here are the steps to perform cluster sampling:

  1. Sample: Decide the target audience and the sample size.
  2. Create and evaluate sampling frames: Create a sampling frame by using either an existing frame or creating a new one for the target audience. Evaluate frames based on coverage and clustering and make adjustments accordingly. The groups will vary with the population, but together they should be mutually exclusive and comprehensive. Members of a sample are then selected individually.
  3. Determine groups: Determine the number of groups, keeping roughly the same average number of members in each group. Make sure each of these groups is distinct from the others.
  4. Select clusters: Choose clusters by applying random selection.
  5. Create sub-types: Depending on the number of steps researchers follow to form clusters, the design is classified as two-stage or multi-stage.
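The difference between the single-stage and two-stage designs described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the town names, cluster sizes and the helper names `single_stage_cluster_sample` and `two_stage_cluster_sample` are all hypothetical.

```python
import random

def single_stage_cluster_sample(clusters, n_clusters, rng):
    """Pick n_clusters at random and keep every member of each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, rng):
    """Pick n_clusters at random, then draw a simple random sample inside each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen
    }

# Hypothetical population: 6 towns (clusters) with 20 residents each.
towns = {f"town_{t}": [f"town_{t}_person_{i}" for i in range(20)] for t in range(6)}

rng = random.Random(0)  # fixed seed so the illustration is repeatable
one_stage = single_stage_cluster_sample(towns, 2, rng)
two_stage = two_stage_cluster_sample(towns, 2, 5, rng)

print({name: len(members) for name, members in one_stage.items()})  # 20 people per chosen town
print({name: len(members) for name, members in two_stage.items()})  # 5 people per chosen town
```

In the single-stage version only the clusters are randomized; the two-stage version adds a second random draw inside each selected cluster, which is what keeps the final sample small.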

Applications of cluster sampling

This sampling technique is used in area or geographical cluster sampling for market research. Surveying a broad geographic area can be expensive compared with sending surveys to clusters that are divided by region. The sample size has to be increased to achieve accurate results, but the cost savings make this increase in the number of clusters attainable.

Cluster sampling in statistics

The technique is widely used in statistics when the researcher can’t collect data from the entire population as a whole. It is often the most economical and practical solution for statisticians doing research. Take the example of a researcher who is looking to understand smartphone usage in Germany. In this case, the cities of Germany will form clusters. This sampling method is also used in situations like wars and natural calamities to draw inferences about a population, where collecting data from every individual in the population is impossible.

Cluster sampling advantages

There are multiple advantages to using cluster sampling. Here they are:

  • Consumes less time and cost: Sampling geographically divided groups requires less work, time, and cost. Concentrating limited resources on a few selected clusters is far more economical than sampling at random throughout a whole region.
  • Convenient access: Researchers can choose large samples with this sampling technique, and that’ll increase accessibility to various clusters.
  • Data accuracy: Since there can be large samples in each cluster, loss of accuracy in information per individual can be compensated.
  • Ease of implementation: Cluster sampling facilitates information from various areas and groups. Researchers can quickly implement it in practical situations compared to other probability sampling methods.

In comparison to simple random sampling, this technique can be useful in determining the characteristics of a group such as a population, and researchers can implement it without having a sampling frame that lists all the elements of the entire population.

Cluster sampling vs stratified sampling

Since cluster sampling and stratified sampling are quite similar, their finer nuances can be hard to tell apart. The major differences between cluster sampling and stratified sampling are:

Cluster sampling | Stratified sampling
Elements of a population are randomly selected to be a part of groups (clusters). | The researcher divides the entire population into even segments (strata).
Members from randomly selected clusters are a part of this sample. | Researchers consider individual components of the strata randomly to be a part of sampling units.
Researchers maintain homogeneity between clusters. | Researchers maintain homogeneity within the strata.
Researchers divide the clusters naturally. | The researchers or statisticians primarily decide the strata division.
The key objective is to minimize the cost involved and improve efficiency. | The key objective is to conduct accurate sampling, along with a properly represented population.

SAMPLE DESIGN

Introduction

When you have a clear idea of the aims of the survey, the particular data requirements, the degree of accuracy required, and have considered the resources and time available, you are in a position to make a decision on the size and form of the collection. The major concerns that should be addressed at this stage are:
  • defining the population, frame and units;
  • calculating the sample size;
  • determining the sampling methodology;
  • choosing an appropriate data collection method; and
  • determining the estimation method to be used.
The following discussion will give a brief introduction to some basic terms and ideas in sampling and an outline of sample designs commonly used. The main focus of the discussion will be on determining an appropriate sampling method.

PROBABILITY AND NON-PROBABILITY

Non-Probability Samples

If the probability of selection for each unit is unknown, or cannot be calculated, the sample is called a non-probability sample. Non-probability samples are often less expensive, easier to run and don't require a frame.

However, it is not possible to accurately evaluate the precision (ie. closeness of estimates under repeated sampling of the same size) of estimates from non-probability samples since there is no control over the representativeness of the sample. If a non-probability sample is carried out carefully, then the bias in the results can be reduced.

As it is dangerous to make inferences about the target population on the basis of a non-probability sample, non-probability methodology is often used to test aspects of a survey such as questionnaire design, processing systems etc. rather than make inferences about the target population. Different types of non-probability samples are discussed below.

Quota Sampling

To select a quota sample, the interviewers select respondents until a pre-determined number of respondents in certain categories are surveyed (eg. the interviewers might select the sample to achieve a certain age/sex breakdown reflective of the target population).

This is the method of sampling commonly used by market researchers and political pollsters as it can produce fairly good estimates if it is properly conducted. When top up units are selected randomly to fill a quota, and no element of judgment is used by the researcher for unit selection, it is very similar to a probability sample. However, when non-response is significant (which is almost always the case for voluntary surveys), quota sampling can under-represent those portions of the population that are unwilling to respond or hard to contact. This is of particular concern when the data items collected influence the likelihood of response. See also the section on Non-Response in Errors in Statistical Data for further details.

Convenience and Haphazard Sampling

Street corner interviews, magazine and newspaper questionnaires and phone-in polls are all examples of convenience or haphazard samples. These types of surveys are subject to biased or unrepresentative samples, as often only persons who feel strongly about the topic will respond. These surveys also have a tendency to ask questions that are loaded or have a biased wording. Street corner interviews can be biased depending on the timing and the placement of the interviewer. There is no control over selecting the sample of respondents in any of these methods; however, they are very cheap and easy to administer.

Judgement or Purposive Sampling

Judgement sampling is where a 'representative' sample is chosen by an expert in the field of study. Judgement sampling is subject to unknown biases but may be justified for very small samples. This form of sampling can be used to choose a sample for a pilot test of a probability survey but inferences about the population should not be made from judgement samples. Judgement sampling is also known as purposive sampling.

Probability Samples

A probability sample is one in which every unit of the population has a known non-zero probability of selection and is randomly selected. A probability sample allows inferences about the target population to be made. By knowing the selection probability for each unit, objective selections can then be made which should produce a more representative sample. Known probabilities also allow the measurement of the precision of the survey estimates in terms of standard errors and confidence intervals. Probability samples require a frame for selection purposes and thus are relatively expensive in terms of operational costs and frame maintenance. The most common sampling techniques, such as simple random, systematic, stratified, multi-stage and cluster sampling, are all examples of probability samples. These will be looked at later in this chapter.

Choosing Between Probability and Non-Probability Samples

The choice between using a probability or a non-probability approach to sampling depends on a variety of factors:
  • the objectives and scope of the survey;
  • the method of data collection suitable to those objectives;
  • the precision required of the results and whether that precision needs to be able to be measured;
  • the availability of a sampling frame;
  • the resources required to maintain the frame; and
  • the availability of extra information about the units in the population.
Probability sampling is normally preferred when conducting major surveys, especially when a population frame is available ensuring that we are able to select and contact each unit in the (frame) population. However, where time and financial constraints make probability sampling infeasible, or where knowing the level of accuracy in the results is not an important consideration, non-probability samples do have a role to play since they are inexpensive, easy to run and no frame is required. For this reason, when conducting qualitative (investigative), rather than quantitative research, non-probability samples and techniques such as case studies are generally superior to probability samples and quantitative estimation.

Non-probability sampling can also be useful when pilot testing surveys. If a non-probability sample is carried out carefully, then the bias in the results can be reduced. Note that with non-probability methods it is dangerous to make inferences about the whole population.

Quota sampling may be appropriate when response rates are expected to be low. True probability sampling would be more expensive and may require top up units to be selected. If quota sampling is used, selection of units should be as random as possible and care should be taken to avoid introducing a bias.

Unlike certain non-probability samples, probability sampling involves a random selection of units. This allows us to quantify the standard error of estimates and hence allow confidence intervals to be formed and hypotheses to be formally tested. The main disadvantages with probability sampling involve cost, such as the costs involved with frame maintenance and surveying units which are difficult to contact.

SIMPLE RANDOM SAMPLING

Simple random sampling (SRS) is a probability selection scheme where each unit in the population is given an equal probability of selection, and thus every possible sample of a given size has the same probability of being selected. One possible method of selecting a simple random sample is to number each unit on the sampling frame sequentially, and make the selections by generating "selection numbers" from a random number table or from some form of random number generator.

With (SRSWR) and Without (SRSWOR) Replacement

Simple random sampling can involve the units being selected either with or without replacement. With replacement sampling allows the units to be selected multiple times whilst without replacement only allows a unit to be selected once. Without replacement sampling is by far the more commonly used method.
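As a small illustration of the two schemes, the sketch below numbers the units on a frame and draws from it with Python's standard `random` module. The frame of 37 numbered units and the function names are assumptions made for the example only.

```python
import random

def srs_without_replacement(frame, n, rng):
    """SRSWOR: each unit can be selected at most once."""
    return rng.sample(frame, n)

def srs_with_replacement(frame, n, rng):
    """SRSWR: each draw is made from the full frame, so repeats are possible."""
    return rng.choices(frame, k=n)

# Hypothetical frame: units numbered sequentially from 1 to 37.
frame = list(range(1, 38))
rng = random.Random(42)  # stands in for a random number table

wor = srs_without_replacement(frame, 5, rng)
wr = srs_with_replacement(frame, 5, rng)
print("without replacement:", wor)
print("with replacement:   ", wr)
```

One practical consequence: a without-replacement sample can never be larger than the frame, whereas a with-replacement sample can be of any size.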

Advantages

The advantage of simple random sampling lies in its simplicity and ease of use, especially when only a small sample is taken.

Disadvantages

Simple random sampling does, however, require a complete list of all population units as each unit needs to have a unique number associated with it to enable random selection. This sampling scheme also becomes unwieldy for large sample sizes and can be expensive if the sample is spread over a wide geographic area. In practice, simple random sampling is rarely used because there is almost always a more efficient method of designing the sample (in terms of producing accurate results for a given cost). Nevertheless, simple random sampling forms the basis of a number of the more complex methods of sample design, and is used as a benchmark to which other designs are compared.

USE OF AUXILIARY INFORMATION

We have discussed methods of drawing simple random samples. There is no point in using any other kind of sample selection method if one knows no more about the population to be sampled than the existence of each of the units in the population. However, some further information is often available about each of the population units from simple observation or data from a previous study (for example, a census). This information can be in the form of demographic variables such as age, sex and income or geographical/business types such as industry, employment, state, region and sector (private or public). This further information about the population units is called auxiliary information. Such information can be used in the selection and the estimation process to obtain more accurate estimates or reduce costs. Sampling techniques using this information include systematic sampling, stratified sampling (including post-stratification) and cluster sampling. Auxiliary information is also used in estimation techniques such as ratio and regression estimations. One of the major aspects of sample design is the efficient use of auxiliary or supplementary information. As there is often a cost involved in obtaining auxiliary information, it is necessary to quantify the gains that are obtained through using auxiliary information and balance it against the cost of acquisition.

SYSTEMATIC SAMPLING

Systematic sampling provides a simple method of selecting the sample when the sampling frame exists in the form of an explicit list. Where the frame contains auxiliary information then the units in the frame are ordered with respect to that auxiliary data (eg employment size of a business). A fixed interval (referred to as the skip) is then used to select units from the sampling frame. Systematic sampling is best explained by describing how the sample selections are made.

Method

Assume that the population has N units (eg. 37) and that we wish to select a sample of size n (eg. 5). The list of N population units is ordered in some way. The steps taken in selecting a systematic random sample are:
  1. Calculate the skip interval k = N/n.
  2. Choose a random start, r, between 1 and k.
  3. Select the rth unit in the list and every kth unit thereafter: r, r+k, r+2*k, r+3*k,..., r+(n-1)*k.
The value of k is usually not an integer. In this case we either
  • round k to the nearest integer; or
  • keep k a fraction and round r+i*k (where i = 0,...,n-1) to the nearest integer.

Example: Calculating the Skip Interval

Say that we wanted to take a systematic sample of size 5 from a population of 37 units. The sample size does not divide evenly into the population. The two options for coping with this are discussed below. Order the population units in some way and number them from 1 to 37.
      N = 37, n = 5

      k = 37/5 = 7.4

1. Round k to the nearest integer
      k = 7, r = 4
      Then the sample units are: 1st unit = 4, 2nd unit = 4+7 = 11, 3rd unit = 4+14 = 18, 4th unit = 4+21 = 25 and 5th unit = 4+28 = 32.
      Therefore, sample = (4, 11, 18, 25, 32)
2. Round r+i*k to the nearest integer
      k = 7.4, r = 4.2

      sample = (4.2, 11.6, 19, 26.4, 33.8) = (4, 12, 19, 26, 34)
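The two options in this example can be reproduced with a short Python sketch. The function names are hypothetical, and the random start r is passed in explicitly (4 and 4.2, as in the example) rather than drawn at random, so that the output matches the worked figures.

```python
import math

def round_half_up(x):
    """Round to the nearest integer, with halves rounded up."""
    return math.floor(x + 0.5)

def systematic_rounded_skip(N, n, r):
    """Option 1: round the skip k = N/n to an integer, then step from r."""
    k = round_half_up(N / n)
    return [r + i * k for i in range(n)]

def systematic_fractional_skip(N, n, r):
    """Option 2: keep k fractional and round each position r + i*k."""
    k = N / n
    return [round_half_up(r + i * k) for i in range(n)]

print(systematic_rounded_skip(37, 5, 4))       # [4, 11, 18, 25, 32]
print(systematic_fractional_skip(37, 5, 4.2))  # [4, 12, 19, 26, 34]
```

A design note: when k is rounded up rather than down, a late random start can push the last selection past the end of the list, which is one reason the fractional-skip option is sometimes preferred.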


Features of Systematic Random Sampling

The usefulness of systematic random sampling depends upon the strength of the relationship between the variable of interest and the benchmark variable/s used to order the list. The more highly correlated they are, the greater the gains in accuracy achieved over simple random sampling. This is because we are ensuring that a more representative sample of population units is selected. If there is a strong relationship between the variable of interest and the benchmark variable/s, then ordering the list by the benchmark variable/s will yield more accurate results using systematic sampling than simple random sampling. Systematic random sampling using ordered lists ensures a range of units will be selected in the sample.

Advantages

  • Systematic random sampling is often easier to use than simple random sampling, especially for large samples, as only one random number (the random start) is required, rather than a random number for every unit as required for SRS.
  • Systematic random sampling is a without replacement sampling scheme and usually gives more accurate results (lower standard errors) than simple random samples of the same size due to the closer control over selection. This is particularly the case if the ordering of the units in the list is related to characteristics of the variable of interest.

Disadvantages
  • Periodicity bias
    If, after ordering, the variable of interest is periodic/cyclic in nature then it is possible to obtain an estimate which is less accurate than a simple random sample if the periodicity coincides with the skip interval. For example, daily sales in a supermarket can be expected to peak on weekends. If the skip is calculated as 7, every selected day falls on the same day of the week, which introduces a bias and yields samples that are not representative of the population.
  • Complete list of the population
    To perform the ordering, a complete list of the population is required.
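The periodicity problem is easy to see with made-up numbers. The sketch below assumes a year of daily sales that are 100 on weekdays and 200 on Saturdays and Sundays; the figures and the helper name `systematic_sample` are purely illustrative.

```python
def systematic_sample(values, k, r):
    """Take the r-th value (1-based) and every k-th value after it."""
    return values[r - 1::k]

# Hypothetical daily sales for 52 weeks: Monday to Friday 100, weekend days 200.
week = [100, 100, 100, 100, 100, 200, 200]
sales = week * 52

true_mean = sum(sales) / len(sales)
print("true mean:", round(true_mean, 2))

# With a skip of 7, every start lands on the same weekday each week.
for r in range(1, 8):
    sample = systematic_sample(sales, 7, r)
    print("skip 7, start", r, "-> sample mean", sum(sample) / len(sample))

# A skip that does not match the weekly cycle mixes weekdays and weekends.
sample = systematic_sample(sales, 10, 1)
print("skip 10, start 1 -> sample mean", round(sum(sample) / len(sample), 2))
```

With a skip of 7 the estimate is either 100 or 200 depending on the start, never close to the true mean; a skip of 10 cycles through all days of the week and lands much nearer.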

STRATIFIED SAMPLING

Stratified sampling is a technique which uses auxiliary information which is referred to as stratification variables to increase the efficiency of a sample design. Stratification variables may be geographical (eg. state, rural/urban) or non-geographical (eg. age, sex, number of employees). Stratified sampling involves
  • the division or stratification of the population into homogeneous (similar) groups called strata; and
  • selecting the sample using SRS or systematic sampling within each stratum and independent of the other strata.
Stratification almost always improves the accuracy of estimates. This is because the population variability can be thought of as having components within strata and between strata. By independently sampling within each stratum we ensure each stratum is appropriately reflected in the sample, so between stratum variability is eliminated and we are left only with the within stratum component. With this factor in mind we see that the most efficient way to stratify is to have strata which are as different from each other as possible (to maximise the variance which is being eliminated) while being internally as homogeneous as possible (to minimise the variance remaining).
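The split of population variability into a within-strata and a between-strata component can be checked numerically. The sketch below uses invented values for three hypothetical size strata and works with sums of squared deviations, for which the identity total = within + between holds exactly.

```python
def mean(xs):
    return sum(xs) / len(xs)

def sum_of_squares_split(strata):
    """Return (total, within, between) sums of squared deviations."""
    all_values = [x for values in strata.values() for x in values]
    grand = mean(all_values)
    total = sum((x - grand) ** 2 for x in all_values)
    # Within: deviations of each unit from its own stratum mean.
    within = sum(sum((x - mean(v)) ** 2 for x in v) for v in strata.values())
    # Between: deviations of stratum means from the overall mean, weighted by stratum size.
    between = sum(len(v) * (mean(v) - grand) ** 2 for v in strata.values())
    return total, within, between

# Hypothetical population values grouped into three strata.
strata = {
    "small": [2, 3, 4, 3],
    "medium": [20, 22, 21, 25],
    "large": [200, 210, 190, 205],
}
total, within, between = sum_of_squares_split(strata)
print("total  :", total)
print("within :", within)
print("between:", between)
```

Here almost all of the variability sits between the strata, which is the situation in which stratification pays off most: the large between component is removed and only the small within component remains.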

Practical Considerations

When planning a stratified sample, a number of practical considerations should be kept in mind:
  • the strata should be designed so that they collectively include all members of the target population;
  • each member must appear in only one stratum, ie strata should be non overlapping; and
  • the definitions of boundaries of the strata should be precise and unambiguous.

Example of Stratification

As an example of stratification, if we were interested in the educational background of members of a Science faculty at a University, we could select a sample from the faculty as a whole or select samples independently from each of the departments within the faculty, such as mathematics, physics, chemistry etc. This latter method would ensure that each department was adequately represented (which would not necessarily happen otherwise), and should increase the precision of the overall estimate. If on the other hand, we were interested in the level of education (PhD, Masters, Bachelor) rather than the background we should stratify the faculty by level (Professor, Senior Lecturer, Lecturer) rather than by the department. Using this stratification we are more likely to find uniformity of educational standards within a level rather than an area of work, and we are also more likely to separate the better qualified from the less qualified.

Advantages

The four main benefits of stratified sampling are:
  • minority groups of interest, or groups with high variability, can be oversampled: a greater proportion of units can be selected from minority groups than from the majority group;
  • the results are more accurate: sampling error is reduced because of the grouping of similar units (it should be remembered that there is no gain in accuracy from stratifying by a factor unrelated to the subject of the survey);
  • different selection and estimation procedures can be applied to the various strata; and
  • separate information can be obtained about the various strata: stratification also permits separate analyses on each group and allows different interests to be analysed for different groups.

Disadvantages

  • an increase in costs;
  • a danger of stratifying too finely;
  • availability of auxiliary information (however, it is possible to form a stratum of units for which no information is available)

Number of Strata

There is no rule as to how many strata the population should be divided into. This depends on the population size and homogeneity and the format in which the output is required. If output is required for some sub-groups of the population these subgroups must be considered as separate strata.

ABS Surveys

All surveys conducted by the Australian Bureau of Statistics employ stratification. Household surveys (such as the Monthly Population Survey and the Household Expenditure Survey) use geographic strata. Business surveys use variables such as state and industry strata and use some measure of size (eg employment) to form size strata.

Allocation of Sample

An important consideration after deciding on the appropriate stratification is the way in which the total sample is to be allocated to each stratum. There are three common methods of calculating the number of units required from each stratum.
  • Equal Allocation
    Equal allocation is the simplest form of allocation. It involves selecting the same number of units from each stratum. This method is rarely used as it does not take into account the size of each stratum or the variability within each stratum. However, equal allocation does ensure approximately equal reliability across all strata (assuming that the sample is not a significant proportion of the population [no more than 10% of the population would be reasonable]).
  • Proportional Allocation
    In proportional allocation, the sample allocated to each stratum is proportional to the number of units in the stratum. Say we were sampling 10% of the population; we would then sample 10% of each stratum. This method takes into account the size of each stratum; larger strata will have larger samples taken from them. It does not however allow for the differences in the variability within each stratum. This method is used when no information is available on the variability within the strata (stratum variances) or where there is no major difference in variability between the strata.
  • Optimal Allocation
    This allocation method aims to minimise the standard error of the estimate across the population. This is achieved by allocating a relatively large sample to those strata which are highly variable, and a relatively small sample to those strata which have low variability. Variability is based on some auxiliary information available for the stratum, for example, previous survey results or pilot test data. Optimal allocation can also be used to account for differing costs of sampling between the strata, ie. minimise the standard error for a given cost.
Another method of sample selection is to have a completely enumerated stratum in our sample. This is where the units that contribute significantly to our estimates are placed in a single stratum and every unit within the stratum is then selected.
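The three allocation methods can be compared side by side in a short sketch. The stratum sizes and standard deviations below are invented, and the optimal allocation is implemented as Neyman allocation (sample size proportional to stratum size times stratum standard deviation, with equal sampling costs assumed); that choice of formula is an assumption of this example, since the text does not give one.

```python
def equal_allocation(sizes, n):
    """Same number of units from every stratum."""
    share = n / len(sizes)
    return {h: share for h in sizes}

def proportional_allocation(sizes, n):
    """Sample size proportional to the number of units N_h in each stratum."""
    total = sum(sizes.values())
    return {h: n * N_h / total for h, N_h in sizes.items()}

def optimal_allocation(sizes, std_devs, n):
    """Neyman allocation: n_h proportional to N_h * S_h (equal unit costs assumed)."""
    weights = {h: sizes[h] * std_devs[h] for h in sizes}
    total = sum(weights.values())
    return {h: n * w / total for h, w in weights.items()}

# Hypothetical strata: many small, stable units and a few large, variable ones.
sizes = {"small": 6000, "medium": 3000, "large": 1000}
std_devs = {"small": 2, "medium": 10, "large": 50}
n = 1000  # a 10% sample of the 10,000 units

equal = equal_allocation(sizes, n)
proportional = proportional_allocation(sizes, n)
optimal = optimal_allocation(sizes, std_devs, n)
for name, alloc in [("equal", equal), ("proportional", proportional), ("optimal", optimal)]:
    print(name, {h: round(v, 1) for h, v in alloc.items()})
```

The functions return unrounded values; in practice they would be rounded to whole units and capped at the stratum size. In this example the optimal allocation asks for roughly 540 of the 1,000 "large" units, which is close to the point where treating that stratum as completely enumerated becomes natural.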

Estimation

As samples are selected independently from each stratum, estimates are also usually made separately for each stratum, then added to give the overall estimate (eg estimated unemployment for Australia will be the sum of the state unemployment estimates). Similarly, standard errors or variances (measures of sample variability) are calculated for each stratum and then all stratum-specific variances are added up to obtain the overall variance. The addition of variances is possible because the sample is selected independently from each stratum. This overall variance can then be used to calculate an overall standard error.
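A minimal sketch of this estimation step follows, assuming simple random sampling without replacement within each stratum and using the usual expansion estimator of a total. The stratum names, sizes and sample values are invented for illustration.

```python
import math
import statistics

def stratum_total_and_variance(N_h, sample):
    """Estimated total and its variance for one stratum (SRSWOR assumed)."""
    n_h = len(sample)
    y_bar = statistics.mean(sample)
    s2 = statistics.variance(sample)  # sample variance, divisor n_h - 1
    total = N_h * y_bar
    variance = N_h ** 2 * (1 - n_h / N_h) * s2 / n_h
    return total, variance

def stratified_total(strata):
    """Add stratum totals and stratum variances; return (estimate, standard error)."""
    parts = [stratum_total_and_variance(N_h, sample) for N_h, sample in strata.values()]
    estimate = sum(t for t, _ in parts)
    variance = sum(v for _, v in parts)
    return estimate, math.sqrt(variance)

# Hypothetical strata: (population size N_h, sampled values).
strata = {
    "state_A": (1000, [3, 5, 4, 6, 2]),
    "state_B": (500, [10, 12, 11, 9, 13]),
}
estimate, standard_error = stratified_total(strata)
print("estimated total:", estimate)
print("standard error :", round(standard_error, 2))
```

Note that it is the variances that are added, not the standard errors; the overall standard error is the square root of the summed variance.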

Post Stratification

There will be occasions when we may like to stratify by a certain variable, say age or sex, but we cannot because we do not know the age and sex of our population units until we select them. Post-stratification is a method used when stratification is not possible before the survey. The stratification variable can then be used after the survey is conducted, either to improve the efficiency of estimates or to obtain estimates corresponding to different categories of that variable (eg. sex), by stratifying the sample as if the benchmark information had been available previously.
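As a rough sketch of the idea, the example below re-weights a sample mean using benchmark counts that are known for the population but only observed for sample units after selection. The data, the benchmark counts and the function name `post_stratified_mean` are hypothetical.

```python
from collections import defaultdict

def post_stratified_mean(sample, population_counts):
    """Weight each post-stratum's sample mean by its known population share."""
    groups = defaultdict(list)
    for category, value in sample:
        groups[category].append(value)
    N = sum(population_counts.values())
    estimate = 0.0
    for category, N_h in population_counts.items():
        values = groups.get(category)
        if not values:
            raise ValueError(f"no sample units in post-stratum {category!r}")
        estimate += N_h / N * (sum(values) / len(values))
    return estimate

# Hypothetical survey: sex is learned only at interview, but benchmark counts are known.
sample = [("F", 10), ("F", 12), ("F", 14), ("M", 20)]
population_counts = {"F": 500, "M": 500}

unweighted = sum(value for _, value in sample) / len(sample)
adjusted = post_stratified_mean(sample, population_counts)
print("unweighted mean     :", unweighted)  # 14.0
print("post-stratified mean:", adjusted)    # 16.0
```

The sample happened to contain three women and one man, so the unweighted mean leans towards the women's values; post-stratification restores the 50/50 balance known from the benchmark.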

CLUSTER AND MULTI-STAGE SAMPLING

So far we have considered a number of ways in which a sample of population units can be selected and population characteristics estimated on the basis of this sample. In this section consideration is given to a sampling scheme where the selection of population units is made by selecting particular groups (or clusters) of such units and then selecting all or some of the population units within selected groups for inclusion in the sample.

Cluster Sampling

Cluster sampling involves selecting a sample in a number of stages (usually two). The units in the population are grouped into convenient, usually naturally occurring clusters. These clusters are non-overlapping, well-defined groups which usually represent geographic areas. At the first stage of selection, a number of clusters are selected. At the second stage, all the units in the chosen clusters are selected to form the sample.

Practical Considerations

  • The clusters should be designed so that they collectively include all members of the target population;
  • each member must appear in one and only one cluster; and
  • the definitions or boundaries of the clusters should be precise and unambiguous; in the case of geographical clusters natural and man-made boundaries such as rivers and roads are often used to delimit the cluster boundaries.

Advantages

Cluster sampling involves selecting population units that are "close" together and does not require all the population units to be listed. Cluster sampling has two advantages:
  • it eliminates the need for a complete list of all units in the population; and
  • it ensures that selected population units will be closer together, thus enumeration costs for personal interviews will be reduced, and field work will be simplified.

Disadvantages

In general, cluster sampling is less accurate than SRS (for samples of the same size) because the sample obtained does not cover the population as evenly as in the case of SRS. However, it is often preferred because it is more economical. For example, if we take a simple random sample of 10,000 households across the whole of Australia then we are more likely to cover the population more evenly, but it is more expensive than sampling 50 clusters of 200 households.
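The loss of accuracy can be illustrated with a small simulation. Everything in it is an assumption made for the sake of the example: a synthetic population of 100 clusters of 20 households in which households in the same cluster resemble each other, and a comparison of 100 households drawn by SRS against 5 whole clusters (also 100 households).

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the illustration is repeatable

# Hypothetical population: households in the same cluster share a cluster-level
# effect, so neighbours are more alike than households picked at random.
clusters = []
for _ in range(100):
    effect = rng.gauss(0, 10)
    clusters.append([50 + effect + rng.gauss(0, 5) for _ in range(20)])
population = [y for cluster in clusters for y in cluster]

def srs_mean(n):
    """Mean of a simple random sample of n households."""
    return statistics.mean(rng.sample(population, n))

def cluster_mean(n_clusters):
    """Mean of all households in n_clusters randomly chosen clusters."""
    chosen = rng.sample(clusters, n_clusters)
    return statistics.mean([y for cluster in chosen for y in cluster])

reps = 1000
sd_srs = statistics.pstdev([srs_mean(100) for _ in range(reps)])
sd_cluster = statistics.pstdev([cluster_mean(5) for _ in range(reps)])
print("spread of SRS estimates    :", round(sd_srs, 2))
print("spread of cluster estimates:", round(sd_cluster, 2))
```

Both designs interview the same number of households, but the cluster estimates vary far more from one draw to the next because each cluster contributes many similar households. How large the gap is depends entirely on how alike units within a cluster are.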

Multi-stage Sampling

Multi-stage sampling involves selecting a sample in at least two stages. At the first stage, large groups or clusters of population units are selected. These clusters are designed to contain more units than are required for a final sample. At the second stage, units are sampled from the selected clusters to derive the final sample. If more than two stages are used, the process of selecting "sub-clusters" within clusters continues until the final sample is achieved. The same practical considerations apply to multi-stage sampling as to cluster sampling.

Example: A Three-Stage Sample

The following is an example of the stages of selection that may be used in a three-stage household survey.
  • Stage 1. Electoral Subdivisions
    Electoral subdivisions (clusters) are sampled from a city or state.
  • Stage 2. Blocks
    Blocks of houses are selected from within the electoral subdivisions.
  • Stage 3. Houses
    Houses are selected from within the selected blocks.
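The three stages listed above can be sketched as nested random selections. The sampling frame below is entirely hypothetical (8 subdivisions, 6 blocks each, 10 houses per block), the function name is invented for this example, and simple random sampling is assumed at every stage. A real survey might use different selection probabilities at each stage.

```python
import random

def three_stage_sample(frame, n_subdivisions, n_blocks, n_houses, seed=None):
    """Select subdivisions, then blocks within them, then houses within blocks.

    `frame` is a nested dict: subdivision -> block -> list of houses.
    """
    rng = random.Random(seed)
    selected = []
    # Stage 1: sample electoral subdivisions.
    for subdivision in rng.sample(sorted(frame), n_subdivisions):
        blocks = frame[subdivision]
        # Stage 2: sample blocks within each selected subdivision.
        for block in rng.sample(sorted(blocks), n_blocks):
            # Stage 3: sample houses within each selected block.
            for house in rng.sample(blocks[block], n_houses):
                selected.append((subdivision, block, house))
    return selected

# Hypothetical frame used only for illustration.
frame = {
    f"subdivision_{s}": {
        f"block_{s}_{b}": [f"house_{s}_{b}_{h}" for h in range(10)]
        for b in range(6)
    }
    for s in range(8)
}
sample = three_stage_sample(frame, n_subdivisions=3, n_blocks=2, n_houses=4, seed=0)
print(len(sample))  # 3 * 2 * 4 = 24
```

The sketch builds the whole frame in advance for convenience. In practice, blocks need to be listed only for the selected subdivisions, and houses only for the selected blocks.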

Uses of Multi-stage Sampling
Multi-stage sampling is generally used when it is costly or impossible to form a list of all the units in the target population. Typically, a multi-stage sample gives less precise estimates than a simple random sample of the same size. However, a multi-stage sample is often more precise than a simple random sample of the same cost, and it is for this reason that the method is employed.

Advantages and Disadvantages

The advantages and disadvantages of multi-stage sampling are similar to those for cluster sampling. However, to compensate for the lower accuracy, either the number of clusters selected in the first stage should be relatively large (but this increases the cost of the survey) or the sampling fraction for later stages should be high (i.e. a large percentage of each cluster should be selected).

SAMPLE SIZE ISSUES AND DETERMINATION

An important aspect of sample design is deciding upon the sample size given the objectives and constraints that exist. Since every survey is different there are no fixed rules for determining sample size. However, factors to be considered include
  • the population size and variability within the population;
  • resources (time, money and personnel);
  • level of accuracy required of the results;
  • level of detail required in the results;
  • the likely level of non-response;
  • the sampling methods used; and
  • relative importance of the variables of interest.
Once these issues have been addressed, you are in a better position to decide on the size of the sample.

Variability

The more variable the population is, the larger the sample required to achieve specific levels of accuracy. However, actual population variability is generally not known in advance; information from a previous survey or a pilot test may be used to give an indication of the variability of the population. When the characteristic being measured is comparatively rare, a larger sample size will be required to ensure that sufficient units having that characteristic are included in the sample.

Population Size

An aspect that affects the sample size required is the population size. When the population size is small, it needs to be considered carefully in determining the sample size, but when the population size is large it has little effect on the sample size. Gains in precision from increasing the sample size are by no means proportional to population size.

Resources and Accuracy

As discussed earlier, the estimates are obtained from a sample rather than a census, so they will generally differ from the true population value. A measure of the accuracy of the estimate is the standard error. A large sample is more likely to have a smaller standard error, and so greater accuracy, than a small sample. When planning a survey, you might wish to minimise the size of the standard error to maximise the accuracy of the estimates. This can be done by choosing as large a sample as resources permit. Alternatively, you might specify the size of the standard error to be achieved and choose a sample size designed to achieve that. In some cases it will cost too much to take the sample size required to achieve a certain level of accuracy. Decisions then need to be made on whether to relax the accuracy levels, reduce data requirements, increase the budget or reduce the cost of other areas in the survey process.

Level of Detail Required

If we divide the population into subgroups (strata) and we are choosing a sample from each of these strata then a sufficient sample size is required in each of the subgroups to ensure reliable estimates at this level. The overall sample size would be equal to the sum of the sample sizes for the subgroups. A good approach is to draw a blank table that shows all characteristics to be cross-classified. The more cells there are in the table, the larger the sample size needed to ensure reliable estimates.

Likely Level of Non-response

Non-response can cause problems for the researcher in two ways. First, the higher the non-response, the larger the standard errors will be for a fixed initial sample size. This can be compensated for by assigning a larger sample size based on an expected response rate, or by using quota sampling. Second, the characteristics of non-respondents may differ markedly from those of respondents. In that case the survey results will still be biased even with an increase in sample size (i.e. increasing the sample size will have no effect on the non-response bias). The lower the response rate, the less representative the final sample will be of the total population, and the bigger the bias of sample estimates. Non-response bias can sometimes be reduced by post-stratification as well as through intensive follow-up of non-respondents, particularly in strata with poor response rates.
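As a rough illustration of assigning a larger sample based on an expected response rate, the sketch below divides the number of completed responses required by the anticipated response rate. The function name and the figures (150 responses, a 75% response rate) are assumptions made for this example only.

```python
import math

def initial_sample_size(required_responses, expected_response_rate):
    """Number of units to approach so that, on average, enough respond."""
    if not 0 < expected_response_rate <= 1:
        raise ValueError("expected_response_rate must be in (0, 1]")
    # Round up: a fraction of a unit cannot be sampled.
    return math.ceil(required_responses / expected_response_rate)

# Hypothetical figures: 150 completed responses needed, 75% expected to respond.
print(initial_sample_size(150, 0.75))  # 200
```

This adjustment only protects the standard error. It does nothing about non-response bias, which depends on how non-respondents differ from respondents.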

Sampling Method

Many surveys involve complex sampling and estimation procedures. An example of this is a multi-stage design. A multi-stage design can often lead to higher variance in resulting estimates than might be achieved by a simple random sample design. If, then, the same degree of precision is desired, it is necessary to inflate the sample size to take into account the fact that simple random sampling is not being used.
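One common way to express this inflation is to multiply the simple random sample size by a factor that reflects how much the design increases variance. In the survey literature this factor is usually called the design effect; the text above does not name it or give a value, so both the term and the figure of 1.5 below are assumptions made for illustration.

```python
import math

def inflate_for_design(n_srs, design_effect):
    """Scale an SRS sample size by an assumed variance inflation factor."""
    if design_effect <= 0:
        raise ValueError("design_effect must be positive")
    # Round up to a whole number of units.
    return math.ceil(n_srs * design_effect)

# Hypothetical: an SRS of 400 would suffice, and the multi-stage design is
# assumed to inflate the variance by a factor of 1.5.
print(inflate_for_design(400, 1.5))  # 600
```

In practice the factor would be estimated from a previous survey or a pilot test with a similar design.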

Relative Importance of the Variables of Interest

Generally, surveys are used to collect a range of data on a number of variables of interest. A sample size that will result in sufficiently precise information for one variable may not result in sufficiently precise information for another variable. It is not normally feasible to select a sample that is large enough to cover all variables to the desired level of precision. In practice, therefore, the relative importance of the variables of interest is considered, priorities are set and the appropriate sample size is determined accordingly.

Calculation of Sample Size

When determining an appropriate sample size, the general rule is that the more variable a population is, the larger the sample required to achieve specific levels of accuracy in survey estimates. However, actual population variability is not known and must be estimated using information from a previous survey or a pilot test. It is worth keeping in mind that gains in the precision of estimates are not directly proportional to increases in sample size: doubling the sample size will not halve the standard error (SE), and generally the sample has to be increased by a factor of 4 to halve the SE. In practice, cost is a major consideration. Many surveys opt to maximise the accuracy of population estimates by choosing as large a sample as resources permit. In complex surveys, where estimates are required for population subgroups, enough units must be sampled from each subgroup to ensure reliable estimates at these levels. To select a sample in this case, you might specify the size of the standard error to be achieved within each subgroup and choose a sample size to produce that level of accuracy. The total sample is then formed by aggregating this sample over the subgroups. Sample size should also take into account the expected level of non-response from surveyed units. When the characteristic being measured is comparatively rare, a larger sample size will be required to ensure that sufficient units having that characteristic are included in the sample.
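The factor-of-4 rule can be checked numerically. The sketch below assumes the usual standard error of a sample proportion under simple random sampling from a large population, sqrt(p(1 - p)/n); the proportion of 0.5 and the sample sizes are arbitrary values chosen for illustration.

```python
import math

def se_proportion(p, n):
    """Standard error of a sample proportion (SRS, large population)."""
    return math.sqrt(p * (1 - p) / n)

p = 0.5  # arbitrary illustrative proportion
for n in (100, 200, 400):
    print(n, round(se_proportion(p, n), 4))
```

Doubling the sample from 100 to 200 reduces the standard error by a factor of about 1.41 (the square root of 2), while quadrupling it to 400 halves the standard error.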

Sample Size Formulae

If a survey is designed to estimate simple proportions without any cross-classifications in a large population (approximately over 10,000 units), the following formula can be used to determine the size of the sample:

$$ n = \frac{p(1 - p)}{[SE(p)]^2} $$

where n = sample size, p = sample proportion, and SE(p) = required standard error of the sample proportion.

However, to be able to use this formula, the proportion being estimated needs to be roughly known from supplementary information or a similar study conducted elsewhere. For example, suppose a survey seeks to estimate the proportion of Richmond residents in favour of Sunday night football at the MCG. The standard error (SE) desired is 0.04, while the proportion (p) in favour of the proposal is thought to be about 0.40. The size of the sample would need to be n = (0.40 × 0.60) / 0.04² = 150. If this survey was then completed with a sample size of n = 150 and it was found that the sample proportion (p) in favour of the proposal was 0.8 (not 0.4 as guessed), then the standard error of this sample proportion would be sqrt(0.8 × 0.2 / 150) ≈ 0.033, not 0.04 as originally planned for.

A proportion of 0.5 gives the highest standard error for a fixed sample size or, equivalently, requires the largest sample size for a fixed standard error; hence p = 0.5 is the worst-case scenario. It is for this reason that an estimate of p = 0.5 is often used when calculating sample sizes when there is no information on the proportion to be estimated.
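The figures in the Richmond example can be reproduced with a short calculation. The sketch assumes the standard large-population relationship n = p(1 - p) / SE(p)², which is consistent with the numbers quoted above; the function names are chosen for this illustration only.

```python
import math

def sample_size_for_proportion(p, se):
    """Sample size needed to estimate a proportion p with standard error se."""
    return p * (1 - p) / se ** 2

def se_of_proportion(p, n):
    """Standard error achieved for proportion p with sample size n."""
    return math.sqrt(p * (1 - p) / n)

# Planning stage: p guessed at 0.40, target SE of 0.04.
print(round(sample_size_for_proportion(0.40, 0.04)))  # 150

# After the survey: observed p = 0.8 with n = 150.
print(round(se_of_proportion(0.80, 150), 3))  # 0.033

# Worst case: p = 0.5 needs the largest sample for the same SE.
print(round(sample_size_for_proportion(0.50, 0.04)))  # 156
```

The last line shows why p = 0.5 is used when nothing is known about the proportion: it gives a slightly larger, and therefore safe, sample size.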

Example: Gains From Sampling

Suppose we wish to take a sample from a population. We have a preliminary estimate of the proportion of the population having the characteristic we are interested in measuring (50%). The level of accuracy we require from our survey is a relative standard error (RSE) of 5%. Using the formulae for the sample size in a finite population, the sample size we would need for various population sizes is:

Population      Sample Size     Sample Fraction (%)
50              44              88.000
100             80              80.000
500             222             44.400
1,000           286             28.600
5,000           370             7.400
10,000          385             3.850
100,000         398             0.400
1,000,000       400             0.040
10,000,000      400             0.004

The gains from employing sampling are greatest when working with large populations.
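The sample sizes in this example can be reproduced with a short calculation. The text does not state the finite-population formula it used, so the sketch below assumes a common form: compute the large-population size n0 = p(1 - p) / SE², with SE = RSE × p, then apply the correction n = n0 / (1 + n0 / N). Under that assumption the rounded results agree with the sample sizes listed above.

```python
def finite_population_sample_size(N, p=0.5, rse=0.05):
    """Sample size for a proportion in a population of size N.

    Assumes SE = rse * p and the correction n = n0 / (1 + n0 / N),
    which is one common form and is not confirmed by the source text.
    """
    se = rse * p                 # absolute standard error implied by the RSE
    n0 = p * (1 - p) / se ** 2   # large-population sample size (400 here)
    return n0 / (1 + n0 / N)     # shrink for a finite population

populations = (50, 100, 500, 1_000, 5_000, 10_000,
               100_000, 1_000_000, 10_000_000)
for N in populations:
    n = round(finite_population_sample_size(N))
    # Population, sample size, sample fraction as a percentage.
    print(f"{N:>10,} {n:>5} {100 * n / N:>8.3f}")
```

The required sample size levels off at about 400 however large the population becomes, which is why the sample fraction falls so sharply for large populations.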
