Which of the following represents the correct order of a statistical analysis?

Most research projects can be classified into one of two basic categories: observational studies or designed experiments. In an experiment, researchers control (to some extent) the conditions under which measurements are made. In an observational study, researchers simply observe what happens, without controlling the conditions under which measurements are made. Both types of study follow the five steps of the Statistical Process.

Índice Show

Designed Experiments
Observational Studies
Sampling Methods
Types of Variables
What are the 5 basic methods of statistical analysis?
What are the 4 basic elements of statistics?
What is the first step in any statistical analysis?
How to write a statistical analysis?

Designed Experiments

In a designed experiment, researchers manipulate the conditions that the participants experience. They often do this by randomly assigning subjects to one of two groups, a “treatment” group and a “control” group. The experiment is conducted by applying some kind of treatment to the subjects in the treatment group and observing the effect of the treatment. Those in the control group do not receive the treatment and are also observed. In this way researchers can determine the effects of the treatment. The following example illustrates the use of these two groups.

Jonas Salk’s First Polio Vaccine Trial

Beginning around 1916 and through the 1950s, a mysterious plague attacked infants and children. Symptoms included excruciating muscle pain and a stiff neck. This illness, which became known as poliomyelitis or simply “polio,” left children disfigured, paralyzed, and sometimes even dead.

While working as a researcher at the University of Pittsburgh School of Medicine, Dr. Jonas E. Salk developed a vaccine that might help prevent the spread of this disease. He conducted what has become one of the most famous designed experiments in history.

This short video below provides a compelling summary of the famous Jonas Salk vaccine experiment. As you watch, notice each of the 5 steps of a statistical study in this study.

As explained in the video, in the first Salk trial almost 1.1 million children participated in the study. Even though the sample size was large, flaws in the study design rendered the results useless.

Undaunted, Dr. Salk fixed the problems with the design and enrolled hundreds of thousands of additional children for the second phase of his study. In all, over 1.8 million infants and children participated in this experiment, making it the largest drug trial to date.

Step 1: Design the study.

The participants in a study are commonly called subjects. Sometimes subjects are called experimental units or simply units. In the Salk trials, the children who participated were the subjects.

Subjects (the children) were randomly assigned to one of two groups. The first group was given the experimental vaccine, the treatment. The treatment is the new or experimental condition that is imposed on the subjects. The subjects who receive the treatment make up the treatment group.

The second group was given a control or placebo. In this study, the control was an injection that looked just like the vaccine, but contained a harmless saline solution. The control group or placebo group is made up of the subjects assigned to receive the control.

This study was double blind. Neither the children’s parents nor their doctors knew whether a particular child received the treatment or the control. Both parties were blinded to this information.

Because the children were assigned to the groups randomly, the two groups should be similar. If the vaccine is not effective, the number of future cases of polio should be about the same in each group. However, if Salk’s vaccine helped to prevent the spread of polio, then fewer cases should occur in the vaccinated group.

Answer the following questions:

Some children can be identified as having a higher risk of developing polio. Would it have been better if they were assigned to the treatment group so they could get the vaccine?

Show/Hide Solution

No. The two groups need to be as similar as possible. Specifically, the people in the treatment group need to have the same potential (on average) of contracting polio as the people in the control group. If we put the people who are at a higher risk of developing polio in the treatment group, we run the risk of having more people in the treatment group getting polio simply because they are more likely to get it, whether they are vaccinated or not. Likewise, we might have fewer people in the control group getting polio just because they are less likely to get it, whether they are vaccinated or not.
These two effects would create a bias against the vaccine, by making the vaccine look like it doesn’t work, or doesn’t work as well as it does. It might also make it appear that people who aren’t vaccinated stay healthy and the vaccine is not needed. There is even a chance that people will conclude that the vaccine actually gives people polio.
Randomly assigning subjects to the two groups tends to yield groups with similar characteristics—in this example, similar potential for contracting polio. Randomly assigning subjects to groups therefore defends us against problems like those mentioned in the previous paragraph.

Why is it important for the subject and those who assess the health of the subject to be unaware of whether or not that child received the vaccine?

Show/Hide Solution

Subjects: Suppose a subject in the study thinks they’re being treated. It has been documented that subjects with such knowledge tend to show improvement whether they are receiving the treatment or not. To see why, consider how you might feel and act if you were told you had been vaccinated. You might have a more hopeful outlook, leading to healthier living habits such as better hygiene and nutrition. Such changes would tend to reduce your chance of contracting polio whether you’ve received the vaccine or not. This might make the vaccine look like it works better than it does. It also might make the vaccine look like it works, even if it doesn’t.
Now suppose subjects in the control group know they are not being treated. This can also change the way they feel and act, in ways that can make them more likely to contract polio than they would be if they weren’t in the study. This could make it look like the incidence of polio among unvaccinated persons is higher than it is, again making the vaccine look like it works better than it does.
To reduce bias caused by such errors, subjects should not know to which group they are assigned.
Researchers: Suppose a researcher assessing the health of a subject is told that the subject is in the control group. It has been documented that in such a case, the researcher is more likely to record that the subject has symptoms even if the subject is not actually in the control group. This makes it look like unvaccinated persons are more likely to get polio than they really are, which makes it look like the vaccine works better than it does.
There are other effects of knowing to which group the subject belongs, such as doctors treating or advising the patient differently than they would without such knowledge. Such differences can make it harder to tell whether the vaccine works, and how well.
To reduce bias caused by such effects, those assessing the health of the subjects should not be told to which group the subject belongs.

Step 2: Collect data.

The researchers followed up with each child to determine if they contracted polio. They recorded the number of children in each group that developed polio during the study period. Not all of Salk’s experiments were double-blind. Here is a summary of the results from the regions where a double-blind study was conducted (Francis et al., 1955; Brownlee, 1955):

Children Who Developed Polio

Treatment Group

200,688

200,745

Placebo Group

142

201,087

201,229

Step 3: Describe the data.

One way to summarize the data is to compute the proportion of children in each group that developed polio. The proportion of children in the treatment group that developed polio during the study period is:

\[ \frac{57}{200745} = 0.000~283~9 \]

Answer the following questions:

Calculate the proportion of children in the placebo group that developed polio during the study period.

Show/Hide Solution\[ \displaystyle{\frac{142}{201229} = 0.000~705~7} \]

Compare the two proportions. What do you observe?

Show/Hide Solution

The proportion of children in the placebo group that develop polio during the study period was more than double the proportion of children in the treatment group that developed polio during the study period. That suggests that the treatment is effective in reducing the proportion of children that will develop polio.

Step 4: Make inferences

Careful statistical analysis of the records suggested that this difference was so great that it was attributable to the vaccine and not to chance. Assuming that the vaccine had no effect, the probability that the difference in the proportions between the two groups would be at least as extreme as the difference Dr. Salk observed was very low: 0.00000000093. Because this probability is so small, it is highly unlikely that these results are due to chance.

Step 5: Take action

Once it was clear that the vaccine was effective, children who were unvaccinated or had received the placebo were given Salk’s vaccine. Since 1954, there has been a marked decrease in the number of polio cases worldwide (Offit, 2005). Public health researchers are striving to eradicate this disease entirely.

Observational Studies

In an observational study researchers observe the responses of the individuals, without controlling the conditions experienced by the individuals. Therefore, they do not assign the participants to treatment or control groups.

Observational studies commonly occur in business settings. One example is a financial audit. The purpose of a financial audit is to assess the accuracy of a company’s financial business practices. ImmunAvance Ltd., a non-government health care organization, hired the Accounting Office at Global Optimization Unlimited to perform an independent audit of their financial practices. ImmunAvance provides inoculation and other preventative health care services in rural African communities.

Step 1: Design the study

The volume of financial transactions conducted by ImmunAvance makes it impossible to conduct a census or an examination of the entire collection of ImmunAvance’s financial documents. Instead, you will collect a manageable group of items (called the sample) from the entire collection of financial documents (called the population.) A sample is a subset or a portion of a population. The information gained from the sample is used to make an inference (or generalization) about the population.

Auditors typically cannot consider every item in a population, because there are too many. When it is not possible to conduct a census, auditors face sampling risk. Sampling risk is the risk affiliated with not auditing every item in the population. It is the risk that the sample may not adequately reflect the population. The only way to eliminate sampling risk is to conduct a census, which is usually not practical. Auditors can reduce sampling risk by obtaining a sample randomly. This is called random selection. Another way to reduce sampling risk is to increase the sample size, the number of items sampled.

Sampling Methods

Step 2: Collect data

There are several procedures that can be used to select a random sample from a population, including: simple random sampling (SRS), stratified sampling, systematic sampling, cluster sampling, , and convenience sampling (or, haphazard sampling). These are examples of sampling methods.

Random Sampling Methods

A simple random sample (SRS) is the best method for obtaining a sample from a population. This method allows each possible sample of a certain size an equal chance at being selected as the chosen sample. A difficulty of this method is that a list of all of the items in the population must be accessible before the sample is taken. Often, we obtain a SRS by allowing a computer to randomly select a certain number of items from the full list of the population. It is akin to the idea of putting all of the names into a hat, shaking them up, and randomly drawing out a few.

For example, suppose there are 18,000 students in the population of a certain university. School officials can use a computer to randomly choose values between 1 and 18,000 to identify which students are to be selected to complete a survey. In Excel, the command to obtain a random number between 1 and 18,000 is =RANDBETWEEN(1,18000). A simple random sample can be obtained any time there is a complete list of the items to be sampled and they are all accessible. All the statistical procedures in this course assume that simple random sampling has been used. But in practice, the SRS is often difficult (or impossible) to implement.

A stratified sample is when the items to be sampled are organized in groups of homogeneous (similar) items called strata, then a simple random sample is drawn from each of these strata. Stratified sampling works well when the items are similar within each stratum and tend to differ from one stratum to another. We often use stratified sampling in order to obtain a sample in such a way that we can make comparisons between each of the groups (or strata).

For example, in obtaining a sample of students from a university, school officials could define the strata as: (1) freshman, (2) sophomores, (3) juniors, and (4) seniors. A simple random sample could then be obtained from each of these strata. This would ensure that each class rank of students was represented in the sample. It would also allow the school officials to see how freshman, sophomore, junior, and senior level students compared in their answers to a survey.

A systematic sample is where every $k^{\text{th}}$ item in the population is selected to be part of the sample, beginning at a random starting point. Systematic sampling works well when the items are in a random, but sequential ordering. If the items are not arranged randomly, a systematic sample can miss important parts of the population.

For example, consider a fast food company where every 10th customer is given the opportunity to compete a satisfaction survey in exchange for a small discount coupon towards their next purchase. An airport security line also often implements a procedure where every 100th (or so) person is selected for a more “in depth” security examination. Similarly, factories that use assembly lines will pull say every 500th item from the assembly line to perform a quality control check on the item.

A cluster sample (sometimes called a block sample) consists of taking all items in one or more randomly selected clusters, or blocks. When the variation from one block to another is relatively low, compared to the variation within the block, cluster sampling is a reasonable way to get a sample.

For example, ecologists could draw grids on a map of a forest to create small sampling regions, or sampling clusters. Then, by randomly selecting one or two of these clusters from the map, the ecologists could go to the areas marked on the map and document information on the health of every tree they find in those clusters. This is a practical way to get a sample in this case because the ecologists only have to go to a few areas of the forest, but are still able to obtain a random sample of all of the trees in the forest. It is also worth noting that the ecologists would not be interesting in comparing the health of the trees from the selected clusters to each other like they would in a stratified sample. Instead, they are just looking for a feasible way to obtain a single random sample of all of the trees in the forest, but want to keep their traveling time to a minimum while collecting their sample. In contrast, to obtain a simple random sample of trees from the same forest, the ecologists would first have to go out and number every tree in the entire forest. Then they would need to use a computer to randomly pick which trees to collect data on. Finally, they would then have to go back to the forest and collect data on the selected trees from across the entire forest. Such an approach just isn’t feasible in practice, so we are willing to settle instead for the cluster sample.

A convenience sample involves selecting items that are relatively easy to obtain and does not use random selection to choose the sample. This method of sampling can be assumed to always bring bias into the sample.

As an example of a convenience sample, an auditor could haphazardly select items from a filing cabinet. This is frequently done when a quick and simple sample is needed, but may not yield a sample that represents the population well. When possible, convenience samples should be avoided.

Types of Variables

Whenever we collect data, we record information about the things we are studying. There are two basic types of data that can be recorded: quantitative measurements and categorical labels. We will call these types of data simply “quantitative” or “categorical” variables. We use the word “variable” to denote the idea that the quantitative measurements or categorical labels can vary from person to person, or item to item, in our study.

Quantitative variables provide measurement information on each individual (or item) in our study. They represent things that are numeric in nature; things that are measured. They often include units of measurement along with the quantitative value of the measurement. For example, the heights of children measured in inches (or centimeters), or their weight measured in pounds (or kilograms). For a quantitative variable, it makes sense to apply arithmetic operations to the data (such as adding values together, computing the average of the values, or comparing two values). If one child weighs 30 pounds (13.61 kg) and a second child weights 60 pounds (27.22) then the second child is twice as heavy as the first.

Categorical variables allow us to place each individual (or item) into to a specific category. Categorical variables are labels, and it does not make sense to do arithmetic with them. For example the gender of a newborn child, the ethnicity of an individual, a person’s job title, the brand of phone they own, or the area code of a telephone number, etc are all categorical variables. Notice that although a telephone number consists of numbers, it is not a quantitative measurement. It does not make sense to double someone’s phone number, to average phone numbers together, or to say one phone number is half the size of another. But the area code of the phone number gives information about the region where the phone number was first initiated, which is categorical information.

In Unit 3 of this course we will learn more about categorical variables and proportions. Units 1 and 2 of this course focus on studying quantitative variables.

Returning to the sample accounts receivable record, we find this data to have information on both types of variables.

Answer the following question:

For each of the following variables taken from this accounts receivable record, indicate whether the variable is quantitative or categorical.

Show/Hide Solution

The variable “Terms” is categorical. It classifies the invoice by the terms of payment for that invoice.

Show/Hide Solution

The variable “Account number” is categorical. Even though the account number is given a number, it is actually functioning as a label. It is not something that is counted or measured. It does not make sense to do arithmetic operations (like adding 1 or multiplying by 2) to the account number.

Show/Hide Solution

The variable “Invoice amount” is quantitative. It makes sense to do arithmetic operations to this value. For example, the amount of Invoice 5745 (which is $990.00) is somewhat more than twice as much as that of Invoice 2378 (which is $478.00).

Step 3: Describe the data

After auditors collect a sample and compile the data, they review the evidence. Auditors may use graphs or compute numbers (such as the average) to summarize the evidence they found.

What are the 5 basic methods of statistical analysis?

The five basic methods are mean, standard deviation, regression, hypothesis testing, and sample size determination.

What are the 4 basic elements of statistics?

Sample size, variables required, numerical summary tools, and conclusions are the four elements of a descriptive statistics problem.

What is the first step in any statistical analysis?

Step 1: Write your hypotheses and plan your research design. To collect valid data for statistical analysis, you first need to specify your hypotheses and plan out your research design.