Description of the Scientific Process: Analyzing Your Results
Analyzing Data
In a way, data analysis begins with experimental design. Assuming all went well during the experiment, your analysis will start with that design. Before you run your statistical analysis[1], it is helpful to lay your data out and examine what it seems to show. However, your data will generally be a bunch of numbers, and it’s hard to make anything out of that sort of thing. Depending on what you have, here are some things that you can do.
- Make a scatter plot. A scatter plot is a simple graph with each point positioned by two variables on a flat surface (Figure 1). Here’s what you can get out of this:
- You can spot outliers. Outliers are individuals that do not follow some common trend in your data. Very often (but certainly not always) outliers show that some individual underwent an experimental bias. Maybe that was the snake that you fed twice instead of once. Consult your notes and your common sense. We will show you what to do about outliers a little later.
- You can get a real sense of the general trends. You might see where things bunch up or how increasing one thing decreases another. You can find crucial insight here.
Figure 1: A scatter plot.
- Make a bar (or column) chart. It’s a simple way to visualize the magnitude of your data (Figure 2).
- You can add error bars indicating the confidence interval of your data as well.
Figure 2: A column chart with error bars.
- Make a pie chart. Pie charts are great ways to see proportions of data (Figure 3).
Figure 3: A simple pie chart.
- Make a line graph. These are especially useful for making changes over time visible (Figure 4).
Figure 4: A line graph with two variables and error bars.
- Logarithms can make trends over orders of magnitude look like linear trends. The most common approaches are to use log base 10 or the natural log of a variable.
- If a histogram of your data looks skewed to the right (i.e., there is a pile of values squished up on the left), try taking the square root of your values.
Sometimes, it’s only when you are analyzing your data that you realize that some of it might be bad. You took good notes, and you were consistent, but still, something doesn’t look right about individual 23A. One of the most common sources of bad data points is a zero that isn’t really a zero. Zero values often represent missing information. If a spider’s height is 0 mm, then it’s not a spider. (Maybe you squished it. No one would blame you.) Don’t represent missing values with 0’s; the convention is to use “∙” or “‒”. In a spreadsheet, it’s just an empty cell.
Be careful to know your instruments’ limitations. You can measure something with an instrument but then exceed the instrument’s measurement range. Think of a thermometer. There is some maximum and minimum temperature that it can measure. If the true temperature exceeded the thermometer’s maximum of 125 °F, then you could only record 125 °F, and that value would be wrong. All sorts of other instruments have their own limits, so be sure that your measurements fall within them.
Finally, you may have outlier individuals. Perhaps you wrote down the wrong value when you measured one, or one of your reagents wasn’t thoroughly mixed and that sample got far more salt than the others did. The list can go on and on. If you think that you have an outlier, then you should consider scrapping it (or remeasuring it, if that is possible). If you can explain why a data point is not representative of the experimental conditions you set out to measure, kick the data point out.
Sometimes, data appears skewed, only because it needs to be transformed. You can transform data by performing some mathematical trick on all of the data points. The goal of your transformation is to fit the transformed data into a normal distribution.
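If you would rather experiment with transformations outside of a spreadsheet, here is a minimal sketch in Python using NumPy; the measurement values below are made up purely for illustration.

```python
import numpy as np

# Hypothetical right-skewed measurements (most values small, a few very large)
values = np.array([1.2, 1.5, 2.1, 2.3, 3.8, 7.9, 15.4, 44.0])

log10_values = np.log10(values)  # spreads out values that span orders of magnitude
sqrt_values = np.sqrt(values)    # a gentler fix for right-skewed data

print(log10_values)
print(sqrt_values)
```

A histogram of the transformed values should look closer to a bell shape than a histogram of the originals.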
Statistics
Table 1: Parameters and estimates.

Variable or Estimate | Name
$\bar{x}$ | Sample Mean
$\mu$ | Population Mean
$n$ | Sample Size
$a_i$ | Value of Individual $i$
$s^2$ | Sample Variance
$s$ | Standard Deviation
$\sigma^2$ | Population Variance
$\sigma_{\bar{x}}$ | Standard Error
$p$ | Probability of Success
$\hat{p}$ | Sample Proportion
$\hat{p}(1-\hat{p})/n$ | Estimate of Variance of Binomial Data
$\sqrt{\hat{p}(1-\hat{p})/n}$ | Estimate of the Standard Error of Binomial Data
$E$ | Margin of Error
$t$ | t-Test Statistic
$z$ | z-Test Statistic
$r$ or $r^2$ | Pearson Correlation Coefficient
Many students will cringe at hearing the word “statistics,” especially when they are being told to engage in it. Honestly, though, knowing how to use statistics is an amazingly useful skill. Even if you never do a science experiment again, understanding what statistics can do will help you be a better citizen. You will better understand opinion polling or the news of the latest study about a vaccine. Depending on your own talents and background, statistics can be hard, but you can overcome that. Or, you might find it incredibly easy, and you don’t need the pep talk. Either way, let’s talk about what statistics is.
The Oxford English Dictionary defines statistics as, “The systematic collection and arrangement of numerical facts or data of any kind; (also) the branch of science or mathematics concerned with the analysis and interpretation of numerical data and appropriate ways of gathering such data.” That’s a pretty good definition[2] for our purposes. Statistics, as we are using it here, has to do with how we arrange and interpret the raw data collected from scientific research.
Let us say that you want to know how tall oak trees are. You could tell me that the oak tree outside of your window is 12.3598 meters high. I’ll believe you (well, maybe not to the tenth of a millimeter, but it could be true). You cannot use that information to then say that all oak trees are 12.3598 meters high. That same tree wasn’t that tall last year. Then, you tell us that you cored the tree and counted the rings. It is 34 years old, you say. Therefore, all 34-year-old oak trees are 12.3598 meters high. Again, absurd. Trees, like people, vary in height according to many factors, and we should account for those. Besides, can there possibly be that many significant digits[3]?
This statistics guide is not intended to be about the theory of statistics or even remotely substitute for a statistics class. Instead, we hope to arm you with some tools and tests that you can use to interpret your results. Hopefully, as you read this, you will find an appropriate tool to use.
We should instead consider all of the 34-year-old oaks in the area, right? Well, that gets closer, but when we are talking about these things, we need to consider things that we can compare. The word “oak” refers to all of the species in the genus Quercus, of which there are hundreds, some of which are shrubs and not trees. We’ll probably want to narrow our query to one species. Also, a tree growing in a swamp will do better or worse than one growing on a hill, depending on its preferences for water. We’ll need to account for that. Shade, too. The list goes on. We will need to figure out how tall oaks of those types are. Still, even if we accounted for all of those variables, we would still find that the oaks came in different sizes. What we need is an average!
Means and Variation
There are different sorts of average. The main ones we will refer to here are the mean, the median, and the mode. The mean (technically, the arithmetic mean, as there are other sorts of means) is the sum of all of the values of interest, divided by the number of things you measured. Here it is mathematically:
$$\bar{x} = \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} a_i$$

Where $\bar{x}$ is the mean of your sample, $\hat{\mu}$ is the approximate mean, $n$ is the sample size, and $a_i$ is the value measured for individual $i$. Got that? Yeah, we’re going to have to talk about that little mess. $\bar{x}$ (“x-bar”) is the mean of your sample, the best estimate of the true mean, $\mu$ (“mu”), which we call $\hat{\mu}$ (“mu-hat,” as we’d say; it’s the Greek letter mu with a hat on it). It’s approximate, because we don’t have every single example in nature of whatever it is we’re measuring. We’re sure you get the $1/n$ bit, as that’s just one divided by $n$. The weird E-looking thing is the capital Greek letter sigma ($\Sigma$), which we use to notate a sum. The $a_i$ bit is what is being totaled. We measure value (let’s stick with height for our example) $a$ for individual 1, and we call that $a_1$. The $i = 1$ under the sigma tells us to start at 1. For individual 2, the height is $a_2$, and we continue until we get to the last one measured, height value $a_n$. You already knew how to do a mean, though, so why did we make it more complicated? We’ll be using similar notation going forward, and you may see similar things in books and scientific papers. It’s good to be caught up with the basics.
Let’s do an example. We’ve gone out and measured 10 34-year-old Quercus alba (white oak) trees that were growing in identical conditions. Their heights were:
Table 2: Oak tree heights.
Tree ID | Height (meters) | Tree ID | Height (meters)
1 | 12.1 | 6 | 12.1
2 | 12.4 | 7 | 11.9
3 | 10.9 | 8 | 11.2
4 | 11.8 | 9 | 11.8
5 | 12.0 | 10 | 12.1
To create the average, we take the equation:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} a_i$$

and we substitute $n = 10$ and all of the heights to get:

$$\bar{x} = \frac{1}{10}(12.1 + 12.4 + 10.9 + 11.8 + 12.0 + 12.1 + 11.9 + 11.2 + 11.8 + 12.1) = \frac{118.3}{10} = 11.83 \text{ m}$$
The mean is most meaningful when your data has a normal distribution (Figure 5). That is, most of the values are close to the mean, and less frequent values are much larger or smaller than the mean. We usually assume that data has a normal distribution, but that might not be the case. If you are using Excel, the Average function will calculate the mean of a range of cells.
Figure 5: Data in a normal distribution. The x-axis represents the values for the data, and the y-axis represents the frequency of the data.
The next most frequently encountered kind of average is the median. If you arrange all of your data from the smallest value to the largest value, and then you pick the middle value, you have a median. If you have an even number of values, then calculate the mean of the two middle ones. It’s pretty easy to determine, and spreadsheets will readily generate your median for you (the Median function in Excel is very straight-forward.)
Figure 6: Annual income distribution in 2008, as estimated in 2012. This is not a normal distribution (though it is very typical for an income distribution). Note that the final 2 columns are relatively large, because they contain wider ranges of incomes.
Why would you want a median? The classical example is income. In 2004, the United States Census Bureau calculated the mean American household income as $60,528. However, the median income was $44,389. Why are these numbers so different? Income does not follow a normal distribution (Figure 6[4]). Both the mean and the median are valid ways to describe a normal or average individual, but their meaning depends on context.
The final way that we will consider the average is the mode. The mode is simply the most common number that comes up in your sample. If you consider the white oak height data that we made up, you will see that the number 12.1 comes up 3 times. That is our mode. If there is a tie for most common number, we have two modes. If there isn’t a number that comes up most frequently, then there is no mode. We can round our numbers to get a new mode, but you will definitely want to report doing so in a paper. However, few papers use modes, so you will need to justify why you think it is interesting.
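If you prefer a programming language to Excel, here is a minimal sketch using Python’s built-in statistics module and the made-up oak heights from Table 2.

```python
import statistics

# Oak tree heights in meters (Table 2)
heights = [12.1, 12.4, 10.9, 11.8, 12.0, 12.1, 11.9, 11.2, 11.8, 12.1]

print(round(statistics.mean(heights), 2))    # arithmetic mean: 11.83
print(round(statistics.median(heights), 2))  # middle value of the sorted data: 11.95
print(statistics.mode(heights))              # most common value: 12.1
```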
Just because you have an average, that doesn’t mean that you have described your data completely. Even ordinary individuals will be different from the mean. They vary. There are several related measures of this phenomenon: variance, standard error, and standard deviation. Let’s talk about sample variance[5] first. You estimate the variance of your sample this way when you have a normal distribution:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(a_i - \bar{x})^2$$

Where $s^2$ is the sample variance, $n$ is the sample size, $a_i$ is the value of individual $i$, and $\bar{x}$ is the sample mean. Once again, we have a summation, but this time, we will be doing something a little more involved. Each part of the sum will involve subtracting the sample mean from the value of individual $i$ and squaring the result. For our oak experiment (Table 2), it will look like this:

$$s^2 = \frac{(12.1 - 11.83)^2 + (12.4 - 11.83)^2 + \dots + (12.1 - 11.83)^2}{10 - 1} \approx 0.20 \text{ m}^2$$
Sample variance is not necessarily useful when we analyze data, except that it is used in other calculations. It has units, but those units are square units relative to the mean, so if you are measuring the length of crystals in an evaporation experiment, your lengths could be in millimeters (mm), but variance would be in square millimeters (mm2), even though it is not a reflection of area. A small variance indicates that the individual values in the sample are very similar to the mean, whereas a large value indicates that they are more widely distributed (Figure 7).
Figure 7: Smaller variances (like the tall red curve in the center) indicate that the individuals have more similar values. Larger variances (like the flat blue curve that spans the graph) indicate that individuals have less similar values. From Wikimedia Commons user MarkSweep.
If variance isn’t immediately useful, how do we use it? The first thing that we can do is calculate the standard deviation. The equation for that is:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(a_i - \bar{x})^2}$$

Whoa! That’s even more complicated. Yes, but you could also say that standard deviation is just:

$$s = \sqrt{s^2}$$

That’s easy. Since standard deviation ($s$) is in the same units as your sample measurements and mean, you can use it to help you analyze your data.
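The same Python module will also compute the sample variance and standard deviation (dividing by $n - 1$, as in the equations above); a quick sketch with the oak heights:

```python
import statistics

heights = [12.1, 12.4, 10.9, 11.8, 12.0, 12.1, 11.9, 11.2, 11.8, 12.1]

s2 = statistics.variance(heights)  # sample variance, divides by n - 1
s = statistics.stdev(heights)      # sample standard deviation, the square root of s2

print(round(s2, 3))  # about 0.205 square meters
print(round(s, 3))   # about 0.452 meters
```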
Not all data are recorded as continuous values like the ones we have been using; let’s consider binomial data. Binomial data values are typically represented as 1’s and 0’s, and the probability of success of a sample or population is a value from 0 to 1. The most common binomial information you see is in the form of percentages, as in public opinion polls: 47% of people approve of the President’s plan to increase kumquat consumption. Any time you say that X% of something is Y, you are using binomial data. On your record sheets, you have recorded 0 or 1, dead or alive, in favor or opposed, brown or not brown, etc.
The variables for binomial data are slightly different from those used for continuous data. With binomial data, the theoretical mean[6] is:

$$\mu = np$$

Where $p$ is the “probability of success,” a term referring to the fraction of the time a condition is true. For coin flips, the probability of success for heads is 0.5, since half the time, heads comes up. If we flip a coin 100 ($n = 100$) times, then we expect the mean to be 50. However, we can estimate $p$ with the sample proportion:

$$\hat{p} = \frac{X}{n}$$

Where $X$ is the number of successes. It is the closest equivalent of an approximation of the population mean $\mu$. If we actually flipped the coin 100 times, we might get 52 heads by chance. In that case, $\hat{p} = 52/100 = 0.52$. In this case, we are estimating the probability of success $p$ by measuring the sample proportion ($\hat{p}$).
The variance for the binomial distribution is:

$$\sigma^2 = np(1 - p)$$

The standard deviation is thus:

$$\sigma = \sqrt{np(1 - p)}$$

The variance of your sample proportion ($\hat{p}$) can be computed with the equation:

$$s_{\hat{p}}^2 = \frac{\hat{p}(1 - \hat{p})}{n}$$

and the standard error of the sample proportion is its square root:

$$SE_{\hat{p}} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
Keep in mind that these figures are based on the sample proportion and not the number of successes. That is, they are based on the fraction and not the count. Notice that the larger your sample size ($n$), the smaller the variance of the sample proportion and the standard error of the sample proportion.
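As a small worked sketch of these binomial formulas in Python, using the 52-heads coin-flip example from above:

```python
import math

successes = 52  # e.g., 52 heads
n = 100         # out of 100 coin flips

p_hat = successes / n                # sample proportion
var_p_hat = p_hat * (1 - p_hat) / n  # variance of the sample proportion
se_p_hat = math.sqrt(var_p_hat)      # standard error of the sample proportion

print(p_hat)               # 0.52
print(round(se_p_hat, 3))  # about 0.05
```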
Confidence Intervals
When you calculate means and variances, you describe important properties of your data, but that does not mean that you can make any useful claims about it. You have often heard the phrase, “give or take,” as in, “It’s about 50 pounds, give or take.” That phrase can describe an approximate mean with wiggle room for uncertainty. When we use statistics, we can assign meaning to “give or take” with confidence intervals. These are minima and maxima[7] of estimated means. That word confidence is very important. We can only be some percent confident that our real mean or probability of success falls between two numbers. Usually, we use the cutoff of 95% confidence, because in order to be 100% confident, we would have to settle for, “It’s a number.”
Let’s start with continuous data. You will need to calculate the standard error ($\sigma_{\bar{x}}$, the Greek letter sigma with an x-bar subscript):

$$\sigma_{\bar{x}} = \frac{s}{\sqrt{n}}$$
If your data has a normal distribution, then 68% of all of the individuals will have values between $\bar{x} - s$ and $\bar{x} + s$, or approximately $\bar{x} \pm s$. Likewise, 95% of the individuals will have values between $\bar{x} - 1.96s$ and $\bar{x} + 1.96s$, or approximately $\bar{x} \pm 2s$ (Figure 8). The mean itself has the same sort of spread, only measured with the standard error: you can say with 95% confidence that the true mean lies within about two standard errors of the sample mean. Thus, the 95% confidence interval is $\bar{x} \pm 1.96\sigma_{\bar{x}}$. Now, consider two averages, $\bar{x}_1$ and $\bar{x}_2$. Can we say that they are different from each other? Using confidence intervals, we can check their lower and upper bounds. Let’s say that $\bar{x}_1 < \bar{x}_2$. If $\bar{x}_1 + 1.96\sigma_{\bar{x}_1} > \bar{x}_2 - 1.96\sigma_{\bar{x}_2}$, then their confidence intervals overlap, and we cannot conclude that they are different with 95% confidence.
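Here is a minimal sketch of that calculation in Python, again using the oak heights from Table 2 to build a 95% confidence interval for the mean:

```python
import math
import statistics

heights = [12.1, 12.4, 10.9, 11.8, 12.0, 12.1, 11.9, 11.2, 11.8, 12.1]

mean = statistics.mean(heights)
se = statistics.stdev(heights) / math.sqrt(len(heights))  # standard error of the mean

lower = mean - 1.96 * se
upper = mean + 1.96 * se

print(round(se, 3))                      # about 0.143
print(round(lower, 2), round(upper, 2))  # roughly 11.55 to 12.11 meters
```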
Figure 8: Standard deviation around a mean in a normal distribution. From Wikimedia Commons user Ainali.
Error bars in graphs are very important applications of confidence intervals. You can use them to visually display how confident you are of your data, and your audience will be able to tell just how different your various means are from each other. You can use these in bar charts or line charts (Figure 2 and Figure 4, for example.)
When we use binomial data, the confidence interval is based on the margin of error:

$$E = 1.96\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

We are using the number 1.96 for what is called a z-value, a concept that is beyond the scope of this guide[8]. Once you have the margin of error, you can say that you are 95% confident that the population proportion is between $\hat{p} - E$ and $\hat{p} + E$ (or, $\hat{p} \pm E$). If you have a polling sample of 100 of your peers, and 45 of them said that they preferred country music over rhythm and blues, then you could state with 95% confidence that the actual proportion of like-minded youth who like the twang is between

$$0.45 - 1.96\sqrt{\frac{0.45(1 - 0.45)}{100}} \approx 0.35$$

and

$$0.45 + 1.96\sqrt{\frac{0.45(1 - 0.45)}{100}} \approx 0.55$$

That is a big range, but if you increase your sample size to 1000, then it shrinks to 0.42 and 0.48. You can report that as “$0.45 \pm 0.03$”.[9]
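Here is a small sketch of the margin-of-error calculation in Python, using the country-music poll numbers from above (the helper name margin_of_error is just an illustrative choice):

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """95% margin of error for a sample proportion (z = 1.96)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# 45 of 100 peers prefer country music, so p-hat = 0.45
e_100 = margin_of_error(0.45, 100)
e_1000 = margin_of_error(0.45, 1000)

print(round(0.45 - e_100, 2), round(0.45 + e_100, 2))    # about 0.35 to 0.55
print(round(0.45 - e_1000, 2), round(0.45 + e_1000, 2))  # about 0.42 to 0.48
```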
Hypothesis Tests
In statistics, the term hypothesis has a somewhat different definition than it does in the sciences, but it is rather similar. As in science, a statistical hypothesis is a testable prediction, but it is also a mathematical expression, and we can perform a hypothesis test to help analyze our data. A statistical hypothesis might be something like, “The average length of Amythas spp. (common Asian earthworm) found in Oklahoma County is a different size than the length of common Asian earthworms collected in Payne County;” “The average specific gravity of selenite crystals from the Great Salt Plains State Park is the accepted specific gravity for selenite minerals: 2.3;” or, “The average falling time for these bowling balls is the same.” In further analysis, we will talk about those being null hypotheses. The shorthand for the null hypothesis is $H_0$. In opposition to the null hypothesis, we also have the alternative hypothesis ($H_1$ or $H_a$). While the null hypothesis is always $\mu = \mu_0$ (or “mean length of earthworms in Oklahoma County = mean length of earthworms in Payne County,” “specific gravity = 2.3,” or, “bowling ball 1 falling time = bowling ball 2 falling time,” from the examples above), the alternative hypothesis can be:

$$\mu \neq \mu_0 \quad\text{or}\quad \mu > \mu_0 \quad\text{or}\quad \mu < \mu_0$$
To test these hypotheses for continuous data, we can use t-tests. We start by calculating the t-test statistic:

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$

Where $t$ is a value from what is called the t-distribution[10] with $n - 1$ degrees of freedom; $\bar{x}$ is the mean of your sample; $\mu_0$ is what you have hypothesized $\mu$ to be (the length of the earthworm or the specific gravity of selenite from the examples above); $s$ is the standard deviation of your sample; and $n$ is the sample size of your sample. The term “degrees of freedom” refers to the number of independent ways that the numbers may vary given the constraints of the system. For variance and standard deviation, we use $n - 1$ degrees of freedom.
Next, you will need to use this to calculate or look up a p-value. You can either use a t-table, as is tradition[11], or you can use software like Excel to calculate it. The p-value is a number between 0 and 1 that you can use to accept or reject the null hypothesis (that $\mu = \mu_0$). If your p-value is greater than 0.05, then you fail to reject the null hypothesis at the 95% confidence level; if it is less than 0.05, you reject the null hypothesis. When we find that two means are different from each other using a p-value, we say that their difference has statistical significance.
The way in which the p-value is checked depends on your alternative hypothesis ($\mu \neq \mu_0$, $\mu > \mu_0$, or $\mu < \mu_0$). When the alternative hypothesis is that $\mu \neq \mu_0$, the p-value is derived from the area that makes up 95% of the t-distribution (Figure 9). The left and right “tails” of the t-distribution each comprise 2.5% of the remaining area, and if the t that you calculated falls in one of those two areas, then you can reject the null hypothesis and declare your mean to be different from the hypothesized mean. For the alternative hypothesis that $\mu > \mu_0$, you will instead need to use one tail, and that tail makes up 5% of the area under the right side of the curve (for $\mu < \mu_0$, use the corresponding tail on the left). The Excel guide includes instructions for differentiating among the different alternative hypotheses.
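If you have access to Python, the SciPy library will run the whole one-sample t-test for you. This sketch reuses the selenite example; the hypothesized specific gravity of 2.3 comes from the text above, but the individual measurements are invented for illustration.

```python
from scipy import stats

# Hypothetical specific-gravity measurements of selenite crystals
measurements = [2.28, 2.31, 2.27, 2.33, 2.29, 2.32, 2.30, 2.26]

# Two-sided one-sample t-test; H0 is that the true mean equals 2.3
t_stat, p_value = stats.ttest_1samp(measurements, popmean=2.3)

print(round(t_stat, 3), round(p_value, 3))
# If p_value < 0.05, reject H0 and call the difference statistically significant.
```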
We use a z-test for determining the p-value for binomial data[12]. In that case, the null hypothesis is $p = p_0$, that is, that the proportion of the population for some trait is $p_0$. We will then calculate:

$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$$

Where $\hat{p}$ is your sample proportion, $p_0$ is the hypothesized proportion, and $n$ is the sample size. We can then use Excel[13] to determine the p-value.
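A sketch of the same z-test in Python, assuming SciPy for the normal-curve areas (the helper name proportion_z_test is just for illustration):

```python
import math
from scipy.stats import norm

def proportion_z_test(successes, n, p0):
    """Two-sided one-proportion z-test against a hypothesized proportion p0."""
    p_hat = successes / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    p_value = 2 * norm.sf(abs(z))  # area in both tails of the normal curve
    return z, p_value

# Is 52 heads out of 100 flips different from a fair coin (p0 = 0.5)?
z, p = proportion_z_test(52, 100, 0.5)
print(round(z, 2), round(p, 3))  # z = 0.4, p well above 0.05: fail to reject H0
```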
Very often, we do not have some automatic expected value of $\mu_0$ or $p_0$. Instead, we are comparing two or more samples to each other, i.e., a control and a treatment. You will want to know if they are different from each other. If you have two samples, you can use a two-sample t-test. You will have to choose a formula, depending on your results, as shown in Table 3.
Table 3: How to calculate t-statistics given different conditions. Notice that we use a z-statistic instead of a t-statistic when we calculate with proportions.
Conditions | t/z calculation | s | degrees of freedom
Equal sample sizes ($n_1 = n_2 = n$) and equal variances | $t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{2/n}}$ | $s_p = \sqrt{\frac{s_1^2 + s_2^2}{2}}$ | $2n - 2$
Unequal sample sizes and equal variances | $t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$ | $s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$ | $n_1 + n_2 - 2$
Unequal variances | $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$ | use $s_1$ and $s_2$ separately | $\frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$
Proportions | $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$ | — | —
Note that $\hat{p}$ is the proportion of the combined sample: $\hat{p} = \frac{X_1 + X_2}{n_1 + n_2}$.
Generally, the Excel Guide to Statistics will show you how to do these very simply in Excel.
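As a sketch of the most forgiving case in Table 3 (the “unequal variances” row), SciPy can do the arithmetic for you; both sets of crystal lengths below are made up for illustration.

```python
from scipy import stats

# Hypothetical crystal lengths (mm) from a control and a treatment group
control = [4.1, 3.8, 4.4, 4.0, 3.9, 4.2, 4.3]
treatment = [4.6, 4.9, 4.4, 4.8, 5.0, 4.7]

# equal_var=False gives the "unequal variances" row of Table 3;
# equal_var=True would use the pooled-variance formulas instead.
t_stat, p_value = stats.ttest_ind(control, treatment, equal_var=False)

print(round(t_stat, 3), round(p_value, 4))
# A p-value below 0.05 means the two means differ with statistical significance.
```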
Pearson Coefficients
Often times, when we look at data, we look for correlations. Data correlate when they have some sort of relationship with each other. Maybe they both ascend together, or when one variable is larger, the other is smaller. Pearson coefficients are the primary and simplest means of analyzing correlations. We start by calculating the correlation coefficient:

$$r = \frac{1}{n - 1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
Figure 10: Scatter plot with trend line, $r^2$, and p-value.
For each sample, you will need the value for two variables, $x$ and $y$. You will need to calculate the means and standard deviations for all individuals for each variable to get $\bar{x}$ and $s_x$, as well as $\bar{y}$ and $s_y$. For each pair, you will need to calculate $\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$, sum those, and then divide by $n - 1$. This whole process can be done in Excel with a simple function. The resulting correlation coefficient is very useful. Numbers near 0 indicate weak or poor correlation. If $r$ is close to 1, then you are seeing a positive correlation, so that as one variable increases, the other increases as well. When $r$ nears -1, there is a negative correlation, indicating that as one variable gets larger, the other one gets smaller (Figure 11).
Figure 11: Pearson coefficient of various scatter plots. The Pearson coefficient is primarily used to determine how linear sets of paired data are to each other. Image courtesy of Wikimedia user DenisBoigelot. |
Keep in mind that this is only a metric for linear relationships, those that behave as lines. There are more sophisticated tests for other sorts of relationships, but those are beyond the scope of this document. If you think that there is another sort of relationship, you can attempt to correlate transformed data. For example, if you think that $y$ varies with $\log x$ rather than with $x$, then you should replace $x$ with $\log x$.
Pearson coefficients are frequently reported as $r^2$ instead of $r$. Funny enough, $r^2$ is simply $r$ squared. Keep in mind that because it is a square number, $r^2$ is a value between 0 and 1 and not between -1 and 1, like $r$ is. You may also generate p-values from $r$ values. To do so, calculate:

$$t = r\sqrt{\frac{n - 2}{1 - r^2}}$$

The t-test involves $n - 2$ degrees of freedom.
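SciPy will compute both the Pearson coefficient and its p-value in one call; this sketch uses an invented sunlight-versus-plant-height data set.

```python
from scipy import stats

# Hypothetical paired data: hours of sunlight vs. plant height (cm)
sunlight = [2, 4, 5, 6, 8, 9, 11, 12]
height = [10.1, 12.3, 12.9, 14.2, 16.8, 17.1, 19.5, 20.2]

r, p_value = stats.pearsonr(sunlight, height)

print(round(r, 3))        # near 1: a strong positive, linear correlation
print(round(r ** 2, 3))   # the r-squared value that is often reported instead
print(round(p_value, 5))  # a small p-value says the correlation is significant
```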
Other Statistics Tests
Analysis of Variance (ANOVA) is a set of tools that can be used to compare multiple population means as well as multiple variables for the same populations. Think of it as a beefed up, caffeinated t-test. If you have lots of test populations and lots of variables, then you may want to look into using ANOVA to analyze your data. If you have multiple variables, then it can also be used to determine if there are interaction effects, situations in which the effects of the two variables are worth more (or less) than the sums of their individual contributions.
Chi-squared ($\chi^2$) tests can be used to find out if a set of categorical data fits expectations. If you can categorize individuals into certain kinds of things and count how many fit into each category, you can test whether those counts are similar to what you would expect. For example, you could collect flowers and categorize them by color. You might expect that all colors would have equal counts, so you could compare your counts and see if they are statistically different from the expectations. Chi-squared tests are frequently used in genetics studies.
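A minimal sketch of the flower-color example in Python; the counts are made up, and the expectation is that 60 flowers split evenly among three colors.

```python
from scipy import stats

# Hypothetical flower counts by color: red, white, pink (60 flowers total)
observed = [18, 25, 17]
expected = [20, 20, 20]  # equal counts under the null hypothesis

chi2, p_value = stats.chisquare(observed, f_exp=expected)

print(round(chi2, 2), round(p_value, 3))
# A p-value above 0.05 means the counts are consistent with the expectation.
```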
If it is important to calculate the formula for a regression line (as seen above with the Pearson coefficients), then you can use the least squares method. It is relatively straightforward, and it can be easily calculated in Excel. If you have multiple variables that you are comparing in regression analysis, you can calculate regression lines in multiple dimensions using multiple linear regression.
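For a least-squares line, SciPy’s linregress returns the slope, intercept, and correlation in one call; this sketch reuses the invented sunlight-versus-height data from the Pearson example above.

```python
from scipy import stats

sunlight = [2, 4, 5, 6, 8, 9, 11, 12]
height = [10.1, 12.3, 12.9, 14.2, 16.8, 17.1, 19.5, 20.2]

# Least-squares regression line: height = slope * sunlight + intercept
result = stats.linregress(sunlight, height)

print(round(result.slope, 3), round(result.intercept, 3))
print(round(result.rvalue ** 2, 3))  # r-squared for the fitted line
```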
There are hosts of other important tests that have been designed in the last 150 years. Information about them is scattered around the internet and in textbooks. If you don’t know what to do with your data or are unsure how to proceed, then you should absolutely ask a teacher, a college professor, or even a discussion forum on the internet.
[1] Here, I am assuming that you are using statistical analysis with quantitative data. There is some science that does not require statistics. Consider a paleontologist measuring a dinosaur bone or a primatologist observing chimpanzee behavior. Medical case studies are regularly published, and medical science is all the better for it. You might have an experiment devoid of statistics, but these experiments are rarer, and scientists are more likely to accept your results if you can apply rigorous statistical tests.
[2] These days, people often use the word statistics (often shortened to stats) to describe an athlete’s performance or the abilities of a fictional character in a video game. These are colloquial usages of the term, and they are often misleading when they are applied to science. When a scientist uses the word theory, for example, he or she is referring to an idea that has been repeatedly tested and modified. It’s something that a scientist can call true to the best of our knowledge. It’s testable and has withstood those tests. In common parlance, a theory is more like a hunch, someone’s guess at what might be true. Beware of people who apply common usage definitions of words when they are in a scientific context and vice versa.
[3] Significant digits, or significant figures, are the digits in a number that you can actually trust; digits beyond them make a value look more precise than it really is. As a student, I had trouble understanding them at first, but the truth of the matter is that when you multiply or divide numbers, you will often end up with many more digits than you can expect to accurately reflect what you are talking about. It’s not so much that they are wrong, but they can be misleading about how accurate your measurements could possibly be, or they can simply distract the reader from the value of the number. That is, long numbers (with lots of digits) aren’t necessarily the same as big numbers. While there are rules for determining significant digits, you should consider how accurate your instruments are and how readable your text is. Also, you should not have more significant digits than your instrument can precisely measure.
[4] Figure 6 is a histogram. It’s a kind of bar graph in which individuals are pooled into groups that fall within a range. Instead of counting all of the people who made $45,322.37 per year into one tiny group and repeating that for every income level down to the cent, we can group similar values into a range.
[5] We distinguish sample variance from population variance. Remember that most studies use a representative sample of the population instead of the whole population.
[6] Theoretical mean is another word for the mean of the population.
[7] Minima is the plural of minimum, and maxima is the plural of maximum.
[8] You can substitute 2.33 for the z-value to get 98% confidence or 2.58 to get 99% confidence.
[9] This formula for the margin of error is weaker when $\hat{p}$ is near 0 or 1. One solution when you have small or large proportions is to use the Wilson score interval.
[10] The vagaries of the t-distribution are beyond the scope of this guide. In essence, it is a bell-shaped curve much like the normal distribution, widened to account for the uncertainty that comes from estimating with a sample. We use it when we are making claims based on data.
[11] Textbooks still insist on using tables for all sorts of things, probably because a printed book can’t compute a p-value for you. Scientists these days don’t use tables. We have computers to do the same thing even better. If you take a statistics course in college, expect to use t-tables.
[12] Remember that binomial data is stuff like yes/no or 1/0.
[13] See the Excel guide on the website.