Introduction
Data rarely fit a straight line exactly. Usually, you must be satisfied with rough predictions. Typically, you have a set of data whose scatter plot appears to fit a straight line. The line that fits the data most closely is called a line of best fit or least-squares regression line.
Collaborative Exercise
If you know a person’s pinky (smallest) finger length, do you think you could predict that person’s height? Collect data from your class (pinky finger length and height, both in inches). The independent variable, x, is pinky finger length and the dependent variable, y, is height. For each set of data, plot the points on graph paper. Make your graph big enough and use a ruler. Then, by eye, draw a line that appears to fit the data. For your line, pick two convenient points and use them to find the slope of the line. Find the y-intercept of the line by extending your line so it crosses the y-axis. Using the slopes and the y-intercepts, write your equation of best fit. Do you think everyone will have the same equation? Why or why not? According to your equation, what is the predicted height for a pinky length of 2.5 inches?
Example 12.6
A random sample of 11 statistics students produced the data in Table 12.3, where x is the third exam score out of 80 and y is the final exam score out of 200. Can you predict the final exam score of a random student if you know the third exam score?
x (third exam score)  y (final exam score) 

65  175 
67  133 
71  185 
71  163 
66  126 
75  198 
67  153 
70  163 
71  159 
69  151 
69  159 
SCUBA divers have maximum dive times they cannot exceed when going to different depths. The data in Table 12.4 show different depths in feet, with the maximum dive times in minutes. Use your calculator to find the least squares regression line and predict the maximum dive time for 110 feet.
x (depth)  y (maximum dive time) 

50  80 
60  55 
70  45 
80  35 
90  25 
100  22 
The third exam score, x, is the independent variable, and the final exam score, y, is the dependent variable. We will plot a regression line that best fits the data. If each of you were to fit a line by eye, you would draw different lines. We can obtain a line of best fit using either the median–median line approach or by calculating the least-squares regression line.
Let’s first find the line of best fit for the relationship between the third exam score and the final exam score using the median–median line approach. Table 12.5 shows the data from Example 12.6 after the ordered pairs have been listed in order of their x values. If multiple data points have the same x value, they are listed in order from least to greatest y (see the data values where x = 71). We first divide our scores into three groups of approximately equal numbers of x values per group. The first and third groups have the same number of x values. We must remember first to put the x values in ascending order. The corresponding y values are then recorded. However, to find the median, we first must rearrange the y values in each group from the least value to the greatest value. Table 12.5 shows the correct ordering of the x values but does not show a reordering of the y values.
x (third exam score)  y (final exam score) 

65  175 
66  126 
67  133 
67  153 
69  151 
69  159 
70  163 
71  159 
71  163 
71  185 
75  198 
With this set of data, the first and last groups each have four x values and four corresponding y values. The second group has three x values and three corresponding y values. We need to organize the x and y values per group and find the median x and y values for each group. Let’s now write out our y values for each group in ascending order. For group 1, the y values in order are 126, 133, 153, and 175. For group 2, the y values are already in order. For group 3, the y values are also already in order. We can represent these data as shown in Table 12.6, but notice that we have broken the ordered pairs — (65, 126) is not a data point in our original set.
Group  x (third exam score)  y (final exam score)  Median x value  Median y value 

1  65, 66, 67, 67  126, 133, 153, 175  66.5  143 
2  69, 69, 70  151, 159, 163  69  159 
3  71, 71, 71, 75  159, 163, 185, 198  71  174 
When this is completed, we can write the ordered pairs for the median values. This allows us to find the slope and y-intercept of the median–median line.
The ordered pairs are (66.5, 143), (69, 159), and (71, 174).
The slope can be calculated using the formula $m=\frac{{y}_{2}-{y}_{1}}{{x}_{2}-{x}_{1}}\text{.}$ Substituting the median x and y values from the first and third groups gives $m=\frac{174-143}{71-66.5}\text{,}$ which simplifies to $m\approx 6.9\text{.}$
The y-intercept may be found using the formula $b=\frac{\Sigma y-m\Sigma x}{3}$, which means the quantity of the sum of the median y values minus the slope times the sum of the median x values, divided by three.
The sum of the median x values is 206.5, and the sum of the median y values is 476. Substituting these sums and the slope into the formula gives $b=\frac{476-6.9(206.5)}{3}$, which simplifies to $b\approx -316.3\text{.}$
The line of best fit is represented as $y=mx+b\text{.}$
Thus, the equation can be written as y = 6.9x − 316.3.
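The manual median–median calculation above can be sketched in code. This is an illustrative sketch only: the 4, 3, 4 split is hard-coded to match the grouping used in the text, and the variable names are assumptions. It also shows the intercept obtained when the unrounded slope is carried through, which accounts for the small rounding difference between a hand calculation and software.

```python
from statistics import median

# Example 12.6: (third exam score, final exam score)
data = [(65, 175), (67, 133), (71, 185), (71, 163), (66, 126), (75, 198),
        (67, 153), (70, 163), (71, 159), (69, 151), (69, 159)]

pairs = sorted(data)  # order by x; ties in x are ordered by y
groups = [pairs[:4], pairs[4:7], pairs[7:]]  # 4, 3, 4 split, as in the text

# Median x and median y are found separately within each group
medians = [(median([x for x, _ in g]), median([y for _, y in g]))
           for g in groups]

(x1, y1), _, (x3, y3) = medians
slope = (y3 - y1) / (x3 - x1)  # uses the first and third groups only

sum_x = sum(mx for mx, _ in medians)
sum_y = sum(my for _, my in medians)

intercept = (sum_y - slope * sum_x) / 3                     # unrounded slope
intercept_rounded = (sum_y - round(slope, 1) * sum_x) / 3   # slope rounded to 6.9

print("median points:", medians)
print(f"slope = {slope:.4f}")
print(f"intercept with unrounded slope = {intercept:.1f}")
print(f"intercept with slope rounded to 6.9 = {intercept_rounded:.1f}")
```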
The median–median line may also be found using your graphing calculator. You can enter the x and y values into two separate lists; choose Stat, Calc, Med-Med; and then press Enter. The slope, a, and y-intercept, b, will be provided. The calculator shows a slight deviation from the previous manual calculation because the manual calculation used a rounded slope. Rounding to the nearest tenth, the calculator gives the median–median line of $y=6.9x-315.5\text{.}$
Each data point is of the form (x, y), and each point on the line of best fit using least-squares linear regression has the form (x, ŷ).
The ŷ is read y hat and is the estimated value of y. It is the value of y obtained using the regression line. It is not generally equal to y from data, but it is still important because it can help make predictions for other values.
The term y_{0} – ŷ_{0} = ε_{0} is called the error or residual. It is not an error in the sense of a mistake. The absolute value of a residual measures the vertical distance between the actual value of y and the estimated value of y. In other words, it measures the vertical distance between the actual data point and the predicted point on the line, or it measures how far the estimate is from the actual data value.
If the observed data point lies above the line, the residual is positive and the line underestimates the actual data value for y. If the observed data point lies below the line, the residual is negative and the line overestimates that actual data value for y.
In Figure 12.7, y_{0} – ŷ_{0} = ε_{0} is the residual for the point shown. Here the point lies above the line and the residual is positive.
ε = the Greek letter epsilon
For each data point, you can calculate the residuals or errors, y_{i} – ŷ_{i} = ε_{i} for i = 1, 2, 3, . . . , 11.
Each ε is a vertical distance.
For the example about the third exam scores and the final exam scores for the 11 statistics students, there are 11 data points. Therefore, there are 11 ε values. If you square each ε and add, you get the sum of ε squared from i = 1 to i = 11, as shown below.
$${({\epsilon}_{1})}^{2}+{({\epsilon}_{2})}^{2}+\cdots+{({\epsilon}_{11})}^{2}=\sum_{i=1}^{11}{\epsilon}_{i}^{2}\text{.}$$
This is called the sum of squared errors (SSE).
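The residuals and the SSE can be computed directly from their definitions. The sketch below assumes the rounded least-squares line ŷ = –173.51 + 4.83x that this section reports for the third exam/final exam example; because the coefficients are rounded, the SSE it prints is very slightly larger than the true minimum. Variable names are illustrative.

```python
# Example 12.6 data: third exam score (x) and final exam score (y)
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

# Rounded coefficients of the least-squares line reported in this section
a, b = -173.51, 4.83

y_hat = [a + b * xi for xi in x]                    # predicted values
residuals = [yi - yh for yi, yh in zip(y, y_hat)]   # epsilon_i = y_i - y_hat_i

for xi, yi, e in zip(x, y, residuals):
    side = "above" if e > 0 else "below"
    print(f"x={xi}, y={yi}, residual={e:+.2f} (point lies {side} the line)")

# Sum of squared errors: square each residual, then add
sse = sum(e ** 2 for e in residuals)
print(f"SSE = {sse:.2f}")
```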
Using calculus, you can determine the values of a and b that make the SSE a minimum. When you make the SSE a minimum, you have determined the points that are on the line of best fit. It turns out that the line of best fit has the equation
$\hat{y}=a+bx$
where
$a=\overline{y}-b\overline{x}$
and $b=\frac{\sum \left(x-\overline{x}\right)\left(y-\overline{y}\right)}{\sum {\left(x-\overline{x}\right)}^{2}}$.
The sample means of the x values and the y values are $\overline{x}$ and $\overline{y}$, respectively. The best-fit line always passes through the point $(\overline{x},\overline{y})$.
The slope (b) can be written as $b=r\left(\frac{{s}_{y}}{{s}_{x}}\right)$ where s_{y} = the standard deviation of the y values and s_{x} = the standard deviation of the x values. r is the correlation coefficient, which shows the relationship between the x and y values. This will be discussed in more detail in the next section.
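As a check on the formulas for a and b, the sketch below applies them to the third exam/final exam data and confirms numerically that the slope also equals r(s_y / s_x) and that the line passes through the point of means. It is a plain-Python illustration; the variable names are assumptions, and the sample standard deviations use the n – 1 divisor.

```python
from math import sqrt

# Example 12.6 data: third exam score (x) and final exam score (y)
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squared deviations and cross-products about the means
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

b = sxy / sxx            # slope
a = y_bar - b * x_bar    # y-intercept

# Same slope written as r * (s_y / s_x)
r = sxy / sqrt(sxx * syy)
s_x = sqrt(sxx / (n - 1))
s_y = sqrt(syy / (n - 1))
b_from_r = r * s_y / s_x

print(f"y-hat = {a:.2f} + {b:.2f}x")
print(f"slope from r*(s_y/s_x) = {b_from_r:.4f}")
print(f"line at x-bar: {a + b * x_bar:.4f}, y-bar: {y_bar:.4f}")
```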
Least-Squares Criteria for Best Fit
The process of fitting the best-fit line is called linear regression. We assume that the data are scattered about a straight line. To find that line, we minimize the sum of the squared errors (SSE), or make it as small as possible. Any other line you might choose would have a higher SSE than the best-fit line. This best-fit line is called the least-squares regression line.
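The claim that any other line has a higher SSE can be checked numerically for particular alternatives. The sketch below compares the least-squares line for the third exam/final exam data with two other candidates: the median–median line y = 6.9x − 316.3 found earlier and the least-squares line shifted up by 5 points. The shifted line and the helper name `sse` are chosen only for illustration; checking two alternatives demonstrates the idea but does not prove it.

```python
# Example 12.6 data: third exam score (x) and final exam score (y)
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]


def sse(intercept, slope):
    """Sum of squared vertical distances from the data to the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Least-squares coefficients from the formulas for a and b
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

sse_ls = sse(a, b)              # least-squares line
sse_median = sse(-316.3, 6.9)   # median-median line from earlier in the section
sse_shifted = sse(a + 5, b)     # least-squares line moved up 5 points

print(f"least-squares SSE: {sse_ls:.1f}")
print(f"median-median SSE: {sse_median:.1f}")
print(f"shifted-line SSE:  {sse_shifted:.1f}")
```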
Computer spreadsheets, statistical software, and many calculators can quickly calculate the best-fit line and create the graphs. The calculations tend to be tedious if done by hand. Instructions to use the TI-83, TI-83+, and TI-84+ calculators to find the best-fit line and create a scatter plot are shown at the end of this section.
Third Exam vs. Final Exam Example
The graph of the line of best fit for the third exam/final exam example is as follows:
The least-squares regression line (best-fit line) for the third exam/final exam example has the equation ŷ = –173.51 + 4.83x.
Understanding and Interpreting the y-Intercept
The y-intercept, a, of the line describes where the line crosses the y-axis. The y-intercept of the best-fit line tells us the predicted value of y when x is zero. In some cases, it does not make sense to figure out what y is when x = 0. For example, in the third exam vs. final exam example, the y-intercept occurs when the third exam score, or x, is zero. Since all the scores are grouped around a passing grade, there is no need to figure out what the final exam score, or y, would be when the third exam score was zero.
However, the y-intercept is very useful in many cases. For many examples in science, the y-intercept gives the baseline reading when the experimental conditions aren’t applied to an experimental system. This baseline indicates how much the experimental condition affects the system. It could also be used to ensure that equipment and measurements are calibrated properly before starting the experiment. In biology, the concentration of proteins in a sample can be measured using a chemical assay that changes color depending on how much protein is present. The more protein present, the darker the color. The amount of color can be measured by the absorbance reading. Table 12.7 shows the expected absorbance readings at different protein concentrations. This is called a standard curve for the assay.
Concentration (mM)  Absorbance (mAU) 

125  0.021 
250  0.023 
500  0.068 
750  0.086 
1,000  0.105 
1,500  0.124 
2,000  0.146 
The scatter plot in Figure 12.9 includes the line of best fit.
The y-intercept of this line occurs at 0.0226 mAU. This means the assay gives a reading of 0.0226 mAU when there is no protein present. That is, it is the baseline reading that can be attributed to something else, which, in this case, is other nonprotein chemicals that absorb light. We can tell that this line of best fit is reasonable because the y-intercept is small, close to zero. When there is no protein present in the sample, we expect the absorbance to be very small, or close to zero, as well.
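The baseline reading quoted above can be reproduced from the standard-curve data in Table 12.7. The following is a small sketch using the least-squares formulas for slope and intercept; the variable names are illustrative, and the units are taken as printed in the table.

```python
# Table 12.7: protein concentration and absorbance reading
concentration = [125, 250, 500, 750, 1000, 1500, 2000]
absorbance = [0.021, 0.023, 0.068, 0.086, 0.105, 0.124, 0.146]

n = len(concentration)
x_bar = sum(concentration) / n
y_bar = sum(absorbance) / n

# Least-squares slope and intercept
slope = (sum((x - x_bar) * (y - y_bar)
             for x, y in zip(concentration, absorbance))
         / sum((x - x_bar) ** 2 for x in concentration))
intercept = y_bar - slope * x_bar

# The intercept is the predicted absorbance at zero concentration
print(f"baseline absorbance (y-intercept): {intercept:.4f}")
print(f"change in absorbance per unit of concentration (slope): {slope:.6f}")
```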
Understanding Slope
The slope of the line, b, describes how changes in the variables are related. It is important to interpret the slope of the line in the context of the situation represented by the data. You should be able to write a sentence interpreting the slope in plain English.
Interpretation of the Slope: The slope of the best-fit line tells us how the dependent variable (y) changes for every one-unit increase in the independent (x) variable, on average.
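Third Exam vs. Final Exam Example Slope: The slope of the line is b = 4.83. That is, for a one-point increase in the third exam score, the predicted final exam score increases by 4.83 points, on average.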
Using the TI-83, 83+, 84, 84+ Calculator
Using the Linear Regression T Test: LinRegTTest
 In the STAT list editor, enter the x data in list L1 and the y data in list L2, paired so that the corresponding (x, y) values are next to each other in the lists. (If a particular pair of values is repeated, enter it as many times as it appears in the data.)
 On the STAT TESTS menu, scroll down and select LinRegTTest. (Be careful to select LinRegTTest. Some calculators may also have a different item called LinRegTInt.)
 On the LinRegTTest input screen, enter Xlist: L1, Ylist: L2, and Freq: 1.
 On the next line, at the prompt β or ρ, highlight ≠ 0 and press ENTER.
 Leave the line for RegEQ: blank.
 Highlight Calculate and press ENTER.
The output screen contains a lot of information. For now, let’s focus on a few items from the output and return to the other items later.
Graphing the Scatter Plot and Regression Line
 We are assuming the x data are already entered in list L1 and the y data are in list L2.
 Press 2nd STATPLOT ENTER to use Plot 1.
 On the input screen for PLOT 1, highlight On, and press ENTER.
 For TYPE, highlight the first icon, which is the scatter plot, and press ENTER.
 Indicate Xlist: L1 and Ylist: L2.
 For Mark, it does not matter which symbol you highlight.
 Press the ZOOM key and then the number 9 (for menu item ZoomStat); the calculator fits the window to the data.
 To graph the best-fit line, press the Y= key and type the equation –173.5 + 4.83X into equation Y1. (The X key is immediately left of the STAT key.) Press ZOOM 9 again to graph it.
 Optional: If you want to change the viewing window, press the WINDOW key. Enter your desired window using Xmin, Xmax, Ymin, and Ymax.
NOTE
Another way to graph the line after you create a scatter plot is to use LinRegTTest.
 Make sure you have done the scatter plot. Check it on your screen.
 Go to LinRegTTest and enter the lists.
 At RegEq, press VARS and arrow over to Y-VARS. Press 1 for 1:Function. Press 1 for 1:Y1. Then, arrow down to Calculate and do the calculation for the line of best fit.
 Press Y= (you will see the regression equation).
 Press GRAPH, and the line will be drawn.
The Correlation Coefficient r
Besides looking at the scatter plot and seeing that a line seems reasonable, how can you determine whether the line is a good predictor? Use the correlation coefficient as another indicator (besides the scatter plot) of the strength of the relationship between x and y.
The correlation coefficient, r, developed by Karl Pearson during the early 1900s, is numeric and provides a measure of the strength and direction of the linear association between the independent variable x and the dependent variable y.
If you suspect a linear relationship between x and y, then r can measure the strength of the linear relationship.
What the Value of r Tells Us
 The value of r is always between –1 and +1. In other words, –1 ≤ r ≤ 1.
 The size of the correlation r indicates the strength of the linear relationship between x and y. Values of r close to –1 or to +1 indicate a stronger linear relationship between x and y.
 If r = 0, there is absolutely no linear relationship between x and y (no linear correlation).
 If r = 1, there is perfect positive correlation. If r = –1, there is perfect negative correlation. In both these cases, all the original data points lie on a straight line. Of course, in the real world, this does not generally happen.
What the Sign of r Tells Us
 A positive value of r means that when x increases, y tends to increase and when x decreases, y tends to decrease (positive correlation).
 A negative value of r means that when x increases, y tends to decrease and when x decreases, y tends to increase (negative correlation).
 The sign of r is the same as the sign of the slope, b, of the best-fit line.
Note
The correlation coefficient is calculated as the number of data points times the sum of the products of the x- and y-coordinates, minus the product of the sum of the x-coordinates and the sum of the y-coordinates, all divided by the square root of the product of two quantities: the number of data points times the sum of the squared x-coordinates minus the square of the sum of the x-coordinates, and the number of data points times the sum of the squared y-coordinates minus the square of the sum of the y-coordinates. It can be summarized by the following equation:
$r=\frac{n\Sigma (xy)-(\Sigma x)(\Sigma y)}{\sqrt{\left[n\Sigma {x}^{2}-{(\Sigma x)}^{2}\right]\left[n\Sigma {y}^{2}-{(\Sigma y)}^{2}\right]}}$
where n is the number of data points.
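The calculation described in this note can be carried out step by step on the third exam/final exam data. The sketch below follows the verbal description term by term; the variable names are illustrative.

```python
from math import sqrt

# Example 12.6 data: third exam score (x) and final exam score (y)
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

n = len(x)
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

# Numerator: n * sum(xy) - (sum x)(sum y)
numerator = n * sum_xy - sum_x * sum_y

# Denominator: square root of [n*sum(x^2) - (sum x)^2][n*sum(y^2) - (sum y)^2]
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

r = numerator / denominator
print(f"r = {r:.4f}")
```

A positive r here agrees with the positive slope of the best-fit line for the same data.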
The formula for r looks formidable. However, computer spreadsheets, statistical software, and many calculators can calculate r quickly. The correlation coefficient, r, is the bottom item in the output screens for the LinRegTTest on the TI-83, TI-83+, or TI-84+ calculator (see previous section for instructions).
The Coefficient of Determination
The variable r^{2} is called the coefficient of determination and it is the square of the correlation coefficient, but it is usually stated as a percentage, rather than in decimal form. It has an interpretation in the context of the data:
 ${r}^{2}\text{,}$ when expressed as a percent, represents the percentage of variation in the dependent (predicted) variable y that can be explained by variation in the independent (explanatory) variable x using the regression (best-fit) line.
 1 – ${r}^{2}\text{,}$ when expressed as a percentage, represents the percentage of variation in y that is not explained by variation in x using the regression line. This can be seen as the scattering of the observed data points about the regression line.
Consider the third exam/final exam example introduced in the previous section.
 The line of best fit is: ŷ = –173.51 + 4.83x.
 The correlation coefficient is r = .6631.
 The coefficient of determination is r^{2} = .6631^{2} = .4397.
Interpret r^{2} in the context of this example.
 Approximately 44 percent of the variation (0.4397 is approximately 0.44) in the final exam grades can be explained by the variation in the grades on the third exam, using the best-fit regression line.
 Therefore, the rest of the variation (1 – 0.44 = 0.56 or 56 percent) in the final exam grades cannot be explained by the variation of the grades on the third exam with the best-fit regression line. This unexplained variation appears as the scatter of the observed data points about the regression line.
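The interpretation of r² as explained variation can be checked numerically. The sketch below splits the total variation in the final exam scores into the part left over after fitting the line (the SSE) and the part the line accounts for, then compares the resulting proportion with the square of r. The names `sst` and `sse` are illustrative labels, and the least-squares coefficients are recomputed from the data rather than taken in rounded form.

```python
from math import sqrt

# Example 12.6 data: third exam score (x) and final exam score (y)
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)

b = sxy / sxx           # least-squares slope
a = y_bar - b * x_bar   # least-squares intercept

# Total variation in y about its mean
sst = sum((yi - y_bar) ** 2 for yi in y)
# Variation left unexplained by the regression line
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - sse / sst   # proportion of variation explained
r = sxy / sqrt(sxx * sst)   # correlation coefficient

print(f"r^2 from variation = {r_squared:.4f}")
print(f"r^2 from r         = {r ** 2:.4f}")
print(f"explained: {100 * r_squared:.0f} percent, "
      f"unexplained: {100 * (1 - r_squared):.0f} percent")
```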