# Regression Solution

### Example homework problem:

You work for an automotive magazine, and you are investigating the relationship between a car’s gas mileage (in miles-per-gallon) and the amount of horsepower produced by a car’s engine. You collect the following data:

| Automobile | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Horsepower | 95 | 135 | 120 | 167 | 210 | 146 | 245 | 110 | 160 | 130 |
| MPG | 37 | 19 | 26 | 20 | 24 | 30 | 15 | 32 | 23 | 33 |

Compute the best-fitting regression line (Y’ = a + bx) for predicting a car’s gas mileage from its horsepower. Test to see if the slope of this regression line deviates significantly from zero (alpha = .05).

The goal of your analysis is to __predict__ the value of the outcome variable (Y: gas mileage) given a value for the predictor variable (X: horsepower). To do this, we will compute the best-fitting regression line:

Y′ = a + bX

where the predicted value of Y (Y′) is equal to the Y intercept (*a*) plus the product of a given value of X and the slope of the regression line (*b*).

We will begin our computations by computing the __Variance/Covariance Matrix__:

| Variable | Horsepower (X) | MPG (Y) | Mean |
| --- | --- | --- | --- |
| Horsepower (X) | 2127.5111 | -243.4667 | 151.8 |
| MPG (Y) | | 48.9889 | 25.9 |

which presents the variance of X (2127.51), the variance of Y (48.99), and the covariance between X and Y (-243.47). The variable means have also been added. For help computing the variance and covariance, see the documentation for descriptive statistics and the documentation for correlation, respectively.
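As an optional check on the hand computations (not part of the original solution), here is a short Python sketch that reproduces the entries of the matrix; the variable names are my own. Note the n − 1 denominators (sample variance and covariance), which match the values above.

```python
# Data from the problem.
hp  = [95, 135, 120, 167, 210, 146, 245, 110, 160, 130]   # X: horsepower
mpg = [37, 19, 26, 20, 24, 30, 15, 32, 23, 33]            # Y: miles per gallon
n = len(hp)

mean_x = sum(hp) / n                                       # 151.8
mean_y = sum(mpg) / n                                      # 25.9

# Sample variances and covariance (n - 1 in the denominator).
var_x = sum((x - mean_x) ** 2 for x in hp) / (n - 1)       # ≈ 2127.5111
var_y = sum((y - mean_y) ** 2 for y in mpg) / (n - 1)      # ≈ 48.9889
cov_xy = sum((x - mean_x) * (y - mean_y)
             for x, y in zip(hp, mpg)) / (n - 1)           # ≈ -243.4667

print(round(var_x, 4), round(var_y, 4), round(cov_xy, 4))
```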

The slope of the regression line is equal to:

b = cov(XY) / S²(X) = -243.4667 / 2127.5111 = -.1144

The slope of the regression line (*b*) is equal to the covariance between the two variables divided by the variance of the predictor variable (X).

Next, find the value of the Y intercept:

a = M(Y) - b·M(X) = 25.9 - (-.1144)(151.8) = 43.27

If you would like help computing the means, then see the documentation for descriptive statistics. Here is the completed regression equation:

Y′ = 43.27 - .1144X
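As a quick check on the slope and intercept computations, here is a minimal Python sketch (my own variable names), reusing the summary statistics from the variance/covariance matrix:

```python
# Summary statistics computed earlier in the solution.
cov_xy = -243.4667      # covariance of X and Y
var_x  = 2127.5111      # variance of X (horsepower)
mean_x, mean_y = 151.8, 25.9

b = cov_xy / var_x                 # slope ≈ -0.1144
a = mean_y - b * mean_x            # intercept ≈ 43.27

def predict(x):
    """Predicted MPG for a car with x horsepower (Y' = a + bX)."""
    return a + b * x

print(round(b, 4), round(a, 2))    # -0.1144 43.27
```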

**Conduct hypothesis test**. We would like to know if the value of *b* is significantly different from zero. There are several approaches. For example, we can use the *t* test for the slope:

t = b / S(b)

where *b* is the slope of the regression line, *S(b)* is the __Standard Error__ of the slope, and *n* is the number of pairs of scores. The standard error of the slope is equal to:

S(b) = √[ S²(Y·X) / ((n - 1)·S²(X)) ]

The numerator of this formula (under the square root) refers to the __Variance of the Estimate__ — the variance of the Y scores around the regression line. This is computed with this formula:

S²(Y·X) = Σ(Y - Y′)² / (n - 2) = 23.768

Now that you have computed the Variance of the Estimate, you can compute the Standard Error of the Slope:

S(b) = √(23.768 / (9 × 2127.5111)) = √(23.768 / 19147.6) = .0352

And now that you have the standard error of the slope, you can compute the value of your *t* test:

t = -.1144 / .0352 = -3.25

Recall that the *df* for this *t* test is equal to *n* – 2, where *n* is the number of pairs of scores. So, our *df* is equal to 8.
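The standard-error and *t* computations can be sketched in a few lines of Python (variable names are my own; SS(X) is expanded as (n − 1) times the variance of X):

```python
import math

# Quantities computed earlier in the solution.
b = -0.1144375          # slope of the regression line
var_est = 23.768        # variance of the estimate
var_x = 2127.5111       # variance of X (horsepower)
n = 10                  # number of pairs of scores

# Standard error of the slope: sqrt(variance of estimate / SS(X)),
# where SS(X) = (n - 1) * variance of X.
se_b = math.sqrt(var_est / ((n - 1) * var_x))   # ≈ 0.0352

t = b / se_b                                    # ≈ -3.25
df = n - 2                                      # 8

print(round(se_b, 4), round(t, 2), df)
```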

Alpha was set at .05 and we will conduct a two-tailed test. When you consult your table of critical values for *t*, you will find that if the absolute value of our obtained *t* is greater than 2.306, then we would conclude that the relationship between horsepower and gas mileage is significant — i.e., that the slope of the regression line is significantly different from zero.

Since the obtained *t* (-3.25) is greater in absolute value than the critical *t* (2.306), we would conclude that there is a significant negative relationship between the horsepower level of the automobile and its gas mileage — i.e., the greater the horsepower, the lower the gas mileage.

**The Standardized Regression Coefficient**. The problem with *b* is that it will vary greatly in value depending on the scales of X and Y, which makes it hard to interpret as a descriptive statistic. So, we will typically convert *b* to the __Standardized Regression Coefficient__ (β):

β = b · (S(X) / S(Y))

The standardized regression coefficient is equal to the slope of the regression line predicting Y from X after you have converted both variables to __Standard Scores__ (Z scores). When you have one predictor, β = *r*, the correlation coefficient. Think about it — the correlation coefficient is the slope of the regression line for predicting the Z of Y from the Z of X (in that case, the Y intercept will be 0). Don’t believe me? Convert both of your variables to Z scores and repeat your regression analysis — see how it turns out!
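The check suggested above can be run in a few lines of Python (a sketch, with my own helper names): convert both variables to Z scores, refit, and see that the slope equals *r* and the intercept is zero.

```python
import math

hp  = [95, 135, 120, 167, 210, 146, 245, 110, 160, 130]
mpg = [37, 19, 26, 20, 24, 30, 15, 32, 23, 33]
n = len(hp)

def zscores(xs):
    """Convert a list of scores to Z scores (sample SD, n - 1)."""
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return [(x - m) / sd for x in xs]

zx, zy = zscores(hp), zscores(mpg)

# Slope and intercept of the regression predicting Zy from Zx.
cov_z = sum(x * y for x, y in zip(zx, zy)) / (n - 1)
var_zx = sum(x ** 2 for x in zx) / (n - 1)        # 1.0 by construction
beta = cov_z / var_zx                             # ≈ -0.7541, equal to r
intercept = sum(zy) / n - beta * (sum(zx) / n)    # 0: both Z means are 0

print(round(beta, 4))                             # -0.7541
```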

Now that you have *r*, you can easily compute *r*²: (-.7541)² = .57. This is the __Coefficient of Determination__; *r*² represents the proportion of variance in Y that can be accounted for by variation in X.

**The Analysis of Variance**. We will now cover another approach that you can use to see if the slope (*b*) of your regression line is significantly different from zero — the Analysis of Variance (ANOVA).

Our test statistic will be the value of the *F* statistic:

F = MS(Regression) / MS(Error)

The *F* statistic is equal to the ratio of the *Mean Square Due to Regression* divided by the *Mean Square Error*. *n* is equal to the number of observations (i.e., the number of automobiles), and *k* is equal to the number of predictor variables (i.e., one: horsepower).

**Compute the test statistic**. When we conduct an ANOVA, we typically construct a *Source Table* that details the sources of your variance and the corresponding degrees of freedom:

| Source | SS | df | MS | F |
| --- | --- | --- | --- | --- |
| Regression | | | | |
| Error | | | | |
| Total | | | | |

To facilitate your ANOVA, you should compute the following statistics: the sum of all of your outcome scores (ΣY), the sum of all squared outcome scores (ΣY²), and the number of observations (n). In our case, ΣY = 259, ΣY² = 7149, and n = 10.

Also, you should keep handy the value of *r*² that we computed above (*r*² = .5687).

Now, you are ready to conduct an Analysis of Variance. The *Sum of Squares* (*SS*) Total is equal to:

SS Total = ΣY² - (ΣY)²/n = 7149 - (259²/10) = 440.9

The *SS* due to regression is equal to:

SS Regression = r² × SS Total = (.5687)(440.9) = 250.755

The *SS* error can be found by subtraction:

SS Error = SS Total - SS Regression = 440.9 - 250.755 = 190.145

Add these values to your *Source Table*. The *df* Total is equal to the total number of observations minus one (n – 1). In our case this is: 10-1=9. The *df* Regression is equal to the number of predictors (k). In our case this is: 1. The *df* Error is equal to the number of observations minus two (n – 2). In our case this is: 10-2=8.

Each Mean Square (*MS*) is equal to the SS for that source divided by the *df* for that source. In our case, *MS* Regression = 250.755 / 1 = 250.755, and *MS* Error = 190.145 / 8 = 23.768. Note that *MS* Error is __exactly__ equal to the Variance of the Estimate that we computed earlier — 23.768. That's what *MS* Error means — error variance in our predictions.

Now you are ready to compute your test statistic. *F* is equal to:

F = MS Regression / MS Error = 250.755 / 23.768 = 10.55

Here is your completed *Source Table*:

| Source | SS | df | MS | F |
| --- | --- | --- | --- | --- |
| Regression | 250.755 | 1 | 250.755 | 10.55 |
| Error | 190.145 | 8 | 23.768 | |
| Total | 440.900 | 9 | | |
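The entire source table can be rebuilt from the summary statistics in a short Python sketch (my own variable names; *r*² is carried at slightly more precision than the rounded .5687 so the sums of squares match):

```python
# Summary statistics from the solution: ΣY, ΣY², n, and r².
sum_y, sum_y2, n = 259, 7149, 10
r2 = 0.568735               # unrounded coefficient of determination

ss_total = sum_y2 - sum_y ** 2 / n      # 440.9
ss_reg = r2 * ss_total                  # ≈ 250.755
ss_err = ss_total - ss_reg              # ≈ 190.145

k = 1                                   # number of predictors
df_reg, df_err, df_total = k, n - 2, n - 1

ms_reg = ss_reg / df_reg                # ≈ 250.755
ms_err = ss_err / df_err                # ≈ 23.768

F = ms_reg / ms_err                     # ≈ 10.55
print(round(ss_reg, 3), round(ms_err, 3), round(F, 2))
```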

**Conduct hypothesis test**. In our study, the *F* statistic will have 1 and 8 *df*, and alpha = .05. If you look up the critical value of *F* in your table of critical values, you will see that if your obtained *F* is greater than 5.32, you would conclude that there is a significant relationship between X and Y (i.e., that a car's horsepower is a significant predictor of the car's gas mileage).

Since the obtained *F* (10.55) is greater than the critical value for *F* (5.32), you would conclude that there *was* a significant linear relationship between horsepower and gas mileage.

**Effect Size**. We already know that the correlation between horsepower and gas mileage (*r*) is equal to -.75, and that *r*² = .57. And, you know that *r*² represents the proportion of variance in Y that can be accounted for by variance in X. So, we can say that knowing the horsepower of cars allows us to account for 57% of the variance in the cars’ gas mileage scores.

However, the problem with *r*² is that it will systematically over-estimate the variance accounted for in your outcome variable; the smaller the sample size in your study, the more biased *r*² will be. So, we typically prefer to compute an __unbiased estimate__ of the variance accounted for. There is a simple solution — we will compute __Adjusted R²__:

Adjusted R² = (S²(Y) - S²(Y·X)) / S²(Y) = (48.9889 - 23.768) / 48.9889 = .51

The numerator of this formula is equal to the variance in your outcome variable (Y) minus the Variance of the Estimate (or *MS* Error). Then, divide this difference by the variance of Y.

Notice that Adjusted *R*² will always be smaller than *r*². We would conclude that using the horsepower of cars as a predictor variable can account for about 51% of the variance in cars' gas mileage.
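A quick numeric check of the Adjusted R² computation (a sketch; values taken from the solution above):

```python
# Variance of Y and the variance of the estimate (MS Error) from above.
var_y = 48.9889         # variance of MPG
var_est = 23.768        # variance of the estimate

# Adjusted R²: (variance of Y minus error variance) over variance of Y.
adj_r2 = (var_y - var_est) / var_y      # ≈ .515
print(round(adj_r2, 2))                 # 0.51
```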

