Jeff Eicher teaches Probability and Statistics at the Academy for Arts, Science, and Technology in Myrtle Beach, SC. He's an AP reader and has served on College Board's AP Statistics Instructional Design Team. He also teaches "Introduction to Statistical Models," an online graduate course at Eastern University.

This year I wanted to introduce my students to a powerful statistical model known as multiple regression. Although *multiple regression* is not found in the College Board Course and Exam Description (CED), I believe it provides a nice opportunity to review and extend AP Statistics concepts while not being too heavy a lift for students.

**What is multiple regression exactly?**

In a linear regression model, we use a single explanatory variable to predict a quantitative response variable. A multiple regression model chimes in with, “Um, excuse me, why not use *more than one* explanatory variable?” Charles Wheelan’s helpful popular-level book *Naked Statistics* gives the example of predicting the price of a house. We could get reasonable predictions using, say, the square footage of the home. But why use just the square footage? We can use a host of other variables, like number of bedrooms, number of bathrooms, location, acreage, year built, and whether it has a pool, to get even better predictions.
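That “host of other variables” idea is just a prediction function with one coefficient per variable. Here is a minimal sketch; every coefficient below is invented purely for illustration, not estimated from real housing data.

```python
# A hypothetical multiple regression prediction for house price.
# All coefficients are made up for illustration -- a real model
# would estimate them from data on actual home sales.

def predict_price(sqft, bedrooms, bathrooms, has_pool):
    """Predicted price = constant + one weighted term per predictor."""
    intercept = 50_000   # baseline price (hypothetical)
    b_sqft = 110         # dollars per square foot (hypothetical)
    b_bed = 8_000        # dollars per bedroom (hypothetical)
    b_bath = 12_000      # dollars per bathroom (hypothetical)
    b_pool = 15_000      # premium for a pool (hypothetical)
    return (intercept + b_sqft * sqft + b_bed * bedrooms
            + b_bath * bathrooms + b_pool * has_pool)

print(predict_price(2000, 3, 2, 1))  # -> 333000
```

Each extra variable contributes its own term, which is exactly what lets the model sharpen its predictions beyond square footage alone.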

Multiple regression is hugely relevant and deeply fascinating. In Josh Tabor and Christine Franklin’s book *Statistical Reasoning in Sports* (SRiS), they give many interesting examples of multiple regression models in sports contexts, such as predicting an NBA team’s playoff winning percentage using three variables: (1) the team’s regular season winning percentage, (2) whether the team had a player who finished in the top four of MVP voting, and (3) whether the team made the playoffs the prior year. As you can tell, this is a much more interesting, nuanced, and complex model than the ones we regularly deal with in AP Statistics.

**Should I really include a multiple regression lesson in AP Statistics?**

Absolutely! The step from a single-predictor to a multiple-predictor model isn’t as big a jump as you may think. (Granted, it can lead to lots of bigger jumps, but it doesn’t need to at this point.) In the activity below, multiple regression is introduced on the second page. I use this lesson in my AP Statistics class at the end of the regression unit as a review activity. Many of the concepts that students need to review for the test are covered, with the additional benefit of a slightly investigative context. Activities like these help students prepare for the investigative task on the AP exam, a question that connects to things they’ve learned in the course but also presents new concepts or nonroutine contexts for them to deal with on the fly.

Of course, if you’re pressed for time (as usual?), you can use this activity after the AP exam. By the end of the course, students will know about the p-values in the regression output and can analyze which explanatory variables are significant. This topic could also easily morph into an end-of-course project, where students do the data collection and make predictions with their own multiple regression model.

**What topics are reviewed in this multiple regression activity?**

Lots! This activity uses real-world data from the 2021 National Football League season, where students predict an NFL team’s winning percentage with the points they score (“points for”) and the points they allow (“points against”). Here’s a list of the topics they review that are part of the AP Statistics curriculum:

- Describing and comparing scatterplots
- Reading computer output and writing the equation of the LSRL
- Using and interpreting S and R-Squared
- Making predictions with a model
- Choosing one model over another and justifying your choice with statistical reasoning
- Calculating and interpreting residuals
- Making conjectures about variables that may or may not influence a model

**Some more collateral benefits of teaching multiple regression**

I was motivated to teach my students about multiple regression for three main reasons:

1. It gives me the chance to explain the reason for adjusted R-Squared. In theory, with a multiple regression model, you could add 30 explanatory variables and gradually increase the value of R-Squared. Adjusted R-Squared penalizes your model for using unmeaningful or unhelpful explanatory variables: we shouldn’t try to artificially inflate R-Squared by adding as many explanatory variables as we can. Like Deangelo Vickers, adjusted R-Squared says “No!” to that approach. Let’s consider the multiple regression output from the activity:

This model with two explanatory variables had an R-Squared of 0.781, but a *smaller* adjusted R-Squared of 0.766.

Let’s say we make up some data using random integers from 1 to 5 and see if our model gains predictive strength. Running a multiple regression model with now *three* explanatory variables – points for (PF), points against (PA), and randInt1-5 – gives this output:

This additional *fictitious* explanatory variable *increased* the value of R-Squared to 0.787! Adjusted R-Squared, however, penalizes the model for the additional poor explanatory variable by *lowering* the value to 0.764. If we add more and more explanatory variables that aren’t beneficial and don’t add meaning to our model, adjusted R-Squared will tank.

In other words, my second model with three explanatory variables (one being bogus!) had a higher R-Squared, but a lower adjusted R-Squared, than the model using just two explanatory variables (PF and PA).

Note: In the activity, I have students record the value of R-Squared, not R-Squared adjusted, just in case you don’t want to get into the difference.
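For the curious, the penalty comes from the standard adjusted R-Squared formula. This quick sketch reproduces the values above for the league’s 32 teams (n = 32):

```python
# Adjusted R-Squared penalizes extra predictors:
#   R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)
# where n is the number of observations and k is the number of
# explanatory variables.

def adjusted_r_squared(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r_squared(0.781, 32, 2), 3))  # PF and PA -> 0.766
print(round(adjusted_r_squared(0.787, 32, 3), 3))  # plus randInt1-5 -> 0.764
```

Notice how the denominator shrinks as k grows: a new variable has to raise R-Squared by more than a token amount, or the adjusted value goes down.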

2. My students don’t like the fact that we write the equation of the LSRL as

ŷ = *a* + *b*x
A multiple regression model illustrates the rationale for this unusual order, with the slope coming at the end of the equation (unlike y = mx + b). A multiple regression model has the form:

ŷ = *a* + *b*₁x₁ + *b*₂x₂ + ⋯ + *b*ₙxₙ
which makes the *a + bx* order more meaningful. Each time we add another explanatory variable, we further refine our predictions. Also, the constant (*a*) appears at the top of the regression output, then the coefficient of the first explanatory variable (*b*₁), then the next coefficient (*b*₂), and so on (*b*ₙ).
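That “refining” idea can be sketched in code: a prediction starts at the constant *a*, and each *b*ᵢxᵢ term adjusts it. The numbers below are arbitrary, chosen only for illustration.

```python
# A multiple regression prediction in the a + b1*x1 + ... + bn*xn form:
# start from the constant a, then refine with each predictor's term.

def predict(a, coefficients, predictors):
    y_hat = a
    for b, x in zip(coefficients, predictors):
        y_hat += b * x  # each added variable refines the prediction
    return y_hat

# e.g., y-hat = 4 + 0.5*x1 + (-2)*x2 evaluated at x1 = 10, x2 = 1
print(predict(4, [0.5, -2], [10, 1]))  # -> 7.0
```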

3. Most of all, other than reviewing lots of regression topics mentioned above, the contexts are really fascinating. Regression is hugely important in the field of statistics and data science, and I wanted my students to get a sense for how useful it is in answering interesting questions.

**How will students do multiple regression in the lesson?**

Students will use stapplet.com and its *Multiple Regression* applet to build a model and make predictions very quickly. Here’s what it looks like:

The Multiple Regression Applet:

The results with a click of a (“Begin analysis”) button:

At the bottom of the applet, it even gives a quick place to make predictions:

A team with 300 points for and 200 points against gives a predicted winning percentage of 50.07%.
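If you (or a curious student) want to peek behind the “Begin analysis” button, the least-squares fit can be computed directly from the normal equations. This is only a sketch in plain Python, using a tiny made-up dataset (not the NFL data) chosen so the true coefficients are known exactly:

```python
# Fit y-hat = a + b1*x1 + b2*x2 by solving the normal equations
# (X^T X) b = X^T y with Gaussian elimination.
# The toy data is exactly linear: y = 5 + 2*x1 - 1*x2.

def fit_two_predictor(x1, x2, y):
    rows = [[1.0, p, q] for p, q in zip(x1, x2)]  # design matrix: 1, x1, x2
    # Normal equations: A = X^T X (3x3), v = X^T y (length 3)
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    v = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    # Solve A b = v by Gaussian elimination with partial pivoting
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        v[col], v[pivot] = v[pivot], v[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            v[r] -= f * v[col]
    b = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):  # back substitution
        b[r] = (v[r] - sum(A[r][c] * b[c] for c in range(r + 1, 3))) / A[r][r]
    return b  # [intercept, slope for x1, slope for x2]

x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [5 + 2 * p - 1 * q for p, q in zip(x1, x2)]
a, b1, b2 = fit_two_predictor(x1, x2, y)
print(round(a, 6), round(b1, 6), round(b2, 6))
```

With real data, the same idea extends to more predictors by enlarging the matrices; in practice the applet (or a statistics library) does this for you.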

**Activity Tips**

1) It’s best when students copy and paste the data into stapplet – it’s way too many numbers to type individually! Warn them about copying the data from the table in the notes, since I had to break it into two tables to fit on the page. I recommend giving students this Google Sheet so the copy-paste process has fewer hiccups.

2) Beware of decimal precision. Students may round the coefficients differently, so their residuals will be different.

3) If you use this activity as part of your chapter-end review, you can ask students follow-up questions that are not explicitly asked in the activity, such as:

- How do you interpret the slope here?
- This slope is negative; what does that mean in this context?
- What does R-Squared mean in context?
- If we were to use this model to make predictions, how ‘off’ will our predictions be?
- What does a negative residual tell you here?
- How would we calculate the correlation of this model?
- Do we prefer high or low values of S? R-Squared?

4) Caution students to click the “included in model” box at the correct time. I circulate and check the R-Squared value they record for PA; if it’s identical to the one for PF, they didn’t click the correct box.

5) Make a table on the board recording R-Squared, S, and the residual, for a quick comparison by the end of the lesson:

6) In the multiple regression model, the interpretations of R-Squared and S sound the same as what students have learned, but the interpretation of the coefficients (i.e., the slopes and y-intercept) is not. The coefficient for PF, for instance, would sound like this:

“For each additional point of PF, the predicted winning percentage increases by 0.164%, *assuming PA doesn’t change*.”

7) It may be more meaningful to interpret the slope per 100-point increase instead of per 1-point increase. A 16.4% increase for each additional 100 PF makes more sense than 0.164% for each additional point.

8) The last two questions require a little knowledge of football, or sports in general. They can easily be skipped, but they are valuable for considering what other variables we might investigate to improve our model.

**What Are Possible Extensions?**

There are lots of possible extensions, but I’d save them for after the exam. These include:

- multicollinearity
- the conditions under which a multiple regression model is appropriate
- interpreting the coefficients of the model
- reading the ANOVA regression output and the F statistic
- using categorical variables (“dummy variables”) in a regression model
- assessing the effectiveness of the model (with statistics like deviance and AIC)

That’s a lot! It explains why entire chapters and books have been written just on multiple regression.
