Jeff Eicher, Jr. is a Data Science Professor at Eastern University in St. Davids, PA, and taught high school math and AP Statistics for 13 years. He has assisted with supplemental materials for The Practice of Statistics, Statistics and Probability with Applications, and the MathMedic Assessment Platform. He has been an AP Statistics Exam reader for the past 8 years. He enjoys spending time with his wife and daughter, playing cards and board games, jumping on the trampoline, and watching all things Frozen.
A large exercise center has several thousand members from age 18 to 55 years and several thousand members age 56 and older. The manager of the center is considering offering online fitness classes. The manager is investigating whether members’ opinions of taking online fitness classes differ by age. The manager selected a random sample of 170 exercise center members ages 18 to 55 years and a second random sample of 230 exercise center members ages 56 years and older. Each sampled member was asked whether they would be interested in taking online fitness classes.
The manager found that 51 of the 170 sampled members ages 18 to 55 years and 79 of the 230 sampled members ages 56 years and older said they would be interested in taking online fitness classes.
At a significance level of α = 0.05, do the data provide convincing statistical evidence of a difference in the proportion of all exercise center members ages 18 to 55 years who would be interested in taking online fitness classes and the proportion of all exercise center members ages 56 years and older who would be interested in taking online fitness classes? Complete the appropriate inference procedure to justify your response.
This question had a few surprises compared to inference questions in previous years. The biggest was that it was the first FRQ! Nearly every year prior, the first question was on exploratory data analysis, offering students a graphic to analyze and discuss, while the inference question came later (typically FRQ #4). I confess that I thought the CED’s ordering of the questions by skill number implied that the first FRQ would be on Skill 1 (Selecting Statistical Methods). Boy, was I wrong!
Another surprise (compared to previous rubrics) was that this rubric didn’t accept merely “evidence” in the conclusion but required an adjective like “convincing” or “strong” or “statistical” evidence. I guess I missed that in the 2023 inference question rubric. More on that later.
The last surprise came when I analyzed common problems that students had. For those experienced readers out there, which component do you think was missed by the highest percentage of students? Which component did the highest percentage of students earn? What component would set apart a student scoring a 3 and a student scoring a 4? What was the most common scoring of a 2 (PPP, IEE, EEI, EPI, PEI, or another permutation)? What was the most common scoring for a 1 paper (two Ps or one E…on which part)? Make your guesses then see the “Data Analysis” section at the end of the post.
SCORING
This question was scored essentially correct (E), partially correct (P), or incorrect (I) in three sections.
Section 1 had three components:
(1) Identify the appropriate procedure for a difference in population proportions with name or formula (with variables or numbers)
(2) State correct hypotheses (a null of equal proportions and a two-sided alternative)
(3) Include sufficient context (the two groups and the population).
An essentially correct response needed all 3 components, while a partially correct response was 2 components of the 3.
WOULD THIS GET CREDIT? (component 1 - identify the procedure)
Response 1 is correct. Either name was accepted, even though the first was calculator-speak and the second could be confused with a Two Sample Z Test for a difference in means (when the population standard deviations are known).
Response 2 is correct. This response uses the formula with correct values plugged in (even though they should use the combined proportion in the denominator).
Response 3 is incorrect because we test for the difference in population proportions, not sample proportions.
Response 4 is incorrect because we test for the difference in population proportions, not sample proportions.
Teaching tips:
Give students the opportunity to use 2PropZTest on the calculator with counts given and percentages (or proportions) given. Many students knew what test to use on their calculator and implemented it well (that’s probably because counts were given, rather than percentages; see the more difficult 2019 #4).
Requiring students to write the full test statistic formula (and calculating the p-value with a -cdf command) is beneficial conceptually and for preparing for the MCQ section. Discourage students from doing so on the inference FRQ; it takes too much time and is too dangerous. Numerous students had acceptable (full credit) responses until they tried to write the formula.
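For anyone who wants to sanity-check the arithmetic behind this question, here is a short Python sketch (my own illustration, not part of any student response or the rubric) of the two-proportion z test statistic, using the combined (pooled) proportion in the standard error as the rubric prefers:

```python
from math import sqrt

# Counts from the stem: 51 of 170 (ages 18-55), 79 of 230 (ages 56+)
x1, n1 = 51, 170
x2, n2 = 79, 230

p1_hat = x1 / n1                 # 0.30
p2_hat = x2 / n2                 # ~0.3435
p_pool = (x1 + x2) / (n1 + n2)   # combined (pooled) proportion: 130/400 = 0.325

# Under H0: p1 = p2, the standard error uses the pooled proportion
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se

print(round(z, 2))  # -0.92, matching the 2PropZTest output
```

A positive 0.92 comes out instead if the student subtracts in the other order, which the rubric also accepted.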
WOULD THIS GET CREDIT? (component 2 - hypotheses)
Response 1 is correct because the response uses standard notation (p) for proportions and has a null of equality and an alternative of inequality.
Response 2 is correct because the response correctly describes, with words, a null of equality and an alternative of inequality.
Response 3 is incorrect because, although the student tries to use context directly from the stem of the question, the response does not indicate a difference in proportions.
Response 4 is incorrect because the response does not indicate a difference in proportions.
Teaching tips:
Some students defined p1, p2, and p1 – p2. This is more than is needed. Defining p1 and p2 is enough OR defining p1 – p2 is enough. In my teaching, I would have my students define p (or pi, where i is the group) to save time and writing.
Using meaningful subscripts can be very helpful. It’s easier to see what you’re comparing; I would even recommend dropping p1 and p2 altogether! Three cheers for more context!
WOULD THIS GET CREDIT? (component 3 - context of population and groups)
Response 1 is correct for many reasons. It uses three different ways of referring to the population (“population proportion”, “all”, “who would take”) and identifies the two groups.
Response 2 is correct because the standard notation of p is used and there are two groups (1 & 2). If the response stopped there, it would be unclear which group was which. But thankfully, later in the response, it mentions the two groups, so this was deemed sufficient.
Response 3 is incorrect because the language of “who said” refers to the sample, rather than the population, even though the response included “population” proportion.
Response 4 is incorrect because the language of “interested” refers to the sample, rather than the population.
Teaching tips:
The issue for most students was identifying the population, rather than the groups. There weren’t many ways to get the groups wrong, but students revealed a lack of understanding by using past-tense verbs and other language referring to the sample in their parameters. When reviewing answers to inference questions, pause and ask, “Are you talking about the population or the sample?” Before having students read their definition of the parameter, show them a few poorly worded hypotheses (that use past tense, “sample proportion”, etc.). Or have them write an “almost correct” definition of the parameter and have their neighbor find the problem.
Related to the first bullet, make sure students know the difference between p-hat and p. When first reading an FRQ, have them label the information: is this p-hat? Is this p? Is this n? Is this X (the count)? You could even give them an assessment where all they have to do is label the given information in the stem. You could give 10 stems in 10 minutes, where they don’t perform any of the steps; they just label the important information.
Section 2 had four components:
(1) Check independence condition (random samples and 10% condition)
(2) Check that samples are large enough
(3) Correctly report the test statistic
(4) Correctly report the p-value (consistent with the reported test statistic and alternative hypothesis).
An essentially correct response needed all 4 components, while a partially correct response was 2 or 3 components of the 4.
Gone are the days of 2019 when students could earn an E with only 4 out of 5 components in this section. But that should be no surprise. This section’s rubric is structured like the rubric of the inference question in 2021 and similar to 2022 and 2023, with slight differences due to the type of inference procedure.
WOULD THIS GET CREDIT? (component 1 - random and independence conditions)
Response 1 is correct because both groups were randomly selected and the 10% condition was checked for both individual groups.
Response 2 is correct because each sample was checked for randomness and independence separately.
Response 3 is incorrect for two reasons. First, it’s unclear if one random sample was taken (then divided into two groups) or if the groups were sampled separately. Second, “all gym members” sounds like one single population, whereas the condition should check the 10% for two separate populations.
Response 4 is incorrect because it sounds like the combination of 170 and 230 (so 400?) is less than 10% of a single population. Also, the response did not check that both samples were randomly selected--the check mark is ambiguous. Does it mean there was random assignment of treatments or random selection of individuals?
WOULD THIS GET CREDIT? (component 2 - Large Counts condition)
Response 1 is correct because all the expected counts (using the combined proportion) are shown to be greater than 10.
Response 2 is correct because all the observed counts are shown to be greater than 10. (Technically, the combined proportion should be used, but this was ignored.)
Response 3 is incorrect because the response does not use a boundary (10 or 5) to check the condition. It is unclear what the response is actually checking: That they are all decimals? That they are all greater than 30? That they are all different from each other?
Response 4 is incorrect because, although it correctly checks all the expected counts, it also (sadly) refers to an inappropriate condition in this context (greater than 30).
Teaching tips:
Oh conditions. How we love you and hate you. One student used “random samplesssss”, which tells me a teacher out there really emphasized that point! Do not allow students to use random (check)--the single check does not work for two samples and it is unclear how the random condition is satisfied.
Consider having students check conditions for one sample at a time. For example, check the conditions for the 18-55 age group, then repeat for the other group. They would be far less likely to miss checking the random and 10% conditions correctly (as 70% of students did! See below).
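That one-sample-at-a-time habit can be sketched in Python (my own illustration; the counts come from the stem, and I use the combined proportion for the expected counts as the rubric prefers):

```python
# Check the Large Counts condition one group at a time,
# using the combined (pooled) proportion under H0: p1 = p2.
x1, n1 = 51, 170   # ages 18-55
x2, n2 = 79, 230   # ages 56+
p_pool = (x1 + x2) / (n1 + n2)  # 130/400 = 0.325

for label, n in [("ages 18-55", n1), ("ages 56+", n2)]:
    successes = n * p_pool         # expected successes
    failures = n * (1 - p_pool)    # expected failures
    ok = successes >= 10 and failures >= 10
    print(f"{label}: {successes:.2f} and {failures:.2f} both at least 10? {ok}")
```

All four expected counts (55.25, 114.75, 74.75, 155.25) clear the boundary of 10, so the condition is met for both groups.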
WOULD THIS GET CREDIT? (component 3 - test statistic)
Response 1 is correct because it shows the correct value of the test statistic. Note: A positive test statistic is also OK since it’s dependent on the order of subtraction that the student is free to choose.
Response 2 is correct because the test statistic is correct, although there was a writing error in the denominator (0.56 should be 0.66) and the standard error should be using the combined proportion instead of the sample proportions.
Response 3 is incorrect because it uses the incorrect notation of T rather than Z.
Response 4 is incorrect because it never identifies the test statistic! (This was a common omission among student responses.)
WOULD THIS GET CREDIT? (component 4 - p-value)
Response 1 is correct because it identifies the p-value. This “list-the-calculator-output” style of response was very common among student responses.
Response 2 is correct because it provides a one-sided p-value consistent with the one-sided alternative hypothesis the response gave earlier. The probability notation is nice but not required.
Response 3 is incorrect because it uses the test statistic as the p-value.
Response 4 is incorrect because the value of the p-value was not identified, even though the diagram correctly illustrates the area in question. (Note: the diagram is unclear if this is a normal or t-distribution, so -0.92 alone would not get credit for the previous “test statistic” component.)
Teaching tips:
Students spent a lot of time and effort on calculating the test statistic and p-value. Point estimates, standard errors, formulas, formulas with numbers in it, normal curves drawn and nicely labeled, normalcdf commands, etc. There is a lot of value in this, but teachers should inform students what is typically required on the exam. Have a turning point in class when you get to AP exam review; make it a big deal: “Ok class, I’ve been requiring you to show SO much work to get the p-value. I release you. You are free to do the short cut.”
Give students questions where they are provided the test statistic (not the data or counts or stats) and have them calculate the p-value. That’s great MCQ prep and requires they use a -cdf command (or table), rather than a test short-cut like 2PropZTest. That is where you should assess their understanding from the previous bullet, not when they are doing the full inference procedure.
Lastly, tell students to include both the p-value and the test statistic! I saw many papers that forgot to include the test statistic, likely because the teacher never communicated that it’s required.
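The -cdf computation from the tip above needs nothing beyond Python’s standard library. Here is a sketch (my own illustration) that starts from the test statistic alone, the way that MCQ-prep exercise would, and doubles the tail area for the two-sided alternative:

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF: the stdlib equivalent of normalcdf(-inf, z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

z = -0.92                           # test statistic handed to the student
p_value = 2 * normal_cdf(-abs(z))   # two-sided: double the tail area

print(round(p_value, 2))  # 0.36
```

Students who reported the positive test statistic (0.92) get the same p-value, since the two-sided area is symmetric.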
Section 3 had two components.
(1) Make a correct decision about the null and/or alternative hypothesis based on the p-value AND the α-level.
(2) Provide sufficient (i.e., lots of) context and use non-definitive language in terms of the alternative hypothesis. For context, I’ll use PGUV, since we like acronyms in AP Stats: P = proportions, G = both age groups, U = the sampling units, i.e., “members”, and V = the variable of interest (“taking online fitness classes”). Without all four of PGUV, a response did not earn component (2). (See the 8th bullet on this section of the rubric.)
We’ll consider both components at once.
WOULD THIS GET CREDIT? (components 1 and 2 - conclusion)
Response 1 is correct because it compares the p-value to alpha and has sufficient context. The possible reference to the sample (“interested”) would be considered under the first section and ignored here. Although there’s no explicit statement about the null, the first component required a decision about the null or the alternative.
Response 2 is correct because it compares the p-value to alpha and has sufficient context. The response includes an interpretation of the p-value, which was correct.
Response 3 is incorrect because it does not explicitly compare the p-value and alpha (there’s no “>”). The response also uses definitive language, lacking an appropriate adjective like “convincing” to go with “evidence.” The context is insufficient as well, because it never mentions the proportion. Note: an incorrect interpretation of the p-value is included (i.e., “0.92, more or less” would be the entire area under the curve!), which would reduce the response one level (from E to P, or from P to I).
Response 4 is incorrect because “accept the null” is too definitive, losing the first component. The context component is also lost since they use “people” rather than “members.” Recall that the question is about gym members, not people everywhere.
Teaching tips:
Overall, students did well with comparing the p-value to alpha and choosing “fail to reject”. Many students restated the null (in words) and then the alternative (in words). That’s a lot of writing and time! Encourage students to simply say “the null” to save time.
There is value in encouraging students to state conclusions in their own words. For a test setting, however, your scores will increase if you hold students to the high standard of rewriting the context from the stem of the question. Create an activity where students leave out one part of the context – does that part matter? Could the meaning change? For example, would all these words be acceptable: the number of members, the proportion of members, the percentage of members, the count of members, the mean proportion of members? Or this: is using “people” the same as using “members”?
Rewriting the context from the stem applies to the adjective with “evidence”! The stem said “convincing statistical evidence”; if students copied that wording, they would have avoided using definitive language. This distinction between evidence and convincing evidence is not new, though rubrics differ. See the 2021 rubric’s 5th bullet, where “evidence” alone is considered non-definitive language, but then check the 2023 rubric’s 10th bullet, where “evidence” alone is considered definitive language.
Encouraging students to interpret the p-value is conceptually helpful but dangerous if they do it on the exam. Of the students who included a p-value interpretation, 60% got it wrong; remember, the interpretation wasn’t even required here!
Data Analysis:
This analysis is based on over 1,000 papers that I scored. I cannot claim my sample was representative of the population of all exams, so proceed with caution.
Which component was missed by the highest percentage of students?
Only 30% of students got Section 2, component 1 correct (checking the random and 10% conditions). That was shocking! Although many students didn’t check conditions at all or forgot one of the two, the most common issue was failing to make clear that two separate random samples were selected.
The second worst component was the last one: only 37% of students could conclude in context, or I should say, conclude with sufficient (PGUV) context.
Which component did the highest percentage of students earn?
Section 1, component 2 (writing hypotheses) was the easiest for students. 59% were able to write a null of equal proportions and an alternative of unequal proportions.
What component would set apart a student scoring a 3 and a student scoring a 4?
To earn a 3, students would need either an EEP, an EPE, or a PEE. From what I’ve mentioned in bullet 1, you may be able to guess what permutation was most common: EPE, students not checking conditions correctly (or not identifying the test statistic). 65% of my 3s were EPEs. And that’s significant!
Based on a chi-square GOF test, I get a p-value of basically 0, so I have evidence (I mean, convincing evidence) that the distribution of EEP/EPE/PEEs among scores of 3 is not uniform.
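To illustrate that GOF calculation, here is a sketch with made-up counts: only the 65% EPE figure comes from my sample above; the EEP/PEE split and the total of 100 papers are assumptions purely for illustration. With three categories (df = 2), the chi-square tail probability conveniently reduces to exp(-χ²/2), so no special functions are needed:

```python
from math import exp

# HYPOTHETICAL counts for 100 score-3 papers: only the 65% EPE share
# is from the post; the EEP/PEE split is my assumption for illustration.
observed = {"EEP": 20, "EPE": 65, "PEE": 15}

n = sum(observed.values())
expected = n / len(observed)  # uniform null: each permutation equally likely

chi2 = sum((obs - expected) ** 2 / expected for obs in observed.values())

# For df = 2, P(X > chi2) = exp(-chi2 / 2) exactly.
p_value = exp(-chi2 / 2)
print(f"chi-square = {chi2:.1f}, p-value = {p_value:.1e}")
```

Any split with EPE anywhere near 65% of the papers drives the p-value to essentially 0, which is the point: the distribution of EEP/EPE/PEE among 3s is far from uniform.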
What was the most common scoring of a 2 (PPP, EEI, EPI or another permutation)?
There are 12 different permutations that would give a score of 2; the most common I observed was EPP (29% of the 2s). But this is no surprise, given the first bullet above.
What was the most common scoring for a 1 paper?
To earn a 1, students would need exactly 1 E or exactly two Ps. 40% of the papers I scored had an IPP. If a paper had only the following, it would be minimally enough to earn an IPP = 1:
Z = 0.92, p = 0.36 > alpha, fail to reject H0
2024 #1 Scoring Activity
Whew! That was a lot of information! Do you want to practice scoring FRQs with your students, but don't know where to start? Luke and Lindsey created an activity for scoring the 2024 FRQ in class (using some of Mr. Eicher's detailed analysis of the rubric!). They presented the activity at the 2024 NCTM Annual Conference in Chicago.
Download PowerPoint Slides
We recommend spending one 50-minute class period on this activity towards the end of Math Medic AP Stats Unit 9. After students work through the FRQ, share the handout with the students. The PowerPoint explains the required components for each section of the rubric. Go through one section at a time and then have the students score themselves or their classmates.