Lesson Notes By Weeks and Term v5 - Grade 12

Statistics: regression and correlation – Week 4 focus

Subject: Mathematics

Class: Grade 12

Term: 3rd Term

Week: 4

Theme: General lesson support

Performance objectives

Lesson summary

This week covers regression and correlation. These statistical tools let us describe and quantify the relationship between two variables, identify patterns, and make predictions. They matter not only for the exam but also for interpreting real-world data, from economic trends to social patterns and environmental changes within South Africa. For instance, we can investigate the relationship between unemployment rates and crime statistics, or the correlation between rainfall and crop yields in different provinces.

Lesson notes

Correlation: Correlation measures the strength and direction of a linear relationship between two variables. The correlation coefficient, denoted by r, ranges from -1 to +1.

r close to +1: Strong positive correlation (as one variable increases, the other tends to increase).

r close to -1: Strong negative correlation (as one variable increases, the other tends to decrease).

r close to 0: Weak or no linear correlation.

Important: Correlation does not imply causation. Just because two variables are correlated doesn't mean that one causes the other. There might be a lurking variable influencing both.

Calculating the Correlation Coefficient (r): The formula for calculating r is:

r = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / √[Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)²]

Where:

xᵢ and yᵢ are the individual data points.

x̄ and ȳ are the means of the x and y data sets, respectively.

Σ represents the sum over all data points.

Example 1: Let's say we want to investigate the correlation between the number of hours a Grade 12 student spends studying per week (x) and their mathematics exam score (y).

We collect data from 5 students:

| Student | Hours Studying (x) | Exam Score (y) |
|---|---|---|
| 1 | 5 | 60 |
| 2 | 10 | 75 |
| 3 | 15 | 85 |
| 4 | 20 | 90 |
| 5 | 25 | 95 |

Calculate the means:

x̄ = (5+10+15+20+25)/5 = 15

ȳ = (60+75+85+90+95)/5 = 81

Calculate the terms for the formula:

| Student | xᵢ - x̄ | yᵢ - ȳ | (xᵢ - x̄)(yᵢ - ȳ) | (xᵢ - x̄)² | (yᵢ - ȳ)² |
|---|---|---|---|---|---|
| 1 | -10 | -21 | 210 | 100 | 441 |
| 2 | -5 | -6 | 30 | 25 | 36 |
| 3 | 0 | 4 | 0 | 0 | 16 |
| 4 | 5 | 9 | 45 | 25 | 81 |
| 5 | 10 | 14 | 140 | 100 | 196 |
| Sum | | | 425 | 250 | 770 |

Calculate r:

r = 425 / √(250 × 770) = 425 / √192500 ≈ 425 / 438.75 ≈ 0.97

Interpretation: r = 0.97 indicates a strong positive correlation between hours studying and exam scores. This means that students who study more tend to score higher on the exam.
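The hand calculation in Example 1 can be cross-checked with a short script. This is a minimal sketch in plain Python that applies the deviation formula for r directly; the function name `pearson_r` and the variable names are illustrative choices, not part of the lesson.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r computed from the deviation formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Data from Example 1
hours = [5, 10, 15, 20, 25]
scores = [60, 75, 85, 90, 95]

r = pearson_r(hours, scores)
print(round(r, 2))  # 0.97
```

The three sums inside the function correspond to the 425, 250 and 770 in the "Sum" row of the table.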

Regression: Regression analysis is used to find the best-fitting line (the regression line) that describes the relationship between two variables. This line can then be used to predict the value of one variable (the dependent variable, y) based on the value of the other variable (the independent variable, x).

The Least Squares Regression Line: The equation of the least squares regression line is:

y = a + bx

Where:

y is the predicted value of the dependent variable.

x is the value of the independent variable.

a is the y-intercept (the value of y when x = 0).

b is the slope (the change in y for every one-unit increase in x).

Calculating a and b:

b = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / Σ(xᵢ - x̄)² = covariance(x,y) / variance(x)

a = ȳ - b × x̄

Example 2: Using the same data from Example 1, let's find the equation of the least squares regression line.

We already calculated: x̄ = 15, ȳ = 81, Σ[(xᵢ - x̄)(yᵢ - ȳ)] = 425, and Σ(xᵢ - x̄)² = 250

Calculate b:

b = 425 / 250 = 1.7

Calculate a:

a = 81 - (1.7 × 15) = 81 - 25.5 = 55.5

Therefore, the equation of the least squares regression line is:

y = 55.5 + 1.7x

Interpretation: The slope (b = 1.7) means that for every additional hour a student studies, their exam score is predicted to increase by 1.7 points. The y-intercept (a = 55.5) represents the predicted exam score for a student who studies 0 hours (although this might not be practically meaningful in this context).
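The same steps can be sketched in plain Python as a check on Example 2. The helper name `least_squares_line` is an illustrative choice; it returns the intercept a and slope b using the formulas above, and the fitted line is then evaluated at a sample value of x (12 hours, chosen here for illustration).

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

# Data from Example 1
hours = [5, 10, 15, 20, 25]
scores = [60, 75, 85, 90, 95]

a, b = least_squares_line(hours, scores)
print(round(a, 2), round(b, 2))  # 55.5 1.7

# Evaluate the fitted line at x = 12 hours
predicted = a + b * 12
print(round(predicted, 1))  # 75.9
```

Predictions are most trustworthy for x values inside the range of the data (here, 5 to 25 hours); using the line far outside that range is extrapolation and should be treated with caution.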

Making Predictions: We can use the regression equation to predict exam scores. For example, if a student studies for 12 hours, we predict their exam score to be: y = 55.5 + 1.7(12) = 55.5 + 20.4 = 75.9

Outliers: Outliers are data points that lie far away from the rest of the data. They can significantly affect the correlation and regression results. It's important to identify and consider outliers carefully. Sometimes they represent errors in the data, while other times they represent genuinely unusual cases.
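To see how much a single unusual point can matter, the sketch below compares r and the least squares line for the five Example 1 data points with and without one extra point, (x = 40, y = 70), which is the unusual sixth student considered in Example 3. The helper name `summarise` is an illustrative choice. The output is deliberately not shown, so the script can be used to check a hand calculation.

```python
import math

def summarise(xs, ys):
    """Return (r, a, b): correlation, intercept and slope for the data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    r = s_xy / math.sqrt(s_xx * s_yy)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return r, a, b

# Data from Example 1
hours = [5, 10, 15, 20, 25]
scores = [60, 75, 85, 90, 95]

# Same data with the unusual sixth student (40 hours, score 70) appended
hours_out = hours + [40]
scores_out = scores + [70]

r0, a0, b0 = summarise(hours, scores)
r1, a1, b1 = summarise(hours_out, scores_out)

print(f"without outlier: r = {r0:.2f}, y = {a0:.2f} + {b0:.2f}x")
print(f"with outlier:    r = {r1:.2f}, y = {a1:.2f} + {b1:.2f}x")
```

Comparing the two printed lines shows how far one point can pull both the correlation coefficient and the slope.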

Example 3: The Impact of an Outlier

Suppose we add a sixth student to our dataset from Example 1. This student studies 40 hours a week (x = 40) but only obtains a score of 70 (y = 70). This is quite unusual. Recalculating the regression line with this additional data point will demonstrate how drastically an outlier can alter the results. (This is left as an exercise for the reader.)

Guided Practice (With Solutions)

Question 1: A researcher wants to study the relationship between household income (in thousands of Rand) and monthly spending on groceries (in Rand).

They collect data from 8 households:

| Household | Income (x, thousands of Rand) | Grocery Spending (y, Rand) |
|---|---|---|
| 1 | 15 | 2000 |
| 2 | 25 | 3000 |
| 3 | 35 | 3500 |
| 4 | 40 | 4000 |
| 5 | 45 | 4200 |
| 6 | 50 | 4500 |
| 7 | 55 | 5000 |
| 8 | 60 | 5500 |

Calculate the correlation coefficient (r).
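A hand-calculated answer to Question 1 can be checked with a short script. This sketch uses the raw-sums form of the formula, r = (nΣxy - ΣxΣy) / √[(nΣx² - (Σx)²)(nΣy² - (Σy)²)], which is algebraically equivalent to the deviation formula in the notes but is not the form the lesson uses. It is chosen here only because the means are not whole numbers for this data set. The function and variable names are illustrative.

```python
import math

def pearson_r_from_sums(xs, ys):
    """Correlation coefficient r using the raw-sums form of the formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Data from Question 1
income = [15, 25, 35, 40, 45, 50, 55, 60]                  # thousands of Rand
groceries = [2000, 3000, 3500, 4000, 4200, 4500, 5000, 5500]  # Rand

r_q1 = pearson_r_from_sums(income, groceries)
print(round(r_q1, 3))
```

Work the question by hand first, then run the script and compare the printed value with your own answer.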