Lesson Notes By Weeks and Term v5 - Grade 12

Statistics: regression and correlation – Week 5 focus


Subject: Mathematics

Class: Grade 12

Term: 3rd Term

Week: 5

Theme: General lesson support


Performance objectives

Lesson summary

This week covers regression and correlation, a branch of statistics for exploring relationships between two or more variables. In everyday language, we ask whether changes in one quantity (such as hours studied) can help us predict changes in another (such as exam score). Why is this important? Think of predicting the price of maize from rainfall levels, forecasting job growth from education levels, or studying the relationship between crime rates and socio-economic factors in South Africa. Regression and correlation provide the tools to analyse these connections and make informed predictions.

Lesson notes

2.1 Correlation

Correlation describes the strength and direction of a linear relationship between two numerical variables. It does not imply causation. The correlation coefficient, denoted by 'r', is a number between -1 and +1:

- r > 0: Positive correlation (as one variable increases, the other tends to increase).
- r < 0: Negative correlation (as one variable increases, the other tends to decrease).
- r = 0: No linear correlation (doesn't mean there's no relationship at all, just no linear one).
- r close to +1: Strong positive correlation.
- r close to -1: Strong negative correlation.
- r close to 0: Weak or no linear correlation.

Formula for the Pearson Correlation Coefficient (r):

r = ∑[(xᵢ - x̄)(yᵢ - ȳ)] / √[∑(xᵢ - x̄)² ∑(yᵢ - ȳ)²]

Where:

- xᵢ and yᵢ are the individual data points
- x̄ and ȳ are the means of the x and y values respectively
- ∑ denotes summation over all data points

Example 1: Calculating the correlation coefficient

Let's say we want to investigate the relationship between the number of hours a student studies and their test score.

We have the following data for 5 students:

| Student | Hours Studied (x) | Test Score (y) |
|---|---|---|
| 1 | 2 | 60 |
| 2 | 4 | 70 |
| 3 | 6 | 80 |
| 4 | 8 | 90 |
| 5 | 10 | 100 |

Step 1: Calculate the means (x̄ and ȳ)

x̄ = (2+4+6+8+10)/5 = 6

ȳ = (60+70+80+90+100)/5 = 80

Step 2: Calculate (xᵢ - x̄), (yᵢ - ȳ), (xᵢ - x̄)², (yᵢ - ȳ)² and (xᵢ - x̄)(yᵢ - ȳ) for each student:

| Student | xᵢ - x̄ | yᵢ - ȳ | (xᵢ - x̄)² | (yᵢ - ȳ)² | (xᵢ - x̄)(yᵢ - ȳ) |
|---|---|---|---|---|---|
| 1 | -4 | -20 | 16 | 400 | 80 |
| 2 | -2 | -10 | 4 | 100 | 20 |
| 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2 | 10 | 4 | 100 | 20 |
| 5 | 4 | 20 | 16 | 400 | 80 |

Step 3: Sum the columns (except Student):

∑(xᵢ - x̄)² = 40

∑(yᵢ - ȳ)² = 1000

∑(xᵢ - x̄)(yᵢ - ȳ) = 200

Step 4: Substitute into the formula:

r = 200 / √(40 * 1000) = 200 / √40000 = 200 / 200 = 1

Therefore, r = 1. This indicates a perfect positive linear correlation.

2.2 Regression

Regression analysis aims to find the best-fitting line (or curve in more advanced cases) that describes the relationship between the variables. We focus on linear regression, where we find a straight line. This line is called the least squares regression line or the line of best fit.
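As a quick check on Example 1 before moving on to the regression formulas, the four steps can be sketched in Python. The function name `pearson_r` and the variable names are illustrative choices, not part of the lesson; the data is the five-student table from Example 1.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient via the deviation-from-mean formula."""
    n = len(xs)
    x_bar = sum(xs) / n  # Step 1: means
    y_bar = sum(ys) / n
    # Steps 2 and 3: deviations, their squares and products, then the sums
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    # Step 4: substitute into the formula
    return s_xy / sqrt(s_xx * s_yy)


hours = [2, 4, 6, 8, 10]
scores = [60, 70, 80, 90, 100]
r = pearson_r(hours, scores)
print(round(r, 4))  # 1.0
```

Running the sketch should reproduce r = 1 from Step 4. Note that the function divides by zero if all x values (or all y values) are identical, since the denominator is then 0.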

The equation of this line is:

y = a + bx

Where:

- y is the dependent variable (the one we're trying to predict).
- x is the independent variable (the one we're using to make the prediction).
- a is the y-intercept (the value of y when x = 0).
- b is the slope (the change in y for every one-unit change in x).

Formula for the slope (b): b = ∑[(xᵢ - x̄)(yᵢ - ȳ)] / ∑(xᵢ - x̄)² Notice the numerator is the same as in the correlation coefficient formula.

Formula for the y-intercept (a): a = ȳ - b * x̄

Example 2: Finding the regression line using the data from Example 1

We already calculated ∑[(xᵢ - x̄)(yᵢ - ȳ)] = 200 and ∑(xᵢ - x̄)² = 40. Also, x̄ = 6 and ȳ = 80.

Step 1: Calculate the slope (b): b = 200 / 40 = 5

Step 2: Calculate the y-intercept (a): a = 80 - 5 * 6 = 80 - 30 = 50

Step 3: Write the equation of the regression line: y = 50 + 5x

This means that for every hour a student studies, their test score is predicted to increase by 5 points. If a student studies 0 hours, their predicted score is 50.

Example 3: Regression with a South African Context

Let's consider the relationship between monthly rainfall (in mm) and maize yield (in tons per hectare) in a specific region of South Africa. Suppose we have the following data for the past 5 years:

| Year | Rainfall (mm) (x) | Maize Yield (tons/ha) (y) |
|---|---|---|
| 1 | 50 | 2 |
| 2 | 70 | 3 |
| 3 | 90 | 4 |
| 4 | 110 | 5 |
| 5 | 130 | 6 |

Following the same steps as above (calculating means, differences, sums, etc.), we find:

x̄ = 90, ȳ = 4, b = 0.05, a = -0.5

The regression equation is: y = -0.5 + 0.05x

This suggests that for every 1 mm increase in rainfall, the maize yield is predicted to increase by 0.05 tons per hectare. The y-intercept of -0.5 is not practically meaningful here (negative yield). The equation is only valid within the observed range of rainfall values.

2.3 Important Considerations

Correlation does NOT equal causation. Just because two variables are correlated does not mean that one causes the other. There might be other factors influencing both variables (confounding variables).

Extrapolation: Be cautious when making predictions outside the range of your data. The relationship might not hold true beyond the observed values.
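To make the extrapolation warning concrete, here is a minimal Python sketch using the rainfall and maize data from Example 3. The helper names `fit_line` and `predict` are hypothetical, and flagging any x outside the observed minimum and maximum is just one simple way to mark a prediction as an extrapolation.

```python
def fit_line(xs, ys):
    """Least squares intercept a and slope b for the line y = a + bx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b


def predict(a, b, x, x_min, x_max):
    """Return (predicted y, True if x lies outside the observed range)."""
    return a + b * x, not (x_min <= x <= x_max)


rainfall = [50, 70, 90, 110, 130]   # mm, from Example 3
maize_yield = [2, 3, 4, 5, 6]       # tons/ha, from Example 3

a, b = fit_line(rainfall, maize_yield)
lo, hi = min(rainfall), max(rainfall)

inside = predict(a, b, 100, lo, hi)   # 100 mm is within 50 to 130 mm
outside = predict(a, b, 0, lo, hi)    # 0 mm is far below the observed range

print(round(inside[0], 2), inside[1])    # 4.5 False
print(round(outside[0], 2), outside[1])  # -0.5 True
```

The second prediction is the negative yield discussed in Example 3: the arithmetic is correct, but the input lies outside the data the line was fitted to, so the result should not be trusted.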

Linearity: Regression analysis assumes a linear relationship. If the relationship is non-linear, a linear regression model will not be appropriate. Scatter plots are very useful for checking if a linear relationship is reasonable.
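As an illustration of why a scatter plot is worth checking, the sketch below uses made-up data (not from the lesson) in which y is determined exactly by x through y = x², yet the Pearson coefficient is 0. The function name `pearson_r` is an illustrative choice.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient via the deviation-from-mean formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical data: a perfect curve, symmetric about x = 0
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # [4, 1, 0, 1, 4]

r = pearson_r(xs, ys)
print(r)  # 0.0
```

Here the positive and negative products (xᵢ - x̄)(yᵢ - ȳ) cancel, so r = 0 even though the relationship is perfectly predictable. A straight line fitted to this data would describe it poorly, which a scatter plot would reveal immediately.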

Outliers: Outliers can drastically influence the regression line. It's important to identify outliers and consider their impact on your analysis.

Guided Practice (With Solutions)

Question 1: The table below shows the number of learners in a class and the average exam score for those learners.