Lesson Notes By Weeks and Term v5 - Grade 12

Statistics: regression and correlation – Week 3 focus


Subject: Mathematics

Class: Grade 12

Term: 3rd Term

Week: 3

Theme: General lesson support


Performance objectives

Lesson summary

This week we take a closer look at regression and correlation, building on the statistics covered in previous grades. Regression analysis models the relationship between two variables, which lets us make predictions and describe how one variable tends to change as the other changes. Correlation quantifies the strength and direction of that relationship. Both tools are used to identify trends and support decisions in real-world problems, from predicting crop yields based on rainfall to analyzing economic indicators.

Lesson notes

2.1 Correlation Coefficient (Pearson's r): The correlation coefficient, denoted by r, measures the strength and direction of a linear relationship between two variables, x and y. It ranges from -1 to +1.

- r = +1: Perfect positive linear correlation (all points lie exactly on a line with positive slope).
- 0 < r < 1: Positive correlation (as x increases, y tends to increase).
- r = 0: No linear correlation. However, there may still be a non-linear relationship.
- -1 < r < 0: Negative correlation (as x increases, y tends to decrease).
- r = -1: Perfect negative linear correlation (all points lie exactly on a line with negative slope).

Formula for Pearson's r:

```
r = ∑[(xi - x̄)(yi - ȳ)] / √[∑(xi - x̄)² ∑(yi - ȳ)²]
```

Where:

- xi represents the individual x values.
- yi represents the individual y values.
- x̄ represents the mean of the x values.
- ȳ represents the mean of the y values.
- ∑ represents summation over all data points.
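The formula above can be sketched in Python as a cross-check for hand calculations. This is a minimal illustration, not a prescribed method: the function name `pearson_r` is hypothetical, and the two lists are the hours-studied and exam-score values from Example 1.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return Pearson's r for two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    x_bar = sum(xs) / len(xs)  # mean of the x values
    y_bar = sum(ys) / len(ys)  # mean of the y values
    # Sum of products of deviations, and sums of squared deviations
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return s_xy / sqrt(s_xx * s_yy)


hours = [2, 4, 6, 8, 10]        # x values from Example 1
scores = [55, 70, 80, 85, 90]   # y values from Example 1
print(round(pearson_r(hours, scores), 3))  # 0.969
```

The function refuses to return a value when either variable is constant, because the denominator of the formula is then zero.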

Example 1: Calculating the correlation coefficient. Consider the following data representing the number of hours studied (x) and the exam score (y) for 5 students:

| Student | Hours Studied (x) | Exam Score (y) |
|---------|-------------------|----------------|
| 1 | 2 | 55 |
| 2 | 4 | 70 |
| 3 | 6 | 80 |
| 4 | 8 | 85 |
| 5 | 10 | 90 |

Calculate the means:

x̄ = (2+4+6+8+10)/5 = 6

ȳ = (55+70+80+85+90)/5 = 76

Calculate (xi - x̄)(yi - ȳ), (xi - x̄)², and (yi - ȳ)² for each student:

| Student | xi | yi | xi - x̄ | yi - ȳ | (xi - x̄)(yi - ȳ) | (xi - x̄)² | (yi - ȳ)² |
|---------|----|----|---------|---------|--------------------|-----------|-----------|
| 1 | 2 | 55 | -4 | -21 | 84 | 16 | 441 |
| 2 | 4 | 70 | -2 | -6 | 12 | 4 | 36 |
| 3 | 6 | 80 | 0 | 4 | 0 | 0 | 16 |
| 4 | 8 | 85 | 2 | 9 | 18 | 4 | 81 |
| 5 | 10 | 90 | 4 | 14 | 56 | 16 | 196 |

Sum the columns:

∑[(xi - x̄)(yi - ȳ)] = 84 + 12 + 0 + 18 + 56 = 170

∑(xi - x̄)² = 16 + 4 + 0 + 4 + 16 = 40

∑(yi - ȳ)² = 441 + 36 + 16 + 81 + 196 = 770

Calculate r:

```
r = 170 / √(40 * 770) = 170 / √30800 ≈ 170 / 175.499 ≈ 0.969
```

Therefore, the correlation coefficient r is approximately 0.969. This indicates a strong positive linear correlation between the number of hours studied and the exam score.

2.2 Least Squares Regression Line: The least squares regression line (also known as the line of best fit) is the line that minimizes the sum of the squared differences between the observed y values and the y values predicted by the line.

It's represented by the equation:

```
ŷ = a + bx
```

Where:

- ŷ (y-hat) is the predicted value of y for a given value of x.
- a is the y-intercept (the value of ŷ when x = 0).
- b is the slope (the change in ŷ for a one-unit increase in x).
- x is the independent variable.
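As a rough sketch, the line ŷ = a + bx can be fitted in Python using the slope and intercept formulas given in this section: b = ∑[(xi - x̄)(yi - ȳ)] / ∑(xi - x̄)² and a = ȳ - b * x̄. The function name `least_squares_line` is hypothetical, and the data are the hours-studied and exam-score values from Example 1.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y_hat = a + b*x fitted by least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("slope is undefined when all x values are equal")
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b


hours = [2, 4, 6, 8, 10]        # x values from Example 1
scores = [55, 70, 80, 85, 90]   # y values from Example 1
a, b = least_squares_line(hours, scores)
print(f"y_hat = {a} + {b}x")  # y_hat = 50.5 + 4.25x
```

Because a is computed from the means, the fitted line always passes through the point (x̄, ȳ), here (6, 76).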

Formulas for calculating a and b:

```
b = ∑[(xi - x̄)(yi - ȳ)] / ∑(xi - x̄)² = r * (Sy/Sx)
```

```
a = ȳ - b * x̄
```

Where:

- Sy is the standard deviation of the y values.
- Sx is the standard deviation of the x values.

Example 2: Finding the equation of the regression line (using the data from Example 1). We already calculated ∑[(xi - x̄)(yi - ȳ)] = 170 and ∑(xi - x̄)² = 40 from Example 1. We also know x̄ = 6 and ȳ = 76.

Calculate the slope (b):

```
b = 170 / 40 = 4.25
```

Calculate the y-intercept (a):

```
a = 76 - (4.25 * 6) = 76 - 25.5 = 50.5
```

Therefore, the equation of the least squares regression line is:

```
ŷ = 50.5 + 4.25x
```

Interpretation: For every additional hour studied, the exam score is predicted to increase by 4.25 points. A student who studies zero hours is predicted to score 50.5, which may not be realistic, since x = 0 lies outside the observed range of 2 to 10 hours.

2.3 Correlation vs. Causation: It's crucial to understand that correlation does not imply causation. Just because two variables are correlated does not mean that one causes the other. There may be a lurking variable (a third variable) that is influencing both variables.

Example: A study finds a positive correlation between ice cream sales and crime rates. Does this mean that eating ice cream causes crime? Probably not. A lurking variable, such as temperature, likely influences both ice cream sales (higher temperatures lead to more ice cream sales) and crime rates (higher temperatures can lead to more social interaction and potentially more crime).

2.4 Limitations of Regression Analysis:

- Linearity: Regression analysis assumes a linear relationship between the variables. If the relationship is non-linear, the regression line may not be a good fit.
- Outliers: Outliers (extreme values) can significantly affect the regression line and the correlation coefficient.
- Extrapolation: Extrapolating (making predictions outside the range of the data) can be unreliable. The relationship between the variables may change outside the observed range.
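Two of these limitations can be illustrated with the data from Example 1. The sketch below rests on stated assumptions: the helper `fit_line` is a hypothetical name, the sixth student (10 hours, score 30) is an invented outlier used only for illustration, and the notes do not state the maximum exam score, so a maximum of 100 is assumed.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


hours = [2, 4, 6, 8, 10]        # x values from Example 1
scores = [55, 70, 80, 85, 90]   # y values from Example 1
a, b = fit_line(hours, scores)

# Extrapolation: x = 20 is far outside the observed range of 2 to 10 hours.
predicted_at_20 = a + b * 20
print(predicted_at_20)  # 135.5, above the assumed maximum score of 100

# Outlier: add one invented student who studied 10 hours but scored 30.
a_out, b_out = fit_line(hours + [10], scores + [30])
print(round(b_out, 2))  # 0.31, down from 4.25 without the outlier
```

A single extreme point pulls the slope from 4.25 down to about 0.31, and the extrapolated prediction exceeds the assumed maximum score, which is why both checks are worth doing before trusting a fitted line.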