Fiveable
Fiveable
AP Statistics

📊ap statistics review

2.5 Correlation

Verified for the 2025 AP Statistics examLast Updated on June 18, 2024

Correlation is a measure of the strength and direction of the relationship between two variables, and this is numerically represented with the correlation coefficient, which in stats we denote as r

The correlation coefficient shows the degree to which there is a linear correlation between the two variables, that is, how close the points are to forming a line. It can be positive or negative and this is the same as the direction of the scatterplot. The coefficient takes a value between -1 and 1, where r = -1 means that the points fall exactly on an decreasing line while r = 1 means that the points fall exactly on a increasing line. A correlation coefficient of 0 means that there is no correlation between the data points. 🌞

It's important to note that the correlation coefficient only measures the linear relationship between two variables. It does not indicate the strength or nature of any nonlinear relationships that may exist. Additionally, the correlation coefficient does not indicate the cause and effect relationship between the two variables. As mentioned before, correlation =/= causation! 🙅

Examples

Here are some scatterplots and their values of r:

image courtesy of: math.nayland.school.nz

Also, there are a few things to keep in mind about correlation. 

  • Even if r has a high magnitude, the relationship may not be linear, but instead it may be curved. We'll discuss this more in later sections.
  • A high magnitude of correlation does not imply causation.
  • The correlation coefficient is not resistant to outliers, which makes sense, given that the formula that we shall learn uses the mean and standard deviation, which by themselves are not resistant.

Calculating the Correlation Coefficient

To find the value of r, we have this formula that is found on the formula sheet:

Although this may seem like a complicated formula, it’s not that bad to understand (but harder to compute) To find r, first find the mean and standard deviations of both the x and y variables. Then, for each data point, multiply the x and y z-scores for that point. Finally, add all the individual products up and divide by the number of data points minus 1. ‍💫

You will seldom need to do this by hand, and most graphing calculators can easily find this. On the most common graphing calculator used in AP Stats (TI-84), you will enter your data into L1 and L2, go to Stats>Calc>LinReg like below: 

To be sure that you get the r-value, verify that "Stats Diagnostics" is on via MODE.

🎥 Watch: AP Stats - Scatterplots and Association

Practice Problems

(1) A study was conducted to examine the relationship between hours of exercise per week and body mass index (BMI). The following scatterplot shows the results of the study for a sample of 25 individuals.

Based on the scatterplot, which of the following statements is true?

(A) There is a strong positive correlation between hours of exercise per week and BMI.

(B) There is a strong negative correlation between hours of exercise per week and BMI.

(C) There is a moderate positive correlation between hours of exercise per week and BMI.

(D) There is a moderate negative correlation between hours of exercise per week and BMI.

(E) There is no correlation between hours of exercise per week and BMI.

(2) TRUE or FALSE

  • A scatterplot is a graphical representation of the relationship between two variables.
  • A correlation coefficient of 1 indicates a strong positive correlation between two variables.
  • A correlation coefficient of -1 indicates a strong positive correlation between two variables.
  • A correlation coefficient of 0 indicates no correlation between two variables.
  • The correlation coefficient only measures linear relationships between two variables.
  • The correlation coefficient indicates the strength and direction of the relationship between two variables.
  • The correlation coefficient indicates the cause and effect relationship between two variables.
  • Correlation implies causation, meaning that if two variables are correlated, one variable must cause the other.
  • A scatterplot can show nonlinear relationships between two variables.
  • A scatterplot can be used to predict the value of one variable based on the value of the other variable.

Answers

(1) In this case, the correct answer would be (D) There is a moderate negative correlation between hours of exercise per week and BMI. This can be seen from the scatterplot, which shows that as the hours of exercise per week increase, the BMI decreases. The relationship between the two variables is not perfectly linear, but there is a clear trend in the data. The correlation coefficient could be calculated to quantify the strength of this relationship.

(2) T, T, F, T, T, T, F, F, T, T

[

](https://app.fiveable.me/ap-stats/unit-2/scatterplots-association/watch/KqjaqR3xVBs7M1J6oHJN)[](https://maths.nayland.school.nz/Year_13_Maths/3.9_Bivariate_data/7_Corr_Coefficient_r.html)

Key Terms to Review (10)

AP Stats (Advanced Placement Statistics): AP Stats is a college-level course designed to introduce students to the concepts and tools of statistics. It focuses on data collection, analysis, interpretation, and drawing conclusions based on statistical reasoning, providing a foundation for understanding real-world data and its applications.
BMI (Body Mass Index): BMI, or Body Mass Index, is a numerical value derived from an individual's weight and height, calculated by dividing a person's weight in kilograms by the square of their height in meters ($$BMI = \frac{weight (kg)}{height (m)^2}$$). It serves as a screening tool to categorize individuals into different weight classifications, such as underweight, normal weight, overweight, and obesity. Understanding BMI is important for assessing health risks related to body weight and can help in evaluating the correlation between weight and various health outcomes.
Causation: Causation refers to the relationship between two variables where one variable directly influences or produces a change in another variable. This concept is crucial in statistics, as it helps distinguish between mere correlation and a true cause-and-effect relationship. Understanding causation is vital for drawing conclusions from data, making predictions, and informing decision-making processes.
Correlation: Correlation is a statistical measure that expresses the extent to which two variables are linearly related. It helps in understanding how changes in one variable may be associated with changes in another, making it a key concept in evaluating relationships in data analysis and regression models.
Linear Correlation: Linear correlation is a statistical measure that describes the strength and direction of a linear relationship between two quantitative variables. This relationship is typically assessed using the correlation coefficient, which can range from -1 to 1, indicating perfect negative correlation and perfect positive correlation, respectively. Understanding linear correlation helps in predicting one variable based on another and is essential for analyzing data trends.
LinReg (Linear Regression): Linear Regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. This technique helps in predicting the value of the dependent variable based on the values of the independent variables, revealing trends and correlations within the data.
Mean: The mean is a measure of central tendency that represents the average value of a set of numbers, calculated by summing all values and dividing by the total number of values. It plays a crucial role in understanding data distributions, analyzing sampling distributions, and making inferences about populations based on sample data.
Outliers: Outliers are data points that significantly differ from the rest of the data in a dataset. These values can have a major impact on statistical analyses and can affect the results of regression models, correlation calculations, and various statistical tests. Identifying outliers is crucial, as they can indicate variability in measurements, experimental errors, or novel insights into the data.
Standard Deviation: Standard deviation is a measure of the amount of variation or dispersion in a set of values. It indicates how much the individual data points deviate from the mean, providing insight into the spread of data in various statistical contexts such as regression, confidence intervals, and hypothesis testing.
Z-scores: A Z-score is a statistical measurement that describes a value's relation to the mean of a group of values, expressing how many standard deviations a data point is from the mean. This concept helps in understanding the relative position of a score within a dataset, allowing for comparisons across different distributions.