Probability and Statistics

Simple linear regression models the relationship between two continuous variables under the assumption that the relationship is linear. It is a fundamental tool for predicting a response variable and for quantifying how changes in one variable are associated with changes in another.

This technique uses an explanatory variable to predict a response variable. By fitting a line of best fit, calculating slope and y-intercept, and assessing model fit, we can understand and quantify the relationship between variables.

Simple linear regression

  • Fundamental statistical technique used to model and analyze the relationship between two continuous variables
  • Assumes a linear relationship exists between the explanatory variable (X) and the response variable (Y)
  • Enables predictions and inferences about the response variable based on the values of the explanatory variable (a minimal fitting sketch follows this list)
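
As a concrete illustration, here is a minimal fitting sketch in Python using SciPy's stats.linregress; the study-time and exam-score values are made up for demonstration.

```python
import numpy as np
from scipy import stats

# Hypothetical data: hours studied (X) and exam score (Y)
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
score = np.array([52, 55, 61, 64, 70, 72, 75, 82])

# Fit Y = beta0 + beta1 * X by ordinary least squares
fit = stats.linregress(hours, score)
print(f"slope (beta1):     {fit.slope:.3f}")      # change in score per extra hour
print(f"intercept (beta0): {fit.intercept:.3f}")  # predicted score at zero hours
print(f"R-squared:         {fit.rvalue ** 2:.3f}")
```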

Relationship between variables

  • Examines how changes in one variable are associated with changes in another variable
  • Determines the strength and direction of the relationship between the explanatory and response variables

Explanatory vs response variables

  • Explanatory variable (X) is the independent variable that is used to explain or predict changes in the response variable
  • Response variable (Y) is the dependent variable that is being explained or predicted by the explanatory variable
  • Example: In a study of the relationship between study time (X) and exam scores (Y), study time is the explanatory variable, and exam scores are the response variable

Scatterplots

  • Graphical representation of the relationship between two continuous variables
  • Each data point represents a pair of values for the explanatory and response variables
  • Scatterplots help visualize the strength, direction, and shape of the relationship between variables
  • Example: A scatterplot of height (X) and weight (Y) data points can reveal a positive linear relationship (a plotting sketch follows this list)
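
A minimal plotting sketch, assuming synthetic height and weight data generated to follow a roughly linear trend:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: weight is roughly linear in height, plus noise
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=50)                    # cm
weight = 0.9 * height - 90 + rng.normal(0, 5, size=50)   # kg

plt.scatter(height, weight)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Height vs. weight")
plt.show()
```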

Line of best fit

  • Represents the linear model that best describes the relationship between the explanatory and response variables
  • Determined by finding the line that minimizes the sum of squared residuals (differences between observed and predicted values)

Slope and y-intercept

  • Slope (β₁) represents the change in the response variable (Y) for a one-unit increase in the explanatory variable (X)
    • Its sign gives the direction of the relationship; its magnitude is the rate of change in Y per unit of X
  • Y-intercept (β₀) represents the predicted value of the response variable when the explanatory variable is zero
    • Provides a reference point for the line of best fit, though it may not be meaningful when X = 0 lies outside the observed data

Residuals and errors

  • Residuals are the differences between the observed values of the response variable and the predicted values from the regression line
  • Errors are the deviations of the observations from the true (unknown) population regression line; residuals are their sample counterparts, computed from the fitted line
  • Smaller residuals indicate a better fit of the model to the data

Least squares method

  • Statistical method used to estimate the parameters (slope and y-intercept) of the line of best fit
  • Minimizes the sum of squared residuals to find the optimal line that best fits the data

Minimizing sum of squared residuals

  • The line of best fit is chosen by finding the values of the slope and y-intercept that minimize the sum of squared residuals
  • Squaring the residuals ensures that positive and negative residuals do not cancel each other out
  • Under the standard model assumptions, minimizing the sum of squared residuals gives the best linear unbiased estimates of the model parameters (Gauss-Markov theorem); the closed-form estimators are sketched below
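
The minimization has a closed-form solution: β₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and β₀ = ȳ − β₁x̄. A short sketch implementing these formulas on made-up data:

```python
import numpy as np

def least_squares(x, y):
    """Closed-form OLS estimates for simple linear regression."""
    x_bar, y_bar = x.mean(), y.mean()
    # beta1 = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # beta0 = y_bar - beta1 * x_bar
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

x = np.array([1.0, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
beta0, beta1 = least_squares(x, y)
print(f"fitted line: y = {beta0:.3f} + {beta1:.3f} x")
```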

Assessing model fit

  • Evaluating how well the linear regression model fits the observed data
  • Determines the proportion of variability in the response variable that is explained by the explanatory variable

Coefficient of determination (R-squared)

  • Measures the proportion of variability in the response variable that is explained by the linear regression model
  • Ranges from 0 to 1, with higher values indicating a better fit of the model to the data
  • Calculated as the ratio of the explained variation to the total variation: R-squared = 1 - SS_res / SS_tot, where SS_res is the residual sum of squares and SS_tot the total sum of squares (a computation sketch follows this list)
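
A minimal computation sketch; the observed and predicted values are made up:

```python
import numpy as np

y     = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # observed responses
y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # values predicted by the fitted line

ss_res = np.sum((y - y_hat) ** 2)       # residual (unexplained) sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot
print(f"R-squared = {r2:.4f}")
```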

Interpretation of R-squared

  • An R-squared value of 0 indicates that the linear model does not explain any of the variability in the response variable
  • An R-squared value of 1 indicates that the linear model perfectly explains all the variability in the response variable
  • Example: An R-squared value of 0.75 means that 75% of the variability in the response variable is explained by the linear model

Correlation coefficient

  • Measures the strength and direction of the linear relationship between two continuous variables
  • Ranges from -1 to 1, with values closer to -1 or 1 indicating a stronger linear relationship

Pearson correlation coefficient

  • Most commonly used correlation coefficient for simple linear regression
  • Measures the strength and direction of the linear relationship between the explanatory and response variables
  • Calculated using the covariance of the two variables divided by the product of their standard deviations; in simple linear regression, the square of r equals R-squared (a computation sketch follows this list)
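
A sketch of the manual calculation alongside NumPy's built-in, on made-up data:

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# r = cov(X, Y) / (sd(X) * sd(Y))
r_manual = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
r_builtin = np.corrcoef(x, y)[0, 1]    # same value via NumPy's correlation matrix
print(f"r (manual)  = {r_manual:.4f}")
print(f"r (builtin) = {r_builtin:.4f}")
```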

Interpretation of correlation

  • A correlation coefficient of 1 indicates a perfect positive linear relationship
  • A correlation coefficient of -1 indicates a perfect negative linear relationship
  • A correlation coefficient of 0 indicates no linear relationship between the variables
  • Example: A correlation coefficient of 0.8 suggests a strong positive linear relationship between the variables

Hypothesis tests

  • Statistical procedures used to test the significance of the relationship between the explanatory and response variables
  • Determine whether the observed relationship is likely to have occurred by chance or if it represents a true relationship in the population

Significance of slope

  • Tests the null hypothesis that the slope of the regression line is equal to zero (no linear relationship)
  • If the p-value is less than the chosen significance level (e.g., 0.05), the null hypothesis is rejected, indicating a statistically significant linear relationship (a worked sketch follows this list)
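
A worked sketch of the t-test for the slope, using SE(β₁) = s / √Σ(xᵢ − x̄)² with s² = SSE / (n − 2); the data are made up:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.3, 3.1, 4.8, 4.9, 6.2, 7.1, 7.8, 9.0])

n = len(x)
x_bar = x.mean()
beta1 = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
beta0 = y.mean() - beta1 * x_bar
resid = y - (beta0 + beta1 * x)

s2 = np.sum(resid ** 2) / (n - 2)                  # estimated error variance
se_beta1 = np.sqrt(s2 / np.sum((x - x_bar) ** 2))  # standard error of the slope
t_stat = beta1 / se_beta1                          # test statistic for H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)    # two-sided p-value
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```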

Confidence intervals

  • Provide a range of plausible values for the slope and y-intercept of the regression line
  • Indicate the precision and uncertainty associated with the estimated model parameters
  • Example: A 95% confidence interval for the slope of (0.5, 1.2) means that intervals constructed this way would contain the true slope in about 95% of repeated samples (a computation sketch follows this list)
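
A sketch that builds a 95% interval as β₁ ± t* · SE(β₁), taking the slope and its standard error from scipy.stats.linregress; the data are made up:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.3, 3.1, 4.8, 4.9, 6.2, 7.1, 7.8, 9.0])

fit = stats.linregress(x, y)                # slope and its standard error
t_crit = stats.t.ppf(0.975, df=len(x) - 2)  # two-sided 95% critical value
lo = fit.slope - t_crit * fit.stderr
hi = fit.slope + t_crit * fit.stderr
print(f"95% CI for the slope: ({lo:.3f}, {hi:.3f})")
```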

Checking model assumptions

  • Verifying that the assumptions underlying simple linear regression are met to ensure the validity of the model and its inferences
  • Violations of assumptions can lead to biased or unreliable results

Linearity

  • The relationship between the explanatory and response variables should be linear
  • Scatterplots can be used to visually assess linearity
  • Residual plots (residuals vs. explanatory variable) can also help detect non-linearity

Independence of errors

  • The errors (residuals) should be independent of each other
  • Violations can occur when data points are collected over time or have a spatial relationship
  • The Durbin-Watson test can be used to assess the independence of errors (a computation sketch follows this list)
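
The statistic is DW = Σ(eₜ − eₜ₋₁)² / Σ eₜ²; values near 2 suggest independence, while values near 0 or 4 suggest positive or negative autocorrelation. A small sketch on simulated residuals:

```python
import numpy as np

def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    diff = np.diff(residuals)
    return np.sum(diff ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(2)
e_indep = rng.normal(size=200)        # independent errors
e_corr = np.cumsum(e_indep) / 10      # random walk: strong positive autocorrelation
print(f"independent:    DW = {durbin_watson(e_indep):.2f}")  # near 2
print(f"autocorrelated: DW = {durbin_watson(e_corr):.2f}")   # near 0
```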

Constant variance of errors

  • The variability of the errors should be constant across all levels of the explanatory variable (homoscedasticity)
  • Non-constant variance (heteroscedasticity) can be detected using residual plots (residuals vs. fitted values)

Normality of errors

  • The errors should be normally distributed with a mean of zero
  • Normal probability plots or histograms of the residuals can be used to assess normality
  • Shapiro-Wilk or Kolmogorov-Smirnov tests can formally test for normality of the errors (a sketch follows this list)
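
A minimal sketch using SciPy's Shapiro-Wilk test on simulated residuals (stand-ins for the residuals of a fitted model):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(0, 1, size=100)   # simulated, so normality should hold

w, p = stats.shapiro(residuals)          # H0: the data are normally distributed
print(f"Shapiro-Wilk: W = {w:.3f}, p = {p:.3f}")
# A p-value above the significance level (e.g., 0.05) gives no evidence
# against normality of the errors.
```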

Outliers and influential points

  • Observations that deviate substantially from the overall pattern of the data or have a disproportionate impact on the regression model
  • Can affect the estimates of the model parameters and the goodness of fit

Identifying outliers

  • Outliers can be identified using scatterplots or residual plots
  • Points that are far from the majority of the data or have large residuals may be considered outliers
  • Example: In a scatterplot of height and weight, a data point with a height of 200 cm and a weight of 50 kg would be an outlier

Leverage and influence

  • Leverage measures the distance of an observation from the mean of the explanatory variable
  • High leverage points can have a strong influence on the regression line
  • Influence measures the impact of an observation on the model parameters or fitted values
  • Cook's distance is a measure that combines leverage and residual size to assess the overall influence of an observation (a computation sketch follows this list)
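
For simple linear regression, the leverage is hᵢ = 1/n + (xᵢ − x̄)² / Σ(xⱼ − x̄)², and Cook's distance is Dᵢ = eᵢ² / (p·s²) · hᵢ / (1 − hᵢ)² with p = 2 estimated parameters. A sketch on made-up data whose last point has both high leverage and a large residual:

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 20])                   # last x far from mean
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8, 12.2, 13.9, 15.0])   # and off the trend

n, p = len(x), 2                             # p = number of parameters (beta0, beta1)
x_bar = x.mean()
beta1 = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
beta0 = y.mean() - beta1 * x_bar
e = y - (beta0 + beta1 * x)                  # residuals
s2 = np.sum(e ** 2) / (n - p)                # mean squared error

h = 1 / n + (x - x_bar) ** 2 / np.sum((x - x_bar) ** 2)   # leverage
cooks_d = (e ** 2 / (p * s2)) * h / (1 - h) ** 2          # Cook's distance
print(np.round(cooks_d, 3))                  # the last value dominates
```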

Predictions using regression model

  • Using the estimated regression equation to predict the value of the response variable for a given value of the explanatory variable
  • Allows for interpolation and extrapolation based on the observed data

Interpolation vs extrapolation

  • Interpolation involves making predictions within the range of the observed explanatory variable values
  • Extrapolation involves making predictions beyond the range of the observed explanatory variable values
  • Extrapolation carries more uncertainty and should be done with caution (see the sketch below)
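
A toy sketch with hypothetical fitted coefficients, contrasting a prediction inside the observed range with one far outside it:

```python
# Hypothetical coefficients from a line fitted to data observed for x in [1, 8]
beta0, beta1 = 1.2, 0.95

def predict(x_new):
    return beta0 + beta1 * x_new

print(predict(4.5))    # interpolation: x lies inside the observed range [1, 8]
print(predict(25.0))   # extrapolation: x lies far outside the range; use caution
```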

Limitations of simple linear regression

  • Assumes a linear relationship between the explanatory and response variables, which may not always be appropriate
  • Does not account for the influence of other variables that may affect the response variable
  • Sensitive to outliers and influential points, which can distort the model estimates
  • Limited to modeling the relationship between two continuous variables
  • Causal inferences cannot be made solely based on the regression results, as correlation does not imply causation