Simple linear regression models the relationship between two continuous variables, assuming a linear connection. It is a fundamental tool for predicting a response and for quantifying how changes in one variable are associated with changes in another.
This technique uses an explanatory variable to predict a response variable. By fitting a line of best fit, estimating its slope and y-intercept, and assessing how well the model fits, we can understand and quantify the relationship between the variables.
Simple linear regression
- Fundamental statistical technique used to model and analyze the relationship between two continuous variables
- Assumes a linear relationship exists between the explanatory variable (X) and the response variable (Y)
- Enables predictions and inferences about the response variable based on the values of the explanatory variable
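- In equation form: Y = β0 + β1·X + ε, where ε is a random error term (using the β0 and β1 notation defined in the slope and y-intercept section below)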
Relationship between variables
- Examines how changes in one variable are associated with changes in another variable
- Determines the strength and direction of the relationship between the explanatory and response variables
Explanatory vs response variables
- Explanatory variable (X) is the independent variable that is used to explain or predict changes in the response variable
- Response variable (Y) is the dependent variable that is being explained or predicted by the explanatory variable
- Example: In a study of the relationship between study time (X) and exam scores (Y), study time is the explanatory variable, and exam scores are the response variable
Scatterplots
- Graphical representation of the relationship between two continuous variables
- Each data point represents a pair of values for the explanatory and response variables
- Scatterplots help visualize the strength, direction, and shape of the relationship between variables
- Example: A scatterplot of height (X) and weight (Y) data points can reveal a positive linear relationship
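A minimal sketch of such a scatterplot in Python (matplotlib), using small made-up height/weight values purely for illustration:

```python
import matplotlib.pyplot as plt

# Hypothetical height (cm) and weight (kg) data, for illustration only
height = [150, 155, 160, 165, 170, 175, 180, 185]
weight = [52, 57, 60, 63, 68, 72, 77, 80]

plt.scatter(height, weight)            # each point is one (X, Y) pair
plt.xlabel("Height (cm)")              # explanatory variable on the x-axis
plt.ylabel("Weight (kg)")              # response variable on the y-axis
plt.title("Scatterplot of height vs. weight")
plt.show()
```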
Line of best fit
- Represents the linear model that best describes the relationship between the explanatory and response variables
- Determined by finding the line that minimizes the sum of squared residuals (differences between observed and predicted values)
Slope and y-intercept
- Slope (β1) represents the predicted change in the response variable (Y) for a one-unit increase in the explanatory variable (X)
- The sign of the slope indicates the direction of the linear relationship (positive or negative); its magnitude depends on the units of X and Y
- Y-intercept (β0) represents the predicted value of the response variable when the explanatory variable is zero
- Provides a reference point for the line of best fit
Residuals and errors
- Residuals are the differences between the observed values of the response variable and the predicted values from the regression line
- Errors are the unobserved deviations from the true population line; they represent the variability in the response variable that is not accounted for by the linear model, and residuals serve as their estimates
- Smaller residuals and errors indicate a better fit of the model to the data
Least squares method
- Statistical method used to estimate the parameters (slope and y-intercept) of the line of best fit
- Minimizes the sum of squared residuals to find the optimal line that best fits the data
Minimizing sum of squared residuals
- The line of best fit is chosen by finding the values of the slope and y-intercept that minimize the sum of squared residuals
- Squaring the residuals ensures that positive and negative residuals do not cancel each other out
- Under the standard regression assumptions, minimizing the sum of squared residuals yields the best linear unbiased estimates of the model parameters (Gauss–Markov theorem)
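A minimal sketch of the least squares calculation in Python (NumPy), using the closed-form formulas β1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and β0 = ȳ − β1·x̄; the data values are made up purely for illustration:

```python
import numpy as np

# Hypothetical data (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # explanatory variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # response variable

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least squares estimates of slope and intercept
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar              # line passes through (x_bar, y_bar)

y_hat = beta0 + beta1 * x                  # fitted (predicted) values
residuals = y - y_hat                      # observed minus predicted
sse = np.sum(residuals ** 2)               # sum of squared residuals being minimized

print(f"slope = {beta1:.3f}, intercept = {beta0:.3f}, SSE = {sse:.3f}")
```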
Assessing model fit
- Evaluating how well the linear regression model fits the observed data
- Determines the proportion of variability in the response variable that is explained by the explanatory variable
Coefficient of determination (R-squared)
- Measures the proportion of variability in the response variable that is explained by the linear regression model
- Ranges from 0 to 1, with higher values indicating a better fit of the model to the data
- Calculated as the ratio of the explained variance to the total variance
Interpretation of R-squared
- An R-squared value of 0 indicates that the linear model does not explain any of the variability in the response variable
- An R-squared value of 1 indicates that the linear model perfectly explains all the variability in the response variable
- Example: An R-squared value of 0.75 means that 75% of the variability in the response variable is explained by the linear model
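A minimal sketch of the R-squared calculation in Python (NumPy), continuing the hypothetical data and closed-form fit from the least squares sketch above:

```python
import numpy as np

# Hypothetical data and a fitted line (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

ss_res = np.sum((y - y_hat) ** 2)       # residual (unexplained) sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
r_squared = 1 - ss_res / ss_tot         # proportion of variability explained

print(f"R-squared = {r_squared:.3f}")
```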
Correlation coefficient
- Measures the strength and direction of the linear relationship between two continuous variables
- Ranges from -1 to 1, with values closer to -1 or 1 indicating a stronger linear relationship
Pearson correlation coefficient
- Most commonly used correlation coefficient for simple linear regression
- Measures the strength and direction of the linear relationship between the explanatory and response variables
- Calculated using the covariance of the two variables divided by the product of their standard deviations
Interpretation of correlation
- A correlation coefficient of 1 indicates a perfect positive linear relationship
- A correlation coefficient of -1 indicates a perfect negative linear relationship
- A correlation coefficient of 0 indicates no linear relationship between the variables
- Example: A correlation coefficient of 0.8 suggests a strong positive linear relationship between the variables
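A minimal sketch of the Pearson correlation calculation in Python (NumPy), using the same hypothetical data as above; note that in simple linear regression R-squared equals the square of the Pearson correlation coefficient:

```python
import numpy as np

# Hypothetical data (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson r: covariance of X and Y divided by the product of their standard deviations
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r = cov_xy / (x.std() * y.std())

print(f"r = {r:.3f}")        # sign gives direction, magnitude gives strength
print(f"r^2 = {r**2:.3f}")   # equals R-squared for simple linear regression
```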
Hypothesis tests
- Statistical procedures used to test the significance of the relationship between the explanatory and response variables
- Determine whether the observed relationship is likely to have occurred by chance or if it represents a true relationship in the population
Significance of slope
- Tests the null hypothesis that the slope of the regression line is equal to zero (no linear relationship)
- If the p-value is less than the chosen significance level (e.g., 0.05), the null hypothesis is rejected, indicating a significant linear relationship
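A minimal sketch of this test in Python using scipy.stats.linregress, which fits the line and reports the two-sided p-value for the null hypothesis that the slope is zero; the data values are made up purely for illustration:

```python
from scipy import stats

# Hypothetical data (illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

result = stats.linregress(x, y)    # fits the line and tests H0: slope = 0
print(f"slope = {result.slope:.3f}, p-value = {result.pvalue:.4f}")

# Reject H0 (no linear relationship) if the p-value is below the significance level
alpha = 0.05
if result.pvalue < alpha:
    print("Significant linear relationship at the 5% level")
else:
    print("No significant linear relationship at the 5% level")
```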
Confidence intervals
- Provide a range of plausible values for the slope and y-intercept of the regression line
- Indicate the precision and uncertainty associated with the estimated model parameters
- Example: A 95% confidence interval for the slope of (0.5, 1.2) suggests that the true slope is likely to fall within this range with 95% confidence
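A minimal sketch of obtaining these confidence intervals in Python with statsmodels, again on made-up data used purely for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

X = sm.add_constant(x)              # adds the intercept column
model = sm.OLS(y, X).fit()

# 95% confidence intervals; first row is the intercept, second row is the slope
print(model.conf_int(alpha=0.05))
```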
Checking model assumptions
- Verifying that the assumptions underlying simple linear regression are met to ensure the validity of the model and its inferences
- Violations of assumptions can lead to biased or unreliable results
Linearity
- The relationship between the explanatory and response variables should be linear
- Scatterplots can be used to visually assess linearity
- Residual plots (residuals vs. explanatory variable) can also help detect non-linearity
Independence of errors
- The errors (residuals) should be independent of each other
- Violations can occur when data points are collected over time or have a spatial relationship
- Durbin-Watson test can be used to assess the independence of errors
Constant variance of errors
- The variability of the errors should be constant across all levels of the explanatory variable (homoscedasticity)
- Non-constant variance (heteroscedasticity) can be detected using residual plots (residuals vs. fitted values)
Normality of errors
- The errors should be normally distributed with a mean of zero
- Normal probability plots or histograms of the residuals can be used to assess normality
- Shapiro-Wilk or Kolmogorov-Smirnov tests can formally test for normality of errors
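A minimal sketch of these diagnostic checks (independence, normality, and a residuals-vs-fitted plot for linearity and constant variance) in Python with statsmodels and scipy, on made-up data used purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from scipy import stats

# Hypothetical data (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])

model = sm.OLS(y, sm.add_constant(x)).fit()
residuals = model.resid
fitted = model.fittedvalues

# Independence of errors: Durbin-Watson statistic (values near 2 suggest no autocorrelation)
print(f"Durbin-Watson: {durbin_watson(residuals):.2f}")

# Normality of errors: Shapiro-Wilk test (small p-value suggests non-normal residuals)
stat, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p:.3f}")

# Linearity and constant variance: plot residuals against fitted values and look
# for curvature (non-linearity) or a funnel shape (heteroscedasticity)
plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```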
Outliers and influential points
- Observations that deviate substantially from the overall pattern of the data or have a disproportionate impact on the regression model
- Can affect the estimates of the model parameters and the goodness of fit
Identifying outliers
- Outliers can be identified using scatterplots or residual plots
- Points that are far from the majority of the data or have large residuals may be considered outliers
- Example: In a scatterplot of height and weight, a data point with a height of 200 cm and a weight of 50 kg would be an outlier
Leverage and influence
- Leverage measures the distance of an observation from the mean of the explanatory variable
- High leverage points can have a strong influence on the regression line
- Influence measures the impact of an observation on the model parameters or fitted values
- Cook's distance is a measure that combines leverage and residuals to assess the overall influence of an observation
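A minimal sketch of computing leverage and Cook's distance in Python with statsmodels, on made-up data where the last point is deliberately unusual, purely for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data with one unusual point at the end (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 10.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 9.0])

model = sm.OLS(y, sm.add_constant(x)).fit()
influence = model.get_influence()

leverage = influence.hat_matrix_diag      # how far each x value is from the mean of x
cooks_d, _ = influence.cooks_distance     # combines leverage and residual size

for i, (h, d) in enumerate(zip(leverage, cooks_d)):
    print(f"obs {i}: leverage = {h:.2f}, Cook's distance = {d:.2f}")
```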
Predictions using regression model
- Using the estimated regression equation to predict the value of the response variable for a given value of the explanatory variable
- Allows for interpolation and extrapolation based on the observed data
- Interpolation involves making predictions within the range of the observed explanatory variable values
- Extrapolation involves making predictions beyond the range of the observed explanatory variable values
- Extrapolation carries more uncertainty and should be done with caution
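A minimal sketch of making predictions in Python with statsmodels, on made-up data observed for x between 1 and 5, so that one new value is an interpolation and the other an extrapolation; the wider interval for the extrapolated point reflects its extra uncertainty:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data (illustration only): x observed between 1 and 5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

model = sm.OLS(y, sm.add_constant(x)).fit()

x_new = np.array([3.5, 8.0])      # 3.5 = interpolation, 8.0 = extrapolation
X_new = sm.add_constant(x_new, has_constant="add")
pred = model.get_prediction(X_new)
print(pred.summary_frame(alpha=0.05))   # predictions with 95% intervals
```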
Limitations of simple linear regression
- Assumes a linear relationship between the explanatory and response variables, which may not always be appropriate
- Does not account for the influence of other variables that may affect the response variable
- Sensitive to outliers and influential points, which can distort the model estimates
- Limited to modeling the relationship between two continuous variables
- Causal inferences cannot be made solely based on the regression results, as correlation does not imply causation