Probability and Statistics

Simple linear regression models the relationship between two continuous variables under the assumption that the relationship is linear. It is a fundamental tool for predicting a response variable and for quantifying how changes in one variable are associated with changes in another.

This technique uses an explanatory variable to predict a response variable. By fitting a line of best fit, calculating slope and y-intercept, and assessing model fit, we can understand and quantify the relationship between variables.

Simple linear regression

  • Fundamental statistical technique used to model and analyze the relationship between two continuous variables
  • Assumes a linear relationship exists between the explanatory variable (X) and the response variable (Y)
  • Enables predictions and inferences about the response variable based on the values of the explanatory variable (a minimal fitting sketch follows this list)
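
As a concrete illustration, here is a minimal fitting sketch in Python using SciPy's stats.linregress; the study-time and exam-score values are made up for demonstration.

```python
import numpy as np
from scipy import stats

# Hypothetical data: hours studied (X) and exam score (Y)
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
score = np.array([52, 55, 61, 64, 70, 72, 75, 82])

# Fit Y = beta0 + beta1 * X by ordinary least squares
fit = stats.linregress(hours, score)
print(f"slope (beta1):     {fit.slope:.3f}")      # change in score per extra hour
print(f"intercept (beta0): {fit.intercept:.3f}")  # predicted score at zero hours
print(f"R-squared:         {fit.rvalue ** 2:.3f}")
```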

Relationship between variables

  • Examines how changes in one variable are associated with changes in another variable
  • Determines the strength and direction of the relationship between the explanatory and response variables

Explanatory vs response variables

  • Explanatory variable (X) is the independent variable that is used to explain or predict changes in the response variable
  • Response variable (Y) is the dependent variable that is being explained or predicted by the explanatory variable
  • Example: In a study of the relationship between study time (X) and exam scores (Y), study time is the explanatory variable, and exam scores are the response variable

Scatterplots

  • Graphical representation of the relationship between two continuous variables
  • Each data point represents a pair of values for the explanatory and response variables
  • Scatterplots help visualize the strength, direction, and shape of the relationship between variables
  • Example: A scatterplot of height (X) and weight (Y) data points can reveal a positive linear relationship (a plotting sketch follows this list)
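
A minimal plotting sketch, assuming synthetic height and weight data generated to follow a roughly linear trend:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: weight is roughly linear in height, plus noise
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=50)                    # cm
weight = 0.9 * height - 90 + rng.normal(0, 5, size=50)   # kg

plt.scatter(height, weight)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Height vs. weight")
plt.show()
```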

Line of best fit

  • Represents the linear model that best describes the relationship between the explanatory and response variables
  • Determined by finding the line that minimizes the sum of squared residuals (differences between observed and predicted values)

Slope and y-intercept

  • Slope (β₁) represents the change in the response variable (Y) for a one-unit increase in the explanatory variable (X)
    • Its sign gives the direction of the relationship; its magnitude is the rate of change in Y per unit of X
  • Y-intercept (β₀) represents the predicted value of the response variable when the explanatory variable is zero
    • Provides a reference point for the line of best fit, though it may not be meaningful when X = 0 lies outside the observed data

Residuals and errors

  • Residuals are the differences between the observed values of the response variable and the predicted values from the regression line
  • Errors are the deviations of the observations from the true (unknown) population regression line; residuals are their sample counterparts, computed from the fitted line
  • Smaller residuals indicate a better fit of the model to the data

Least squares method

  • Statistical method used to estimate the parameters (slope and y-intercept) of the line of best fit
  • Minimizes the sum of squared residuals to find the optimal line that best fits the data

Minimizing sum of squared residuals

  • The line of best fit is chosen by finding the values of the slope and y-intercept that minimize the sum of squared residuals
  • Squaring the residuals ensures that positive and negative residuals do not cancel each other out
  • Under the standard model assumptions, minimizing the sum of squared residuals gives the best linear unbiased estimates of the model parameters (Gauss-Markov theorem); the closed-form estimators are sketched below
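
The minimization has a closed-form solution: β₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and β₀ = ȳ − β₁x̄. A short sketch implementing these formulas on made-up data:

```python
import numpy as np

def least_squares(x, y):
    """Closed-form OLS estimates for simple linear regression."""
    x_bar, y_bar = x.mean(), y.mean()
    # beta1 = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # beta0 = y_bar - beta1 * x_bar
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

x = np.array([1.0, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
beta0, beta1 = least_squares(x, y)
print(f"fitted line: y = {beta0:.3f} + {beta1:.3f} x")
```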

Assessing model fit

  • Evaluating how well the linear regression model fits the observed data
  • Determines the proportion of variability in the response variable that is explained by the explanatory variable

Coefficient of determination (R-squared)

  • Measures the proportion of variability in the response variable that is explained by the linear regression model
  • Ranges from 0 to 1, with higher values indicating a better fit of the model to the data
  • Calculated as the ratio of the explained variation to the total variation: R-squared = 1 - SS_res / SS_tot, where SS_res is the residual sum of squares and SS_tot the total sum of squares (a computation sketch follows this list)
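
A minimal computation sketch; the observed and predicted values are made up:

```python
import numpy as np

y     = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # observed responses
y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # values predicted by the fitted line

ss_res = np.sum((y - y_hat) ** 2)       # residual (unexplained) sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot
print(f"R-squared = {r2:.4f}")
```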

Interpretation of R-squared

  • An R-squared value of 0 indicates that the linear model does not explain any of the variability in the response variable
  • An R-squared value of 1 indicates that the linear model perfectly explains all the variability in the response variable
  • Example: An R-squared value of 0.75 means that 75% of the variability in the response variable is explained by the linear model

Correlation coefficient

  • Measures the strength and direction of the linear relationship between two continuous variables
  • Ranges from -1 to 1, with values closer to -1 or 1 indicating a stronger linear relationship

Pearson correlation coefficient

  • Most commonly used correlation coefficient for simple linear regression
  • Measures the strength and direction of the linear relationship between the explanatory and response variables
  • Calculated using the covariance of the two variables divided by the product of their standard deviations; in simple linear regression, the square of r equals R-squared (a computation sketch follows this list)
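
A sketch of the manual calculation alongside NumPy's built-in, on made-up data:

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# r = cov(X, Y) / (sd(X) * sd(Y))
r_manual = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
r_builtin = np.corrcoef(x, y)[0, 1]    # same value via NumPy's correlation matrix
print(f"r (manual)  = {r_manual:.4f}")
print(f"r (builtin) = {r_builtin:.4f}")
```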

Interpretation of correlation

  • A correlation coefficient of 1 indicates a perfect positive linear relationship
  • A correlation coefficient of -1 indicates a perfect negative linear relationship
  • A correlation coefficient of 0 indicates no linear relationship between the variables
  • Example: A correlation coefficient of 0.8 suggests a strong positive linear relationship between the variables

Hypothesis tests

  • Statistical procedures used to test the significance of the relationship between the explanatory and response variables
  • Determine whether the observed relationship is likely to have occurred by chance or if it represents a true relationship in the population

Significance of slope

  • Tests the null hypothesis that the slope of the regression line is equal to zero (no linear relationship)
  • If the p-value is less than the chosen significance level (e.g., 0.05), the null hypothesis is rejected, indicating a statistically significant linear relationship (a worked sketch follows this list)
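
A worked sketch of the t-test for the slope, using SE(β₁) = s / √Σ(xᵢ − x̄)² with s² = SSE / (n − 2); the data are made up:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.3, 3.1, 4.8, 4.9, 6.2, 7.1, 7.8, 9.0])

n = len(x)
x_bar = x.mean()
beta1 = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
beta0 = y.mean() - beta1 * x_bar
resid = y - (beta0 + beta1 * x)

s2 = np.sum(resid ** 2) / (n - 2)                  # estimated error variance
se_beta1 = np.sqrt(s2 / np.sum((x - x_bar) ** 2))  # standard error of the slope
t_stat = beta1 / se_beta1                          # test statistic for H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)    # two-sided p-value
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```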

Confidence intervals

  • Provide a range of plausible values for the slope and y-intercept of the regression line
  • Indicate the precision and uncertainty associated with the estimated model parameters
  • Example: A 95% confidence interval for the slope of (0.5, 1.2) means that intervals constructed this way would contain the true slope in about 95% of repeated samples (a computation sketch follows this list)
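
A sketch that builds a 95% interval as β₁ ± t* · SE(β₁), taking the slope and its standard error from scipy.stats.linregress; the data are made up:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.3, 3.1, 4.8, 4.9, 6.2, 7.1, 7.8, 9.0])

fit = stats.linregress(x, y)                # slope and its standard error
t_crit = stats.t.ppf(0.975, df=len(x) - 2)  # two-sided 95% critical value
lo = fit.slope - t_crit * fit.stderr
hi = fit.slope + t_crit * fit.stderr
print(f"95% CI for the slope: ({lo:.3f}, {hi:.3f})")
```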

Checking model assumptions

  • Verifying that the assumptions underlying simple linear regression are met to ensure the validity of the model and its inferences
  • Violations of assumptions can lead to biased or unreliable results

Linearity

  • The relationship between the explanatory and response variables should be linear
  • Scatterplots can be used to visually assess linearity
  • Residual plots (residuals vs. explanatory variable) can also help detect non-linearity

Independence of errors

  • The errors (residuals) should be independent of each other
  • Violations can occur when data points are collected over time or have a spatial relationship
  • The Durbin-Watson test can be used to assess the independence of errors (a computation sketch follows this list)
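
The statistic is DW = Σ(eₜ − eₜ₋₁)² / Σ eₜ²; values near 2 suggest independence, while values near 0 or 4 suggest positive or negative autocorrelation. A small sketch on simulated residuals:

```python
import numpy as np

def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    diff = np.diff(residuals)
    return np.sum(diff ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(2)
e_indep = rng.normal(size=200)        # independent errors
e_corr = np.cumsum(e_indep) / 10      # random walk: strong positive autocorrelation
print(f"independent:    DW = {durbin_watson(e_indep):.2f}")  # near 2
print(f"autocorrelated: DW = {durbin_watson(e_corr):.2f}")   # near 0
```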

Constant variance of errors

  • The variability of the errors should be constant across all levels of the explanatory variable (homoscedasticity)
  • Non-constant variance (heteroscedasticity) can be detected using residual plots (residuals vs. fitted values)

Normality of errors

  • The errors should be normally distributed with a mean of zero
  • Normal probability plots or histograms of the residuals can be used to assess normality
  • Shapiro-Wilk or Kolmogorov-Smirnov tests can formally test for normality of the errors (a sketch follows this list)
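
A minimal sketch using SciPy's Shapiro-Wilk test on simulated residuals (stand-ins for the residuals of a fitted model):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(0, 1, size=100)   # simulated, so normality should hold

w, p = stats.shapiro(residuals)          # H0: the data are normally distributed
print(f"Shapiro-Wilk: W = {w:.3f}, p = {p:.3f}")
# A p-value above the significance level (e.g., 0.05) gives no evidence
# against normality of the errors.
```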

Outliers and influential points

  • Observations that deviate substantially from the overall pattern of the data or have a disproportionate impact on the regression model
  • Can affect the estimates of the model parameters and the goodness of fit

Identifying outliers

  • Outliers can be identified using scatterplots or residual plots
  • Points that are far from the majority of the data or have large residuals may be considered outliers
  • Example: In a scatterplot of height and weight, a data point with a height of 200 cm and a weight of 50 kg would be an outlier

Leverage and influence

  • Leverage measures the distance of an observation from the mean of the explanatory variable
  • High leverage points can have a strong influence on the regression line
  • Influence measures the impact of an observation on the model parameters or fitted values
  • Cook's distance is a measure that combines leverage and residual size to assess the overall influence of an observation (a computation sketch follows this list)
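
For simple linear regression, the leverage is hᵢ = 1/n + (xᵢ − x̄)² / Σ(xⱼ − x̄)², and Cook's distance is Dᵢ = eᵢ² / (p·s²) · hᵢ / (1 − hᵢ)² with p = 2 estimated parameters. A sketch on made-up data whose last point has both high leverage and a large residual:

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 20])                   # last x far from mean
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8, 12.2, 13.9, 15.0])   # and off the trend

n, p = len(x), 2                             # p = number of parameters (beta0, beta1)
x_bar = x.mean()
beta1 = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
beta0 = y.mean() - beta1 * x_bar
e = y - (beta0 + beta1 * x)                  # residuals
s2 = np.sum(e ** 2) / (n - p)                # mean squared error

h = 1 / n + (x - x_bar) ** 2 / np.sum((x - x_bar) ** 2)   # leverage
cooks_d = (e ** 2 / (p * s2)) * h / (1 - h) ** 2          # Cook's distance
print(np.round(cooks_d, 3))                  # the last value dominates
```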

Predictions using regression model

  • Using the estimated regression equation to predict the value of the response variable for a given value of the explanatory variable
  • Allows for interpolation and extrapolation based on the observed data

Interpolation vs extrapolation

  • Interpolation involves making predictions within the range of the observed explanatory variable values
  • Extrapolation involves making predictions beyond the range of the observed explanatory variable values
  • Extrapolation carries more uncertainty and should be done with caution (see the sketch below)
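
A toy sketch with hypothetical fitted coefficients, contrasting a prediction inside the observed range with one far outside it:

```python
# Hypothetical coefficients from a line fitted to data observed for x in [1, 8]
beta0, beta1 = 1.2, 0.95

def predict(x_new):
    return beta0 + beta1 * x_new

print(predict(4.5))    # interpolation: x lies inside the observed range [1, 8]
print(predict(25.0))   # extrapolation: x lies far outside the range; use caution
```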

Limitations of simple linear regression

  • Assumes a linear relationship between the explanatory and response variables, which may not always be appropriate
  • Does not account for the influence of other variables that may affect the response variable
  • Sensitive to outliers and influential points, which can distort the model estimates
  • Limited to modeling the relationship between two continuous variables
  • Causal inferences cannot be made solely based on the regression results, as correlation does not imply causation