📊 Probabilistic Decision-Making Unit 8 – Linear Regression: Simple & Multiple
Linear regression is a powerful statistical tool for modeling relationships between variables. It estimates how one or more independent variables influence a dependent variable, allowing us to predict outcomes and understand the strength of connections between factors.
This method forms the basis for more advanced statistical techniques. By mastering linear regression, we gain insights into data relationships, make informed predictions, and develop a foundation for complex modeling in various fields like economics, finance, and healthcare.
Linear regression models the relationship between a dependent variable and one (simple) or more (multiple) independent variables
Estimates the parameters of the linear equation that best fits the data using the least squares method
Assumes a linear relationship exists between the dependent and independent variables
Requires meeting assumptions such as linearity, independence, homoscedasticity, and normality of residuals
Provides insights into the strength and direction of the relationship between variables
Enables prediction of the dependent variable based on the values of the independent variable(s) (see the sketch at the end of this list)
Serves as a foundation for more advanced regression techniques and statistical modeling
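As a small end-to-end illustration of the fit-and-predict workflow summarized above, here is a minimal sketch in Python with NumPy; the library choice and the data are illustrative assumptions, not part of the unit:

```python
import numpy as np

# Made-up data for illustration: one predictor, one response
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(x, y, 1)

# Predict the dependent variable for a new value of x
x_new = 6.0
y_hat = intercept + slope * x_new
print(f"y = {intercept:.2f} + {slope:.2f}x; prediction at x = 6: {y_hat:.2f}")
```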
Mathematical Foundations
Linear regression is based on the linear equation y=β0+β1x+ϵ, where y is the dependent variable, x is the independent variable, β0 is the y-intercept, β1 is the slope, and ϵ is the error term
The least squares method minimizes the sum of squared residuals (differences between observed and predicted values) to estimate the parameters β0 and β1
The normal equations, derived from the least squares method, give closed-form formulas for the parameter estimates (implemented in the sketch after this list):
β̂1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)², with both sums running over i = 1, …, n
β̂0 = ȳ − β̂1x̄
The coefficient of determination, denoted as R2, measures the proportion of variance in the dependent variable explained by the independent variable(s)
Hypothesis testing and confidence intervals are used to assess the statistical significance of the estimated parameters and make inferences about the population
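The normal equations translate directly into code. Below is a minimal sketch, assuming Python with NumPy and made-up sample data, that computes β̂1, β̂0, and R² exactly as defined in this section:

```python
import numpy as np

# Made-up sample data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()

# Normal-equation estimates for simple linear regression
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

# Coefficient of determination: share of total variance explained
y_hat = beta0 + beta1 * x
ss_res = np.sum((y - y_hat) ** 2)   # residual sum of squares
ss_tot = np.sum((y - y_bar) ** 2)   # total sum of squares
r_squared = 1 - ss_res / ss_tot

print(f"beta0 = {beta0:.3f}, beta1 = {beta1:.3f}, R^2 = {r_squared:.3f}")
```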
Simple Linear Regression
Simple linear regression involves one independent variable and one dependent variable
The goal is to find the line of best fit that minimizes the sum of squared residuals
The slope β1 represents the expected change in the dependent variable for a one-unit increase in the independent variable
The y-intercept β0 represents the value of the dependent variable when the independent variable is zero
The correlation coefficient r measures the strength and direction of the linear relationship between the variables
Hypothesis tests (t-tests) are used to determine the statistical significance of the estimated parameters
Confidence intervals provide a range of plausible values for the population parameters (both the t-test and a slope interval are computed in the sketch below)
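One way to carry out the estimation, t-test, and confidence interval described above is SciPy's stats.linregress; the library choice and the data are assumptions for illustration, since the unit itself does not name a tool:

```python
import numpy as np
from scipy import stats

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 3.1, 4.8, 4.9, 6.2, 7.1, 7.9, 9.2])

res = stats.linregress(x, y)  # least-squares fit with inference
print(f"slope = {res.slope:.3f}, intercept = {res.intercept:.3f}")
print(f"r = {res.rvalue:.3f}, p-value for H0: slope = 0 -> {res.pvalue:.4f}")

# 95% confidence interval for the slope: estimate ± t* × SE,
# using the t distribution with n - 2 degrees of freedom
n = len(x)
t_crit = stats.t.ppf(0.975, df=n - 2)
lo = res.slope - t_crit * res.stderr
hi = res.slope + t_crit * res.stderr
print(f"95% CI for slope: ({lo:.3f}, {hi:.3f})")
```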
Multiple Linear Regression
Multiple linear regression extends simple linear regression by including two or more independent variables
The multiple linear regression equation is y=β0+β1x1+β2x2+...+βkxk+ϵ, where k is the number of independent variables
Each βi represents the change in the dependent variable for a one-unit increase in the corresponding independent variable, holding other variables constant
The adjusted R2 is used to compare models with different numbers of independent variables, as it accounts for the complexity of the model
Partial regression coefficients represent the effect of each independent variable on the dependent variable, controlling for the other variables in the model
Multicollinearity, which occurs when independent variables are highly correlated, inflates the variance of the coefficient estimates and makes their interpretation unstable; variance inflation factors (VIFs) are a standard check (see the sketch after this list)
Stepwise regression methods (forward, backward, or mixed) can be used for variable selection in multiple linear regression
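A minimal sketch of a two-predictor fit, assuming Python with statsmodels and synthetic data; the VIF loop at the end is one common multicollinearity check:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)

# Synthetic data: two predictors plus noise
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n)

X = sm.add_constant(np.column_stack([x1, x2]))  # design matrix with intercept
model = sm.OLS(y, X).fit()
print(model.params)        # estimates of beta0, beta1, beta2
print(model.rsquared_adj)  # adjusted R^2

# Variance inflation factors for the non-constant columns;
# VIF > 10 is a common (rough) multicollinearity flag
for i in range(1, X.shape[1]):
    print(f"VIF x{i}: {variance_inflation_factor(X, i):.2f}")
```

Each fitted coefficient here is a partial regression coefficient: its value reflects the effect of one predictor with the other held fixed.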
Model Assumptions
Linear regression relies on several assumptions to ensure the validity and reliability of the results:
Linearity: The relationship between the dependent and independent variables is linear
Independence: The observations are independent of each other (no autocorrelation)
Homoscedasticity: The variance of the residuals is constant across all levels of the independent variable(s)
Normality: The residuals follow a normal distribution with a mean of zero
Violations of these assumptions can lead to biased or inefficient parameter estimates and affect the validity of hypothesis tests and confidence intervals
Diagnostic plots (residual plots, Q-Q plots) and statistical tests (Durbin-Watson, Breusch-Pagan, Shapiro-Wilk) can be used to assess these assumptions (see the sketch after this list)
Remedial measures, such as data transformations or robust regression methods, can be applied when assumptions are violated
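The three named tests are available in statsmodels and SciPy. A minimal sketch, assuming those libraries and synthetic data constructed to satisfy the assumptions:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=80)
y = 3.0 + 1.5 * x + rng.normal(scale=1.0, size=80)

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid

# Durbin-Watson: values near 2 suggest no first-order autocorrelation
print(f"Durbin-Watson: {durbin_watson(resid):.2f}")

# Breusch-Pagan: a small p-value suggests heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.3f}")

# Shapiro-Wilk: a small p-value suggests non-normal residuals
w_stat, sw_pvalue = stats.shapiro(resid)
print(f"Shapiro-Wilk p-value: {sw_pvalue:.3f}")
```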
Model Evaluation
The goodness-of-fit of a linear regression model can be assessed using various metrics and techniques:
Coefficient of determination (R2): Measures the proportion of variance in the dependent variable explained by the independent variable(s)
Adjusted R2: Adjusts the R2 for the number of independent variables in the model, penalizing complexity
F-test: Tests the overall significance of the regression model by comparing the explained variance to the unexplained variance
t-tests: Assess the statistical significance of individual regression coefficients
Residual standard error (RSE): Estimates the standard deviation of the residuals, i.e., the typical deviation of observed values from the fitted values
Cross-validation techniques, such as k-fold cross-validation or leave-one-out cross-validation, can be used to assess the model's performance on unseen data and prevent overfitting
The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) are used to compare and select among competing models, trading off fit against complexity (see the sketch below)
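A minimal sketch of AIC/BIC model comparison and k-fold cross-validation, assuming statsmodels and scikit-learn and synthetic data in which the second predictor is pure noise, so the smaller model should tend to win on BIC:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 120
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)        # irrelevant predictor
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.8, size=n)

# Compare a one-predictor and a two-predictor model by adjusted R^2, AIC, BIC
for cols in ([x1], [x1, x2]):
    X = sm.add_constant(np.column_stack(cols))
    fit = sm.OLS(y, X).fit()
    print(f"k={len(cols)}: adj R^2={fit.rsquared_adj:.3f}, "
          f"AIC={fit.aic:.1f}, BIC={fit.bic:.1f}")

# 5-fold cross-validated R^2 estimates out-of-sample performance
X_full = np.column_stack([x1, x2])
scores = cross_val_score(LinearRegression(), X_full, y, cv=5, scoring="r2")
print(f"CV R^2: mean={scores.mean():.3f} +/- {scores.std():.3f}")
```

Lower AIC/BIC values indicate a better fit-complexity trade-off, while the cross-validated R² guards against reading too much into in-sample fit.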
Applications in Decision-Making
Linear regression is widely used in various fields to support decision-making processes: