
7.5 Multiple linear regression and model selection

Last Updated on August 16, 2024

Multiple linear regression expands on simple linear regression by incorporating multiple predictors to explain a single outcome. This powerful technique allows us to model complex relationships between variables, accounting for multiple factors simultaneously.

Model selection is crucial in multiple regression, balancing complexity and interpretability. Techniques like stepwise regression, information criteria, and regularization help us choose the most appropriate model, avoiding overfitting while capturing important relationships in the data.

Multiple Regression Models

Extending Simple Linear Regression

  • Multiple linear regression incorporates two or more independent variables to predict a single dependent variable
  • General form of the multiple linear regression model (fitted in the code sketch after this list): Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε
    • Y represents dependent variable
    • X₁, X₂, ..., Xₖ represent independent variables
    • β₀ represents y-intercept
    • β₁, β₂, ..., βₖ represent regression coefficients
    • ε represents error term
  • Assumptions of multiple linear regression
    • Linearity between independent and dependent variables
    • Independence of errors
    • Homoscedasticity (constant variance of residuals)
    • Normality of residuals
    • Absence of multicollinearity among independent variables
  • Coefficient of determination (R²) measures proportion of variance in dependent variable explained by independent variables collectively
  • Adjusted R² accounts for number of predictors and penalizes addition of unnecessary variables
  • F-statistic tests overall significance of regression model
    • Evaluates whether model explains significant amount of variance in dependent variable
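The sketch below shows how a model of this form can be fit in Python with statsmodels; the data are simulated and the variable names (price, sqft, bedrooms, age) are assumptions for illustration only.

```python
# Minimal sketch: fit Y = β0 + β1*X1 + β2*X2 + β3*X3 + ε by ordinary least squares.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "sqft": rng.uniform(500, 3000, n),
    "bedrooms": rng.integers(1, 6, n),
    "age": rng.uniform(0, 50, n),
})
# Simulate an outcome that actually follows the assumed linear model plus noise
df["price"] = 50_000 + 120 * df["sqft"] + 8_000 * df["bedrooms"] - 500 * df["age"] \
              + rng.normal(0, 20_000, n)

model = smf.ols("price ~ sqft + bedrooms + age", data=df).fit()

print(model.params)                  # β0, β1, β2, β3 estimates
print(model.rsquared)                # R²
print(model.rsquared_adj)            # adjusted R²
print(model.fvalue, model.f_pvalue)  # overall F-test of the model
```

Calling model.summary() prints the coefficient table, t-tests, R², adjusted R², and the F-statistic together, which maps directly onto the evaluation quantities discussed next.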

Model Evaluation and Significance

  • R² interpretation
    • Ranges from 0 to 1
    • Higher values indicate better model fit (values around 0.7 are often considered a good fit in the social sciences, though benchmarks vary by field)
    • Example: R² of 0.85 means 85% of variance in dependent variable explained by model
  • Adjusted R² comparison
    • Always lower than or equal to R²
    • Useful for comparing models with different numbers of predictors
    • Example: Model A (R² = 0.80, Adj R² = 0.78) vs Model B (R² = 0.82, Adj R² = 0.77) favors Model A despite its lower R² (see the worked sketch after this list)
  • F-statistic interpretation
    • Large F-value with small p-value (< 0.05) indicates significant model
    • Example: F(3, 96) = 25.6, p < 0.001 suggests strong overall model significance
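As a small worked check of the formula Adj R² = 1 − (1 − R²)(n − 1)/(n − k − 1), the snippet below plugs in a sample size of n = 40 and predictor counts k = 3 and k = 9; these values are assumptions chosen only so the arithmetic roughly reproduces the rounded Model A/B figures above.

```python
# Adjusted R² penalizes extra predictors relative to plain R².
def adjusted_r2(r2: float, n: int, k: int) -> float:
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 40                                       # assumed sample size
print(round(adjusted_r2(0.80, n, k=3), 2))   # Model A: ~0.78
print(round(adjusted_r2(0.82, n, k=9), 2))   # Model B: ~0.77 despite the higher R²
```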

Interpreting Coefficients

Understanding Regression Coefficients

  • Each regression coefficient (β₁, β₂, ..., βₖ) represents expected change in dependent variable for one-unit increase in corresponding independent variable, holding all other variables constant
  • Y-intercept (β₀) represents expected value of dependent variable when all independent variables are zero
  • Standardized coefficients (beta coefficients) allow comparison of relative importance of different independent variables
    • Expressed in units of standard deviations
  • Sign of coefficient indicates direction of relationship between independent and dependent variables
    • Positive sign: positive relationship
    • Negative sign: negative relationship
  • Magnitude of coefficient reflects strength of relationship between independent and dependent variables
    • Interpretation must consider scale of measurement
  • Statistical significance of individual coefficients assessed using t-tests
    • Null hypothesis: coefficient equals zero
  • Confidence intervals for coefficients provide range of plausible values for true population parameter
    • Offer insight into precision of estimates (t-tests, intervals, and standardized coefficients are computed in the sketch after this list)
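A minimal sketch of coefficient inference with statsmodels on simulated housing-style data (sqft, age, and price are assumed names): t-statistics and p-values for each coefficient, 95% confidence intervals, and standardized (beta) coefficients obtained by refitting on z-scored variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({"sqft": rng.uniform(500, 3000, n),
                   "age": rng.uniform(0, 50, n)})
df["price"] = 50_000 + 120 * df["sqft"] - 500 * df["age"] + rng.normal(0, 20_000, n)

fit = smf.ols("price ~ sqft + age", data=df).fit()

# t-tests of H0: β_j = 0, their p-values, and 95% confidence intervals
print(fit.tvalues)
print(fit.pvalues)
print(fit.conf_int(alpha=0.05))

# Standardized (beta) coefficients: refit on z-scored variables so every slope
# is in standard-deviation units and comparable across predictors
z = (df - df.mean()) / df.std()
print(smf.ols("price ~ sqft + age", data=z).fit().params)
```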

Practical Interpretation Examples

  • Unstandardized coefficient interpretation
    • Example: In a housing price model, β₁ = 5000 for square footage means each additional square foot increases predicted price by $5000, holding other variables constant
  • Standardized coefficient comparison
    • Example: Beta coefficient for education (0.4) larger than for experience (0.2) in salary prediction model suggests education has stronger effect on salary
  • Confidence interval interpretation
    • Example: 95% CI for age coefficient [1.2, 3.5] in health insurance cost model indicates we can be 95% confident true population effect of age on cost lies between $1.20 and $3.50 per year

Model Selection Techniques

Stepwise Regression Methods

  • Forward selection iteratively adds variables based on statistical criteria (see the sketch after this list)
    • Starts with no predictors, adds most significant variable at each step
  • Backward elimination iteratively removes variables based on statistical criteria
    • Starts with all predictors, removes least significant variable at each step
  • Stepwise selection combines forward and backward approaches
    • Adds and removes variables at each step based on significance
  • Advantages: Automated process, can handle large number of predictors
  • Limitations: May not find optimal model, sensitive to multicollinearity
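The sketch below approximates forward and backward selection with scikit-learn's SequentialFeatureSelector on simulated data. It adds or drops variables based on cross-validated score rather than the p-value criteria described above, so treat it as an illustration of the idea rather than classical stepwise regression.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Simulated data: 10 candidate predictors, only 4 of which carry signal
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=10.0, random_state=0)
lr = LinearRegression()

# Forward: start empty, add the variable that most improves CV score each step
forward = SequentialFeatureSelector(lr, n_features_to_select=4,
                                    direction="forward", cv=5).fit(X, y)
# Backward: start with everything, drop the least useful variable each step
backward = SequentialFeatureSelector(lr, n_features_to_select=4,
                                     direction="backward", cv=5).fit(X, y)

print("forward keeps: ", np.flatnonzero(forward.get_support()))
print("backward keeps:", np.flatnonzero(backward.get_support()))
```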

Information Criteria and Cross-Validation

  • Akaike Information Criterion (AIC) balances model fit and complexity
    • Lower values indicate better models
    • Formula: AIC = 2k - 2ln(L), where k is number of parameters and L is likelihood
  • Bayesian Information Criterion (BIC) similar to AIC but penalizes complexity more heavily
    • Formula: BIC = k ln(n) - 2ln(L), where n is sample size
  • Cross-validation assesses model performance on out-of-sample data
    • K-fold cross-validation divides data into k subsets, trains on k-1 subsets and tests on the remaining subset, rotating so each subset serves once as the test set and averaging the results
    • Example: 5-fold cross-validation with RMSE as the error metric (implemented in the sketch after this list)
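A brief sketch combining both ideas on simulated data: AIC and BIC from statsmodels for a smaller versus a larger model, and 5-fold cross-validated RMSE from scikit-learn. The variable names and the extra (irrelevant) predictor x3 are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 150
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n),
                   "x3": rng.normal(size=n)})
df["y"] = 2 + 3 * df["x1"] - 1.5 * df["x2"] + rng.normal(size=n)  # x3 is noise

# Lower AIC/BIC is better; BIC penalizes the unnecessary x3 more heavily
small = smf.ols("y ~ x1 + x2", data=df).fit()
large = smf.ols("y ~ x1 + x2 + x3", data=df).fit()
print("AIC:", small.aic, large.aic)
print("BIC:", small.bic, large.bic)

# 5-fold cross-validation with RMSE as the out-of-sample error metric
scores = cross_val_score(LinearRegression(), df[["x1", "x2"]], df["y"],
                         cv=5, scoring="neg_root_mean_squared_error")
print("CV RMSE:", -scores.mean())
```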

Regularization and Other Techniques

  • Lasso (Least Absolute Shrinkage and Selection Operator) regression performs variable selection
    • Shrinks some coefficients to exactly zero
    • Useful for high-dimensional data
  • Ridge regression handles multicollinearity by adding penalty term to loss function
    • Shrinks coefficients towards zero but not exactly to zero
  • Elastic Net combines Lasso and Ridge regression approaches
  • Mallows' Cp statistic compares predictive ability of subset models to full model
    • Values close to the number of model parameters (predictors plus intercept) indicate a well-fitting subset model
  • Variance Inflation Factor (VIF) identifies multicollinearity among predictors
    • VIF > 10 suggests high multicollinearity (some texts use a stricter cutoff of 5)
    • Example: In marketing model, VIF for advertising spend (12.5) and sales promotion (11.8) indicates potential multicollinearity (a VIF calculation is included in the sketch after this list)
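The sketch below fits Lasso, Ridge, and Elastic Net with scikit-learn and computes VIFs with statsmodels on simulated data containing two deliberately collinear predictors; ad_spend, promo, and price are made-up names, and the resulting VIF values will not match the 12.5/11.8 example above.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n = 200
ad_spend = rng.normal(size=n)
promo = 0.9 * ad_spend + 0.3 * rng.normal(size=n)   # deliberately collinear with ad_spend
price = rng.normal(size=n)
X = pd.DataFrame({"ad_spend": ad_spend, "promo": promo, "price": price})
y = 5 + 2 * ad_spend + promo + rng.normal(size=n)   # price carries no signal

# Variance Inflation Factors, computed on a design matrix that includes a constant
Xc = sm.add_constant(X)
for i, name in enumerate(Xc.columns):
    if name != "const":
        print(name, variance_inflation_factor(Xc.values, i))

# Regularized fits on standardized predictors
Xs = StandardScaler().fit_transform(X)
print("lasso:", Lasso(alpha=0.1).fit(Xs, y).coef_)        # may zero out weak predictors
print("ridge:", Ridge(alpha=1.0).fit(Xs, y).coef_)        # shrinks, but never exactly to zero
print("enet: ", ElasticNet(alpha=0.1, l1_ratio=0.5).fit(Xs, y).coef_)
```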

Complexity vs Interpretability

Bias-Variance Trade-off and Model Parsimony

  • Bias-variance trade-off balances underfitting (high bias) and overfitting (high variance) in model selection
    • Underfitting: model too simple, fails to capture important patterns (straight line fitted to curved data)
    • Overfitting: model too complex, captures noise in data (high-degree polynomial fitted to linear data); see the sketch after this list
  • Parsimony principle suggests simpler models with fewer predictors are preferable when they explain the data equally well
    • Occam's Razor applied to statistical modeling
    • Example: Linear model with R² = 0.85 preferred over quadratic model with R² = 0.86
  • Curse of dimensionality refers to challenges in analyzing high-dimensional data
    • Leads to overfitting and reduced interpretability
    • Example: Model with 100 predictors for 150 observations likely overfits
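A compact sketch of the trade-off on simulated data with a curved (sinusoidal) signal: a straight line underfits, a cubic fits reasonably, and a degree-15 polynomial is likely to overfit, which shows up as a growing gap between training and test RMSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, 80).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=80)   # curved signal + noise

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)

for degree in (1, 3, 15):   # underfit, reasonable, likely overfit
    fit = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x_tr, y_tr)
    rmse_tr = mean_squared_error(y_tr, fit.predict(x_tr)) ** 0.5
    rmse_te = mean_squared_error(y_te, fit.predict(x_te)) ** 0.5
    print(f"degree {degree:2d}: train RMSE {rmse_tr:.3f}, test RMSE {rmse_te:.3f}")
```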

Managing Model Complexity

  • Feature engineering creates new variables from existing ones to improve model performance
    • Example: Creating interaction term between temperature and humidity in weather prediction model
  • Dimensionality reduction techniques reduce number of variables while retaining important information
    • Principal Component Analysis (PCA) creates uncorrelated linear combinations of original variables
    • Example: Reducing 20 financial indicators to 5 principal components in stock market analysis (a PCA pipeline along these lines is sketched after this list)
  • Model interpretability involves ease of understanding and explaining model's predictions
    • Trade-off between complex, highly accurate models and simpler, more interpretable models
    • Example: Deep neural network (high accuracy, low interpretability) vs decision tree (lower accuracy, high interpretability)
  • Regularization techniques control model complexity by penalizing large coefficient values
    • Lasso: Encourages sparsity, sets some coefficients to zero
    • Ridge: Shrinks coefficients towards zero but not to zero
    • Example: Lasso regression on gene expression data selects subset of most important genes
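A sketch combining a hand-engineered interaction term with a PCA-plus-regression pipeline on simulated data; the names (temp, humidity) and the choice of 3 latent factors behind 20 correlated indicators are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
n = 300

# Feature engineering: an explicit temperature x humidity interaction term
weather = pd.DataFrame({"temp": rng.normal(25, 5, n),
                        "humidity": rng.uniform(0.2, 0.9, n)})
weather["temp_x_humidity"] = weather["temp"] * weather["humidity"]
print(weather.head())

# Dimensionality reduction: 20 correlated indicators driven by 3 latent factors
latent = rng.normal(size=(n, 3))
X = latent @ rng.normal(size=(3, 20)) + 0.5 * rng.normal(size=(n, 20))
y = latent.sum(axis=1) + rng.normal(scale=0.5, size=n)

# Standardize, keep 5 principal components, then fit a linear regression
model = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
model.fit(X, y)
print("explained variance ratio:", model.named_steps["pca"].explained_variance_ratio_)
print("training R²:", model.score(X, y))
```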