Linear Modeling Theory Unit 14 – Logistic & Poisson Regression Models

Logistic and Poisson regression are generalized linear models for binary outcomes and count data, respectively. Their coefficients are estimated by maximum likelihood, and the fitted models predict event probabilities and expected counts from independent variables. Both are widely used in healthcare, marketing, and the social sciences. They capture non-linear relationships between the predictors and the mean of the outcome, and their coefficients are interpreted through odds ratios and incidence rate ratios.

What's the deal with Logistic & Poisson Regression?

  • Logistic and Poisson regression are specialized forms of generalized linear models (GLMs) used to model specific types of dependent variables
  • Logistic regression predicts binary outcomes (yes/no, 0/1) by estimating the probability of an event occurring based on independent variables
  • Poisson regression models count data, where the dependent variable represents the number of occurrences of an event in a fixed interval (number of customer complaints per day)
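  • Both models relate the independent variables to the mean of the dependent variable through a non-linear link function, while staying linear in the coefficients on the link scale (log odds or log expected count); ordinary linear regression instead models the mean directly as a linear function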
  • Logistic and Poisson regression use maximum likelihood estimation (MLE) to estimate the model parameters, which finds the values that maximize the likelihood of observing the data given the model
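    • For these models the maximum likelihood estimates generally have no closed-form solution, so software computes them iteratively (typically with Newton-Raphson or iteratively reweighted least squares), starting from initial estimates and updating them until convergence is reached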
  • These models are essential tools for analyzing categorical and count data in various fields (healthcare, marketing, social sciences)
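To make the "start with an initial estimate and adjust until convergence" idea concrete, here is a minimal sketch of Newton-Raphson for the simplest possible case: a Poisson model with only an intercept, where the expected count is exp(b0). The data and the function name `fit_poisson_intercept` are made up for illustration; real software fits full models with many coefficients, but the loop has the same shape.

```python
import math

def fit_poisson_intercept(counts, tol=1e-10, max_iter=100):
    """Newton-Raphson MLE for an intercept-only Poisson model, mu = exp(b0)."""
    n = len(counts)
    total = sum(counts)
    b0 = 0.0  # initial estimate
    for iteration in range(1, max_iter + 1):
        mu = math.exp(b0)
        score = total - n * mu   # first derivative of the log-likelihood in b0
        hessian = -n * mu        # second derivative of the log-likelihood in b0
        step = score / hessian
        b0 -= step               # Newton update
        if abs(step) < tol:      # stop once the estimate no longer moves
            return b0, iteration
    raise RuntimeError("Newton-Raphson did not converge")

# Hypothetical data: customer complaints per day
complaints_per_day = [2, 0, 3, 1, 4, 2, 1, 3]
b0_hat, n_iter = fit_poisson_intercept(complaints_per_day)
print(f"b0 = {b0_hat:.6f}, exp(b0) = {math.exp(b0_hat):.6f}, iterations = {n_iter}")
```

In this special case the answer is known in advance (the estimate of exp(b0) is just the sample mean), which makes it a handy sanity check on the iteration.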

Key concepts you need to know

  • Odds ratio: A measure of association between an exposure and an outcome, representing the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure
  • Incidence rate ratio (IRR): The ratio of two incidence rates, comparing the rate of events in one group to the rate of events in another group
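  • Deviance: A measure of goodness of fit for GLMs, equal to twice the difference between the log-likelihood of a saturated model (one parameter per observation) and the log-likelihood of the fitted model
    • Lower deviance indicates better fit when comparing models fitted to the same data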
  • Overdispersion: When the observed variance in the data is greater than the variance assumed by the model (for Poisson regression, variance > mean)
    • Violates the assumption of Poisson regression and can lead to incorrect standard errors and p-values
  • Link function: A function that connects the mean of the response distribution to the linear predictor, allowing for a non-linear relationship between the independent variables and the mean of the dependent variable
    • Logistic regression uses the logit link: $\ln\left(\frac{p}{1-p}\right)$
    • Poisson regression uses the log link: $\ln(\mu)$
  • Confusion matrix: A table used to evaluate the performance of a classification model (logistic regression), showing the counts of true positives, true negatives, false positives, and false negatives
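A few of these definitions are easier to remember once you compute them by hand. The sketch below uses only the standard library; the function names and all counts are hypothetical, chosen just to show the arithmetic behind the logit link, the odds ratio, and the incidence rate ratio.

```python
import math

def logit(p):
    """Logit link: probability -> log odds."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Inverse logit: log odds -> probability."""
    return 1 / (1 + math.exp(-eta))

def odds_ratio(exposed_events, exposed_nonevents,
               unexposed_events, unexposed_nonevents):
    """Odds of the outcome with exposure divided by odds without it."""
    odds_exposed = exposed_events / exposed_nonevents
    odds_unexposed = unexposed_events / unexposed_nonevents
    return odds_exposed / odds_unexposed

def incidence_rate_ratio(events_a, time_a, events_b, time_b):
    """Event rate in group A divided by event rate in group B."""
    return (events_a / time_a) / (events_b / time_b)

# Hypothetical 2x2 table: 30 of 100 exposed and 10 of 100 unexposed had the outcome
print("odds ratio:", odds_ratio(30, 70, 10, 90))

# Hypothetical rates: 12 events in 100 person-years vs 6 events in 100 person-years
print("IRR:", incidence_rate_ratio(12, 100, 6, 100))

# The logit and its inverse undo each other
print("round trip:", inv_logit(logit(0.8)))
```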

The math behind it (don't freak out!)

  • Logistic regression models the probability of an event occurring as a function of the independent variables using the logistic function:

    $$P(Y=1 \mid X) = \frac{e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p}}$$

    where $\beta_0$ is the intercept and $\beta_1, \ldots, \beta_p$ are the coefficients for the independent variables $X_1, \ldots, X_p$

  • Poisson regression models the expected count of events as a function of the independent variables using the exponential function:

    $$E(Y \mid X) = e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p}$$

    where $\beta_0$ is the intercept and $\beta_1, \ldots, \beta_p$ are the coefficients for the independent variables $X_1, \ldots, X_p$

  • The coefficients in both models are estimated using maximum likelihood estimation, which finds the values that maximize the likelihood function:

    $$L(\beta \mid y) = \prod_{i=1}^n P(Y_i = y_i \mid X_i, \beta)$$

    where $y_i$ is the observed value of the outcome $Y_i$ for observation $i$, $X_i$ is the vector of independent variables for observation $i$, and $\beta$ is the vector of coefficients

  • Confidence intervals for the coefficients can be calculated using the standard errors, which are the square roots of the diagonal entries of the inverse of the negative Hessian matrix (the Hessian is the matrix of second partial derivatives of the log-likelihood function, evaluated at the estimates)

  • Hypothesis tests for the significance of the coefficients can be performed using Wald tests or likelihood ratio tests
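The pieces above (logistic function, likelihood, Hessian, Wald test) fit together in a short program. What follows is a minimal sketch, not how any particular package implements it: a Newton-Raphson fit of a logistic regression with a single predictor, using only the standard library. The data (`hours`, `passed`) are invented, and the helper names are assumptions made for this example.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def score_and_information(b0, b1, x, y):
    """Gradient of the log-likelihood and the negative Hessian (symmetric 2x2)."""
    g0 = g1 = 0.0
    i00 = i01 = i11 = 0.0
    for xi, yi in zip(x, y):
        p = inv_logit(b0 + b1 * xi)
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        i00 += w
        i01 += w * xi
        i11 += w * xi * xi
    return (g0, g1), (i00, i01, i11)

def invert_2x2(i00, i01, i11):
    """Inverse of the symmetric matrix [[i00, i01], [i01, i11]]."""
    det = i00 * i11 - i01 * i01
    return i11 / det, -i01 / det, i00 / det

def fit_logistic(x, y, tol=1e-10, max_iter=50):
    b0 = b1 = 0.0  # initial estimates
    for _ in range(max_iter):
        (g0, g1), info = score_and_information(b0, b1, x, y)
        c00, c01, c11 = invert_2x2(*info)
        step0 = c00 * g0 + c01 * g1   # Newton step = inverse information times score
        step1 = c01 * g0 + c11 * g1
        b0, b1 = b0 + step0, b1 + step1
        if max(abs(step0), abs(step1)) < tol:
            break
    else:
        raise RuntimeError("Newton-Raphson did not converge")
    # Standard errors: square roots of the diagonal of the inverse
    # negative Hessian, evaluated at the final estimates
    _, info = score_and_information(b0, b1, x, y)
    c00, _, c11 = invert_2x2(*info)
    return (b0, b1), (math.sqrt(c00), math.sqrt(c11))

# Hypothetical data: hours studied and whether the student passed
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 1, 0, 1, 0, 1, 1]

(b0_hat, b1_hat), (se0, se1) = fit_logistic(hours, passed)
wald_z = b1_hat / se1                                          # Wald statistic for the slope
ci_low, ci_high = b1_hat - 1.96 * se1, b1_hat + 1.96 * se1     # approximate 95% interval
print(f"b1 = {b1_hat:.3f}, SE = {se1:.3f}, z = {wald_z:.2f}, "
      f"95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

With only eight observations the interval will be wide, which is the point of computing it. Note that this sketch would fail on perfectly separated data, where the maximum likelihood estimate does not exist.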

When to use these models

  • Use logistic regression when the dependent variable is binary (pass/fail, yes/no, customer churn)
    • Can be extended to multinomial logistic regression for dependent variables with more than two categories
  • Use Poisson regression when the dependent variable is a count (number of accidents per year, number of customer purchases per month)
    • Appropriate when the events are independent and the rate of occurrence is constant over time
  • Consider the assumptions of each model before applying them to your data:
    • Logistic regression assumes a linear relationship between the independent variables and the log odds of the outcome, no multicollinearity, and independence of observations
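    • Poisson regression assumes a linear relationship between the independent variables and the log of the expected count, equal mean and variance of the dependent variable given the predictors (equidispersion), independence of events, and a constant rate of occurrence within each observation interval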
  • If the assumptions are violated, consider alternative models (negative binomial regression for overdispersed count data, mixed-effects models for clustered data)
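A quick first screen for overdispersion is the ratio of the sample variance to the sample mean of the counts. The sketch below assumes made-up data and a hypothetical helper `dispersion_ratio`. Treat it as a rough check only: it ignores the predictors, and the more usual diagnostic after fitting a model is the Pearson chi-square statistic divided by the residual degrees of freedom.

```python
from statistics import mean, variance

def dispersion_ratio(counts):
    """Sample variance divided by sample mean; near 1 is consistent with Poisson."""
    return variance(counts) / mean(counts)

# Hypothetical count data
near_equidispersed = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5]
overdispersed = [0, 0, 12, 1, 0, 9, 0, 15, 0, 2]

for name, counts in [("near_equidispersed", near_equidispersed),
                     ("overdispersed", overdispersed)]:
    ratio = dispersion_ratio(counts)
    # The cutoff of 2 is an arbitrary illustration, not a formal test
    hint = "consider negative binomial" if ratio > 2 else "Poisson looks plausible"
    print(f"{name}: variance/mean = {ratio:.2f} ({hint})")
```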

Building and interpreting the models

  • Start by exploring your data and checking for missing values, outliers, and collinearity among independent variables
  • Split your data into training and testing sets to evaluate the model's performance on unseen data
  • Use appropriate coding schemes for categorical variables (dummy coding, effect coding) and scale continuous variables if necessary
  • Fit the model using statistical software (R, Python, SAS) and assess the model's fit using deviance, AIC, or BIC
    • Lower values indicate better fit, but be cautious of overfitting
  • Interpret the coefficients in terms of odds ratios (logistic regression) or incidence rate ratios (Poisson regression)
    • For logistic regression, $e^{\beta_j}$ is the odds ratio: the factor by which the odds of the outcome are multiplied for a one-unit increase in $X_j$, holding other variables constant
    • For Poisson regression, $e^{\beta_j}$ is the incidence rate ratio: the factor by which the expected count of events is multiplied for a one-unit increase in $X_j$, holding other variables constant
  • Assess the model's predictive performance using metrics such as accuracy, precision, recall, and F1 score (logistic regression) or mean squared error and mean absolute error (Poisson regression)
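Turning a fitted coefficient into an odds ratio or incidence rate ratio is just exponentiation, and the same goes for the endpoints of its Wald confidence interval. A minimal sketch follows, with hypothetical coefficient values that do not come from any real fit. It also checks numerically that a one-unit increase in the predictor multiplies the odds by exp(b1).

```python
import math

def ratio_with_ci(coef, se, z=1.96):
    """Exponentiate a coefficient and the endpoints of its Wald interval."""
    return math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se)

def percent_change(ratio):
    """Express a ratio as a percent change (1.5 -> +50%)."""
    return (ratio - 1.0) * 100.0

def odds(b0, b1, x):
    """Odds p / (1 - p) in a logistic model equal exp(linear predictor)."""
    return math.exp(b0 + b1 * x)

# Hypothetical logistic regression output: intercept, slope, slope standard error
b0, b1, se_b1 = -1.2, 0.405, 0.15

or_est, or_low, or_high = ratio_with_ci(b1, se_b1)
print(f"odds ratio = {or_est:.2f} (95% CI {or_low:.2f} to {or_high:.2f}), "
      f"about {percent_change(or_est):.0f}% higher odds per unit of X")

# Going from x = 2 to x = 3 multiplies the odds by exp(b1)
ratio_check = odds(b0, b1, 3.0) / odds(b0, b1, 2.0)
print(f"odds(x=3) / odds(x=2) = {ratio_check:.4f}")
```

The same `ratio_with_ci` call works for a Poisson coefficient; the result is then read as an incidence rate ratio instead of an odds ratio.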

Common pitfalls and how to avoid them

  • Multicollinearity: High correlation among independent variables can lead to unstable coefficient estimates and inflated standard errors
    • Check for multicollinearity using variance inflation factors (VIF) or correlation matrices
    • Consider removing or combining highly correlated variables, or using regularization techniques (ridge regression, lasso)
  • Overfitting: Models that are too complex may fit the noise in the training data, leading to poor performance on new data
    • Use cross-validation or regularization to prevent overfitting
    • Compare models using information criteria (AIC, BIC) and choose the simplest model that adequately fits the data
  • Imbalanced data: When the classes in a binary outcome are not equally represented, the model may have difficulty learning the minority class
    • Consider oversampling the minority class, undersampling the majority class, or using weighted loss functions
    • Evaluate the model using metrics that are sensitive to class imbalance (F1 score, area under the precision-recall curve)
  • Outliers and influential observations: Extreme values can have a disproportionate impact on the model's coefficients and fit
    • Identify outliers using diagnostic plots (residuals vs. fitted values, Cook's distance)
    • Consider removing or downweighting influential observations, or using robust regression techniques (M-estimation, least trimmed squares)
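The imbalanced-data pitfall is easy to see with numbers. In the sketch below (invented labels, hypothetical function names), a "model" that always predicts the majority class gets high accuracy while finding none of the positives, so precision, recall, and F1 tell a very different story.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    # Report 0.0 when a denominator is zero (a convention chosen for this sketch)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical imbalanced outcome: 5 positives, 95 negatives
y_true = [1] * 5 + [0] * 95

always_negative = [0] * 100                          # never predicts the minority class
catches_some = [1, 1, 1, 1, 0] + [1] * 10 + [0] * 85  # finds 4 of 5, with 10 false alarms

m_naive = classification_metrics(y_true, always_negative)
m_better = classification_metrics(y_true, catches_some)
print("always negative:", m_naive)
print("catches some:   ", m_better)
```

The second set of predictions has lower accuracy but a much higher F1, which is why accuracy alone is a poor guide when classes are unbalanced.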

Real-world applications

  • Healthcare: Predicting the risk of disease based on patient characteristics (age, gender, lifestyle factors), modeling the number of hospital admissions for a specific condition
  • Marketing: Predicting customer churn based on demographics and purchase history, modeling the number of product purchases per customer
  • Finance: Predicting the probability of loan default based on borrower characteristics and credit history, modeling the number of insurance claims filed per policy
  • Social sciences: Predicting voting behavior based on demographic and socioeconomic factors, modeling the number of arrests per neighborhood
  • Ecology: Predicting the presence or absence of a species based on habitat characteristics, modeling the number of animal sightings per survey

Tips for acing your assignments

  • Read the assignment instructions carefully and make sure you understand the research question and the variables involved
  • Explore your data thoroughly before fitting any models, and report any data cleaning or preprocessing steps
  • Justify your choice of model based on the nature of the dependent variable and the assumptions of the model
  • Interpret your results in the context of the research question and the real-world implications of your findings
  • Use clear and concise language to communicate your methods and results, and include visualizations where appropriate
  • Double-check your code and output for errors, and make sure your conclusions are supported by your analysis
  • Seek feedback from your instructor or peers, and be open to constructive criticism and suggestions for improvement


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.
