Binary logistic regression is a powerful tool for predicting outcomes with two possible results. It's like flipping a coin, but with data-driven probabilities instead of 50/50 chances. This method helps us understand how different factors influence the likelihood of an event happening.
In this section, we'll explore the nuts and bolts of binary logistic regression. We'll learn about the logistic function, odds ratios, and how to interpret model coefficients. These concepts are crucial for making sense of your results and applying them in real-world situations.
Logistic Regression Fundamentals
Logistic Function and Odds Ratio
- Logistic regression models the probability of a binary outcome using the logistic function, which maps any real number to a value between 0 and 1
- The logistic function is defined as: $p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x)}}$
- $p(x)$ represents the probability of the event occurring given the input variable $x$
- $\beta_0$ and $\beta_1$ are the coefficients estimated from the data
- The odds ratio is a measure of association between an exposure and an outcome, representing the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure
- Odds ratio = $\frac{\text{odds}(\text{event} \mid \text{exposure})}{\text{odds}(\text{event} \mid \text{no exposure})}$
- An odds ratio greater than 1 indicates that the exposure is associated with higher odds of the outcome, while an odds ratio less than 1 indicates that the exposure is associated with lower odds of the outcome
- The logit transformation is the logarithm of the odds, which allows for a linear relationship between the predictor variables and the log-odds of the outcome
- Logit transformation: $\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1x$
- The logit transformation maps probabilities from the range [0, 1] to the entire real line, enabling the use of linear regression techniques
- The coefficients in a logistic regression model represent the change in the log-odds of the outcome for a one-unit increase in the corresponding predictor variable, holding other variables constant
- A positive coefficient indicates that an increase in the predictor variable is associated with an increase in the log-odds of the outcome
- A negative coefficient indicates that an increase in the predictor variable is associated with a decrease in the log-odds of the outcome
- To interpret the coefficients in terms of odds ratios, exponentiate the coefficients: $e^{\beta_i}$
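The quantities above can be sketched in a few lines of Python; the coefficients here are made-up values chosen only for illustration, not estimates from any real data:

```python
import math

def logistic(x, beta0, beta1):
    """Logistic function: maps the linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

# Hypothetical coefficients for illustration (not fitted to real data)
beta0, beta1 = -1.5, 0.8

p = logistic(1.0, beta0, beta1)   # probability of the event at x = 1
odds = p / (1 - p)                # odds = p / (1 - p)
log_odds = math.log(odds)         # equals beta0 + beta1 * x by the logit identity
odds_ratio = math.exp(beta1)      # multiplicative change in odds per one-unit increase in x
```

Note that `log_odds` recovers the linear predictor $\beta_0 + \beta_1 x$ exactly, which is the point of the logit transformation.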
Parameter Estimation and Inference
Maximum Likelihood Estimation
- Maximum likelihood estimation (MLE) is used to estimate the coefficients in a logistic regression model by finding the values that maximize the likelihood function
- The likelihood function represents the probability of observing the data given the model parameters
- MLE seeks to find the parameter values that make the observed data most likely
- The log-likelihood function is often used for computational convenience: $\ell(\beta) = \sum_{i=1}^n [y_i \ln(p(x_i)) + (1-y_i) \ln(1-p(x_i))]$
- Iterative optimization algorithms, such as Newton-Raphson or gradient descent, are used to find the maximum likelihood estimates of the coefficients
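A minimal sketch of MLE via gradient ascent on the log-likelihood above; the tiny dataset, learning rate, and iteration count are all illustrative assumptions (a real fit would use Newton-Raphson or a library optimizer):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(xs, ys, b0, b1):
    """ell(beta) = sum[y*ln(p) + (1-y)*ln(1-p)] with p = logistic(b0 + b1*x)."""
    ll = 0.0
    for x, y in zip(xs, ys):
        p = logistic(b0 + b1 * x)
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

# Tiny made-up dataset for illustration (overlapping classes, so the MLE is finite)
xs = [0.5, 1.5, 2.0, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 1]

# Gradient ascent: d ell/d b0 = sum(y - p), d ell/d b1 = sum x*(y - p)
b0, b1 = 0.0, 0.0
lr = 0.05
for _ in range(20000):
    g0 = sum(y - logistic(b0 + b1 * x) for x, y in zip(xs, ys))
    g1 = sum(x * (y - logistic(b0 + b1 * x)) for x, y in zip(xs, ys))
    b0 += lr * g0
    b1 += lr * g1
```

Each step moves the coefficients uphill on the log-likelihood, so the fitted values make the observed data more likely than the starting point $(0, 0)$.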
Hypothesis Testing: Wald Test and Likelihood Ratio Test
- The Wald test is used to assess the significance of individual coefficients in a logistic regression model
- The Wald test statistic is calculated as: $z = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}$, where $\hat{\beta}_j$ is the estimated coefficient and $SE(\hat{\beta}_j)$ is its standard error
- The Wald test follows a standard normal distribution under the null hypothesis that the coefficient is zero
- The likelihood ratio test compares the fit of two nested models, one with the predictor variable(s) of interest and one without
- The test statistic is calculated as: $-2 \ln\!\left(\frac{L_{\text{reduced}}}{L_{\text{full}}}\right)$, where $L_{\text{reduced}}$ and $L_{\text{full}}$ are the likelihoods of the reduced and full models, respectively
- The test statistic follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters between the two models
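Both tests can be sketched with the standard library alone; the fitted coefficient, standard error, and log-likelihoods below are hypothetical numbers chosen purely for illustration:

```python
import math

def normal_sf(z):
    """Upper-tail probability of the standard normal, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# --- Wald test: z = beta_hat / SE(beta_hat) ---
# Hypothetical fitted coefficient and standard error
beta_hat, se = 0.9, 0.35
z = beta_hat / se
wald_p = 2.0 * normal_sf(abs(z))      # two-sided p-value under H0: beta_j = 0

# --- Likelihood ratio test: -2 ln(L_reduced / L_full) ---
# Hypothetical log-likelihoods of nested models (full model has one extra parameter)
ll_reduced, ll_full = -54.2, -50.1
G = -2.0 * (ll_reduced - ll_full)     # equivalently 2*(ll_full - ll_reduced)
# For df = 1, the chi-square survival function reduces to erfc(sqrt(G/2))
lrt_p = math.erfc(math.sqrt(G / 2.0))
```

Both p-values come out well below 0.05 for these illustrative inputs, so each test would reject its null hypothesis at the 5% level.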
Model Interpretation
Predicted Probabilities and Odds Ratios
- Predicted probabilities are the estimated probabilities of the outcome for given values of the predictor variables, calculated using the fitted logistic regression model
- For a given set of predictor values $x_i$, the predicted probability is: $\hat{p}(x_i) = \frac{1}{1 + e^{-(\hat{\beta}_0 + \hat{\beta}_1x_i)}}$
- Predicted probabilities provide a more intuitive interpretation of the model's results compared to log-odds or coefficients
- Odds ratios can be calculated from the coefficients of the logistic regression model to quantify the association between predictor variables and the outcome
- The odds ratio for a one-unit increase in a predictor variable $x_j$ is given by: $OR_j = e^{\beta_j}$
- An odds ratio greater than 1 indicates that an increase in the predictor variable is associated with higher odds of the outcome, while an odds ratio less than 1 indicates that an increase in the predictor variable is associated with lower odds of the outcome
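A quick numeric check of the identity above, using hypothetical fitted coefficients: a one-unit increase in the predictor multiplies the odds by exactly $e^{\beta_1}$.

```python
import math

def predicted_prob(x, b0, b1):
    """p_hat(x) = 1 / (1 + exp(-(b0 + b1*x)))"""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Hypothetical fitted coefficients for illustration
b0, b1 = -2.0, 0.7

p1 = predicted_prob(2.0, b0, b1)
p2 = predicted_prob(3.0, b0, b1)   # same inputs with x increased by one unit

odds1 = p1 / (1 - p1)
odds2 = p2 / (1 - p2)
ratio = odds2 / odds1              # equals e^{b1} exactly, whatever the baseline x
```

The predicted probabilities themselves change by a different amount depending on the baseline $x$; only the odds ratio is constant, which is why coefficients are reported on that scale.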
Coefficient Interpretation and Model Application
- Interpreting coefficients in a logistic regression model involves understanding the change in the log-odds or odds of the outcome associated with a one-unit increase in the predictor variable
- For a coefficient $\beta_j$, a one-unit increase in the predictor variable $x_j$ is associated with a $\beta_j$ change in the log-odds of the outcome, holding other variables constant
- Exponentiating the coefficient yields the odds ratio: $OR_j = e^{\beta_j}$, which represents the multiplicative change in the odds of the outcome for a one-unit increase in $x_j$
- Logistic regression models can be used for various applications, such as:
- Predicting the probability of a binary outcome (e.g., customer churn, loan default, disease diagnosis)
- Identifying important risk factors or predictors associated with the outcome
- Classifying observations into two groups based on a probability threshold (e.g., spam email detection, credit approval)
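The classification use case reduces to thresholding the predicted probability; the coefficients and the 0.5 cutoff below are illustrative assumptions (in practice the threshold is often tuned to trade off false positives against false negatives):

```python
import math

def predict_label(x, b0, b1, threshold=0.5):
    """Classify as 1 if the predicted probability meets or exceeds the threshold."""
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
    return 1 if p >= threshold else 0

# Hypothetical coefficients for illustration
b0, b1 = -3.0, 1.0
labels = [predict_label(x, b0, b1) for x in [1.0, 2.5, 4.0]]
```

Here only the largest input crosses the 0.5 probability cutoff, so the three observations are classified as `[0, 0, 1]`.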