Statistical Inference

Contingency tables organize categorical data, showing relationships between variables through frequencies. They're crucial for analyzing associations in fields like market research and epidemiology, helping us understand patterns and dependencies in complex datasets.

Log-linear models take contingency table analysis further, modeling cell frequencies as functions of variable effects. These powerful tools allow us to examine intricate relationships in categorical data, test hypotheses, and make predictions about complex interactions between variables.

Contingency Tables

Construction of contingency tables

  • Contingency table structure organizes categorical data into rows and columns representing categories, with cell frequencies showing the count for each combination (see the sketch after this list)
  • Types include two-way tables for two variables and multi-way tables for three or more variables
  • Marginal frequencies are the row and column totals, giving the overall distribution of each variable
  • Conditional frequencies show the distribution of one variable given a specific category of another
  • Expected frequencies are the theoretical cell counts under independence: $E_{ij} = (\text{row total}_i \times \text{column total}_j) / n$
  • Relative frequencies express cell counts as percentages (row percentages, column percentages) for easier comparison
  • Independence vs. association asks whether the variables are related or vary independently
  • Simpson's paradox shows how an association between variables can reverse when data are aggregated or split into subgroups (e.g., apparent medical treatment efficacy)
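
A minimal sketch of these steps in Python, using pandas and scipy; the smoking/exercise survey data are hypothetical, invented purely for illustration:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical categorical data: smoking status vs. exercise level
data = pd.DataFrame({
    "smoker":   ["yes", "no", "no", "yes", "no", "no", "yes", "no"],
    "exercise": ["low", "high", "high", "low", "low", "high", "high", "low"],
})

# Two-way table with marginal frequencies (row and column totals)
table = pd.crosstab(data["smoker"], data["exercise"], margins=True)
print(table)

# Conditional (row) percentages: distribution of exercise given smoking status
row_pct = pd.crosstab(data["smoker"], data["exercise"], normalize="index")
print(row_pct)

# Expected cell counts under independence: E_ij = row_i * col_j / n
observed = pd.crosstab(data["smoker"], data["exercise"])
chi2, p, dof, expected = chi2_contingency(observed)
print(expected)
```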

Concept of log-linear models

  • Log-linear models analyze relationships in categorical data by modeling cell frequencies as a function of variable effects
  • Applied to complex categorical data analysis (market segmentation, epidemiology)
  • Related to logistic regression, but they model cell counts rather than probabilities
  • Advantages include handling multi-way tables and testing complex hypotheses
  • Model components incorporate main effects (individual variable impacts) and interaction effects (combined variable impacts)
  • Hierarchical structure means a model containing a higher-order interaction also includes all of its lower-order effects
  • Saturated models include all possible effects and reproduce the observed counts exactly, while unsaturated models omit some effects for parsimony (see the sketch after this list)
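
To make the saturated vs. unsaturated distinction concrete, here is a minimal sketch using statsmodels' Poisson GLM; the 2×2 treatment/outcome counts are hypothetical:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per cell of a 2x2 table: treatment x outcome, with observed counts
cells = pd.DataFrame({
    "treatment": ["A", "A", "B", "B"],
    "outcome":   ["success", "failure", "success", "failure"],
    "count":     [30, 10, 20, 25],
})

# Independence model: log(mu) = intercept + treatment + outcome (no interaction)
indep = smf.glm("count ~ treatment + outcome", data=cells,
                family=sm.families.Poisson()).fit()

# Saturated model: adds the treatment:outcome interaction, one parameter per cell
sat = smf.glm("count ~ treatment * outcome", data=cells,
              family=sm.families.Poisson()).fit()

print(indep.deviance)  # G^2 of the independence model: its lack of fit
print(sat.deviance)    # ~0: the saturated model reproduces the observed counts
```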

Fitting and interpretation of models

  • Model specification uses a Poisson regression framework with a log link function to relate predictors to cell counts
  • Parameter estimation uses maximum likelihood, often via the iterative proportional fitting (IPF) algorithm
  • Model notation uses a design matrix to represent variable effects and log-linear equations to express relationships
  • Interpretation of model parameters reveals the strength and direction of main effects and interaction effects on cell frequencies
  • Odds ratios and relative risks derived from the parameters quantify association between variables; in a 2×2 table, the interaction parameter equals the log odds ratio (see the sketch after this list)
  • Contrast coding for categorical variables allows comparison of specific categories or groups
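
Continuing the hypothetical 2×2 example, a minimal sketch of parameter interpretation: in the saturated model the interaction coefficient is the log odds ratio, so exponentiating it recovers the table's cross-product ratio. The parameter name below is the patsy-style label statsmodels generates for dummy-coded factors:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

cells = pd.DataFrame({
    "treatment": ["A", "A", "B", "B"],
    "outcome":   ["success", "failure", "success", "failure"],
    "count":     [30, 10, 20, 25],
})

sat = smf.glm("count ~ treatment * outcome", data=cells,
              family=sm.families.Poisson()).fit()
print(sat.summary())

# The treatment:outcome coefficient is the log odds ratio; exponentiate it
log_or = sat.params["treatment[T.B]:outcome[T.success]"]
print(np.exp(log_or))

# Check against the cross-product ratio computed directly from the counts:
# (B,success) * (A,failure) / [(B,failure) * (A,success)]
print((20 * 10) / (25 * 30))
```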

Model selection and goodness-of-fit

  • Goodness-of-fit statistics assess model fit: the likelihood ratio statistic $G^2 = 2 \sum O \ln(O/E)$ and the Pearson chi-square statistic $X^2 = \sum (O - E)^2 / E$, where $O$ and $E$ are observed and fitted cell counts
  • Degrees of freedom equal the number of cells minus the number of estimated parameters
  • P-values test whether deviations between observed and fitted counts exceed chance; a small p-value signals lack of fit
  • Residual analysis examines standardized and adjusted residuals to identify poorly fitted cells
  • Model comparison techniques include nested model testing and information criteria (AIC, BIC) to balance fit and complexity
  • Stepwise model selection uses forward selection or backward elimination to build a parsimonious model
  • Parsimony principle favors simpler models with fewer parameters when fit is comparable
  • Assumption checks include adequate sample size and guarding against sparse contingency tables with many zero cells (see the sketch after this list)
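
A minimal sketch of these diagnostics for the same hypothetical independence model: deviance ($G^2$), Pearson $X^2$, residual degrees of freedom, p-value, information criteria, and residuals:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

cells = pd.DataFrame({
    "treatment": ["A", "A", "B", "B"],
    "outcome":   ["success", "failure", "success", "failure"],
    "count":     [30, 10, 20, 25],
})

indep = smf.glm("count ~ treatment + outcome", data=cells,
                family=sm.families.Poisson()).fit()

# G^2 (deviance) and Pearson X^2 for the independence model
g2 = indep.deviance
x2 = indep.pearson_chi2
df = indep.df_resid          # cells minus estimated parameters: 4 - 3 = 1
p_value = chi2.sf(g2, df)    # small p => the independence model fits poorly
print(g2, x2, df, p_value)

# Information criteria trade fit against complexity (lower is better)
print(indep.aic, indep.bic)

# Standardized (Pearson) residuals flag poorly fitted cells
print(indep.resid_pearson)
```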