Statistical Methods for Data Science

Probability distributions are the backbone of statistical analysis. They help us model real-world events and make predictions. This section covers key discrete and continuous distributions, their properties, and applications.

Understanding these distributions is crucial for data science. From Bernoulli trials to Normal curves, each distribution has unique characteristics that shape how we interpret data and make informed decisions in various fields.

Discrete Distributions

Bernoulli and Binomial Distributions

  • Bernoulli distribution models a single trial with two possible outcomes (success or failure) with probability of success $p$ and failure $1-p$
  • Probability mass function (PMF) for Bernoulli distribution: $P(X=x) = p^x(1-p)^{1-x}$ for $x=0,1$
  • Binomial distribution models the number of successes in a fixed number of independent Bernoulli trials with constant probability of success
  • PMF for Binomial distribution: $P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}$ for $k=0,1,2,...,n$
  • Mean and variance of Binomial distribution: $E(X)=np$ and $Var(X)=np(1-p)$
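
The Binomial formulas above can be checked numerically. The following is a minimal sketch using only the Python standard library; the values `n = 10` and `p = 0.3` are illustrative assumptions, not taken from the text. It evaluates the PMF at every $k$, then recovers the total probability, mean, and variance directly from the PMF.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Illustrative parameters (assumed for this example).
n, p = 10, 0.3

# PMF evaluated at every possible number of successes k = 0, ..., n.
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Moments computed directly from the PMF.
total = sum(pmf)
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(f"sum of PMF = {total:.6f}")
print(f"mean       = {mean:.6f}  (np       = {n * p:.6f})")
print(f"variance   = {variance:.6f}  (np(1-p)  = {n * p * (1 - p):.6f})")
```

Setting `n = 1` reduces `binomial_pmf` to the Bernoulli PMF, since $\binom{1}{x} = 1$ for $x = 0, 1$.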

Poisson Distribution

  • Poisson distribution models the number of events occurring in a fixed interval of time or space, when events occur independently at a known constant average rate $\lambda$
  • PMF for Poisson distribution: $P(X=k) = \frac{e^{-\lambda}\lambda^k}{k!}$ for $k=0,1,2,...$
  • Mean and variance of Poisson distribution: $E(X)=\lambda$ and $Var(X)=\lambda$
  • Poisson distribution approximates the Binomial distribution when $n$ is large and $p$ is small, with $\lambda=np$ held fixed (rare events in a large number of trials)
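
To see the Poisson approximation at work, the sketch below holds $\lambda = np$ fixed and compares the two PMFs as $n$ grows. The rate `lam = 2.0`, the values of `n`, and the cutoff of `k <= 20` are assumptions chosen for illustration; the helper names are hypothetical.

```python
from math import comb, exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


lam = 2.0  # fixed rate, assumed for this example


def max_abs_diff(n: int) -> float:
    """Largest pointwise gap between Binomial(n, lam/n) and Poisson(lam) over k = 0..20."""
    p = lam / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(21))


# The gap should shrink as n grows and p = lam / n shrinks.
gaps = {n: max_abs_diff(n) for n in (20, 200, 2000)}
for n, gap in gaps.items():
    print(f"n = {n:5d}, p = {lam / n:.4f}, max |Binomial - Poisson| = {gap:.2e}")
```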

Continuous Symmetric Distributions

Uniform Distribution

  • Uniform distribution models a continuous random variable with equal probability density over a specified interval $[a,b]$
  • Probability density function (PDF) for Uniform distribution: $f(x) = \frac{1}{b-a}$ for $a \leq x \leq b$, and $f(x)=0$ otherwise
  • Mean and variance of Uniform distribution: $E(X)=\frac{a+b}{2}$ and $Var(X)=\frac{(b-a)^2}{12}$
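
The Uniform mean and variance can be recovered by integrating against the PDF. The sketch below does this with a simple midpoint Riemann sum; the interval `[2, 8]` and the number of steps are illustrative assumptions.

```python
def uniform_pdf(x: float, a: float, b: float) -> float:
    """Density of Uniform(a, b): constant on [a, b], zero elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


# Illustrative interval (assumed for this example).
a, b = 2.0, 8.0

# Midpoint rule on [a, b].
steps = 100_000
dx = (b - a) / steps
midpoints = [a + (i + 0.5) * dx for i in range(steps)]

area = sum(uniform_pdf(x, a, b) * dx for x in midpoints)
mean = sum(x * uniform_pdf(x, a, b) * dx for x in midpoints)
variance = sum((x - mean) ** 2 * uniform_pdf(x, a, b) * dx for x in midpoints)

print(f"area     = {area:.6f}")
print(f"mean     = {mean:.6f}  ((a+b)/2     = {(a + b) / 2:.6f})")
print(f"variance = {variance:.6f}  ((b-a)^2/12  = {(b - a) ** 2 / 12:.6f})")
```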

Normal and t-Distributions

  • Normal distribution (Gaussian distribution) models a continuous random variable with a symmetric, bell-shaped density curve
  • PDF for Normal distribution: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ for $-\infty < x < \infty$
  • Mean and variance of Normal distribution: $E(X)=\mu$ and $Var(X)=\sigma^2$
  • Standard Normal distribution has $\mu=0$ and $\sigma=1$, denoted as $Z \sim N(0,1)$
  • t-distribution models the t-statistic, which is used for inference on a mean when the population variance is unknown and must be estimated from the sample; the difference from the Normal matters most for small samples
  • t-distribution has heavier tails than the Normal distribution and is characterized by its degrees of freedom $\nu$
  • As $\nu$ increases, the t-distribution approaches the standard Normal distribution
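
The heavier tails of the t-distribution, and its convergence to the standard Normal, can be illustrated by comparing densities at a point in the tail. This sketch uses `statistics.NormalDist` from the Python standard library for the Normal. The text does not give the t density, so the code uses its standard closed form, $f(x) = \frac{\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\left(1+\frac{x^2}{\nu}\right)^{-(\nu+1)/2}$. The evaluation point `x_tail = 3.0` and the degrees of freedom tried are assumptions for illustration.

```python
from math import gamma, pi, sqrt
from statistics import NormalDist


def t_pdf(x: float, nu: float) -> float:
    """Density of Student's t-distribution with nu degrees of freedom."""
    coeff = gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2))
    return coeff * (1 + x**2 / nu) ** (-(nu + 1) / 2)


z = NormalDist(mu=0.0, sigma=1.0)  # standard Normal, Z ~ N(0, 1)

# Probability mass within one standard deviation of the mean.
within_one_sigma = z.cdf(1.0) - z.cdf(-1.0)

# Compare tail densities at an illustrative point three units from the center.
x_tail = 3.0
normal_density = z.pdf(x_tail)
t_densities = {nu: t_pdf(x_tail, nu) for nu in (2, 10, 100)}

print(f"P(-1 < Z < 1) = {within_one_sigma:.4f}")
print(f"Normal density at {x_tail}: {normal_density:.5f}")
for nu, density in t_densities.items():
    print(f"t density (nu = {nu:3d}) at {x_tail}: {density:.5f}")
```

The t densities at the tail point stay above the Normal density and move toward it as $\nu$ grows.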

Continuous Skewed Distributions

Exponential Distribution

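  • Exponential distribution models the waiting time between consecutive events in a Poisson process with rate $\lambda$; it is memoryless, meaning $P(X > s+t \mid X > s) = P(X > t)$ for all $s, t \geq 0$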
  • PDF for Exponential distribution: $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$
  • Mean and variance of Exponential distribution: $E(X)=\frac{1}{\lambda}$ and $Var(X)=\frac{1}{\lambda^2}$
  • Exponential distribution is the continuous analogue of the Geometric distribution
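
The memoryless property can be verified from the survival function $P(X > x) = e^{-\lambda x}$, which follows from integrating the PDF above from $x$ to $\infty$. The sketch below checks that having already waited `s` units does not change the distribution of the remaining wait. The values of `lam`, `s`, and `t` are illustrative assumptions.

```python
from math import exp


def exponential_survival(x: float, lam: float) -> float:
    """P(X > x) for X ~ Exponential(lam), valid for x >= 0."""
    return exp(-lam * x)


# Illustrative values (assumed for this example).
lam = 0.5
s, t = 3.0, 2.0

# P(X > s + t | X > s) = P(X > s + t) / P(X > s)
conditional = exponential_survival(s + t, lam) / exponential_survival(s, lam)

# P(X > t), the probability with no waiting time elapsed.
unconditional = exponential_survival(t, lam)

print(f"P(X > s+t | X > s) = {conditional:.6f}")
print(f"P(X > t)           = {unconditional:.6f}")
```

The Geometric distribution satisfies the same identity over whole numbers of trials, which is the sense in which the Exponential is its continuous analogue.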

Chi-square and F-Distributions

  • Chi-square distribution models the sum of squares of independent standard Normal random variables
  • Chi-square distribution is characterized by its degrees of freedom $\nu$, which equals the number of standard Normal random variables summed
  • Mean and variance of Chi-square distribution: $E(X)=\nu$ and $Var(X)=2\nu$
  • F-distribution models the ratio of two independent Chi-square random variables, each divided by its own degrees of freedom: $F = \frac{U_1/\nu_1}{U_2/\nu_2}$, where $U_1 \sim \chi^2_{\nu_1}$ and $U_2 \sim \chi^2_{\nu_2}$
  • F-distribution is characterized by its numerator degrees of freedom $\nu_1$ and denominator degrees of freedom $\nu_2$
  • F-distribution is used in analysis of variance (ANOVA) and regression analysis to test for the significance of factors or predictors
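
Both constructions can be simulated directly from standard Normal draws. The sketch below builds Chi-square samples as sums of squared standard Normals and F samples as ratios of scaled Chi-square draws. The seed, sample size, and degrees of freedom are illustrative assumptions. The F mean $\nu_2/(\nu_2-2)$ for $\nu_2 > 2$ used as a reference is a standard result that the text does not state.

```python
import random
from statistics import fmean, pvariance

rng = random.Random(0)  # fixed seed so the run is reproducible


def chi_square_draw(nu: int) -> float:
    """One Chi-square(nu) draw: sum of nu squared standard Normal draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))


def f_draw(nu1: int, nu2: int) -> float:
    """One F(nu1, nu2) draw: ratio of independent Chi-squares, each over its df."""
    return (chi_square_draw(nu1) / nu1) / (chi_square_draw(nu2) / nu2)


n_samples = 20_000  # assumed sample size for this example

# Chi-square: sample mean should be near nu, sample variance near 2 * nu.
nu = 5
chi_samples = [chi_square_draw(nu) for _ in range(n_samples)]
chi_mean = fmean(chi_samples)
chi_var = pvariance(chi_samples)

# F: sample mean should be near nu2 / (nu2 - 2) when nu2 > 2.
nu1, nu2 = 4, 20
f_samples = [f_draw(nu1, nu2) for _ in range(n_samples)]
f_mean = fmean(f_samples)

print(f"Chi-square mean     = {chi_mean:.3f}  (nu   = {nu})")
print(f"Chi-square variance = {chi_var:.3f}  (2*nu = {2 * nu})")
print(f"F mean              = {f_mean:.3f}  (nu2/(nu2-2) = {nu2 / (nu2 - 2):.3f})")
```

Because these are Monte Carlo estimates, the printed values match the theoretical ones only approximately, and the gap narrows as `n_samples` grows.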