Maximum likelihood estimation is a powerful statistical method for finding the parameter values under which the observed data are most probable. It is widely used across fields to estimate parameters and fit models to data.
The method works by maximizing a likelihood function, which represents the probability of observing the data given certain parameter values. MLE has desirable properties like consistency and efficiency, making it a go-to approach for many statistical problems.
Definition of maximum likelihood estimation
- Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution by maximizing a likelihood function
- Aims to find the parameter values that make the observed data most probable under the assumed statistical model
- Widely used in various fields, including statistics, economics, and machine learning, for parameter estimation and model fitting
Principles of maximum likelihood estimation
Likelihood function for a set of parameters
- The likelihood function, denoted as $L(\theta|x)$, quantifies the probability of observing the data $x$ given a set of parameters $\theta$
- Represents the joint probability density function or probability mass function of the observed data, treated as a function of the parameters
- The likelihood function depends on the assumed statistical model and the observed data
Maximizing the likelihood function
- MLE seeks to find the parameter values $\hat{\theta}$ that maximize the likelihood function $L(\theta|x)$
- The maximum likelihood estimates are the parameter values that make the observed data most likely under the assumed model
- Maximizing the likelihood function involves finding the parameter values that yield the highest probability of observing the given data
Log-likelihood function for simplification
- The log-likelihood function, denoted as $\ell(\theta|x) = \log L(\theta|x)$, is often used instead of the likelihood function for computational convenience
- Logarithm is a monotonically increasing function, so maximizing the log-likelihood is equivalent to maximizing the likelihood
- Working with the log-likelihood simplifies the optimization, since it converts products over observations into sums and turns exponential factors into additive terms; it also avoids numerical underflow, as the sketch below illustrates
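The practical payoff is easy to see numerically. The sketch below is a minimal illustration with simulated data and arbitrary parameter values: the product of a thousand normal densities underflows to zero in floating point, while the sum of log-densities stays perfectly usable.

```python
# Minimal sketch: likelihood vs. log-likelihood for an i.i.d. normal sample.
# The simulated data and the parameter values are illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # simulated sample
mu, sigma = 5.0, 2.0                           # parameter values to evaluate

# Likelihood: a product of 1000 small densities underflows to 0.0.
likelihood = np.prod(stats.norm.pdf(x, loc=mu, scale=sigma))

# Log-likelihood: a sum of log-densities stays well scaled.
log_likelihood = np.sum(stats.norm.logpdf(x, loc=mu, scale=sigma))

print(likelihood)      # 0.0 due to numerical underflow
print(log_likelihood)  # a finite value (roughly -2100 for this sample)
```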
Properties of maximum likelihood estimators
Consistency of maximum likelihood estimators
- Consistency is a desirable property of an estimator, indicating that the estimator converges to the true parameter value as the sample size increases
- Under certain regularity conditions, maximum likelihood estimators are consistent, meaning that $\hat{\theta}$ converges in probability to the true parameter value $\theta$ as the sample size tends to infinity
- Consistency ensures that the estimator becomes more accurate as more data is collected
Asymptotic normality of maximum likelihood estimators
- Asymptotic normality refers to the property that the distribution of the maximum likelihood estimator approaches a normal distribution as the sample size increases
- Under certain regularity conditions, $\sqrt{n}(\hat{\theta} - \theta)$ converges in distribution to a normal distribution with mean zero and covariance matrix equal to the inverse of the Fisher information matrix, $I(\theta)^{-1}$
- Asymptotic normality allows for the construction of confidence intervals and hypothesis tests based on the maximum likelihood estimator
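As a concrete illustration of how asymptotic normality is used, the sketch below builds an approximate 95% Wald confidence interval for a Poisson rate, plugging the estimate into the per-observation Fisher information $I(\lambda) = 1/\lambda$; the simulated data and the true rate of 3.0 are arbitrary choices.

```python
# Sketch: Wald confidence interval for a Poisson rate via asymptotic normality.
# Simulated data; the true rate of 3.0 is an arbitrary illustrative choice.
import numpy as np

rng = np.random.default_rng(1)
x = rng.poisson(lam=3.0, size=200)
n = x.size

lam_hat = x.mean()                     # MLE of the Poisson rate (sample mean)
fisher_info = 1.0 / lam_hat            # per-observation Fisher information I(lambda) = 1/lambda
se = np.sqrt(1.0 / (n * fisher_info))  # asymptotic standard error, sqrt(lam_hat / n)

z = 1.96                               # approximate 97.5th percentile of the standard normal
ci_lower, ci_upper = lam_hat - z * se, lam_hat + z * se
print(lam_hat, (ci_lower, ci_upper))
```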
Efficiency of maximum likelihood estimators
- Efficiency measures the precision of an estimator, typically by comparing its variance to the Cramér-Rao lower bound
- Maximum likelihood estimators are asymptotically efficient: as the sample size tends to infinity, their variance attains the Cramér-Rao lower bound, the lowest achievable among consistent estimators
- Efficiency implies that maximum likelihood estimators make the most effective use of the available data in estimating the parameters
Steps in maximum likelihood estimation
Specifying the likelihood function
- The first step in MLE is to specify the likelihood function based on the assumed statistical model and the observed data
- The likelihood function is constructed by expressing the joint probability density function or probability mass function of the data in terms of the unknown parameters
- The functional form of the likelihood function depends on the probability distribution assumed for the data (e.g., normal, binomial, or Poisson)
Taking the log of the likelihood function
- To simplify the optimization process, the logarithm of the likelihood function is often taken, resulting in the log-likelihood function
- The log-likelihood function is mathematically more tractable and converts products into sums, making it easier to differentiate and maximize
- The log-likelihood function retains the same maximum point as the original likelihood function, so maximizing the log-likelihood is equivalent to maximizing the likelihood
Finding the maximum of the log-likelihood function
- The next step is to find the parameter values that maximize the log-likelihood function
- This is typically done by setting the partial derivatives of the log-likelihood function with respect to each parameter equal to zero and solving the resulting system of equations
- When no closed-form solution exists, numerical optimization techniques such as gradient descent or the Newton-Raphson method may be employed to find the maximum likelihood estimates (see the sketch below)
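When the score equations are inconvenient to solve analytically, a generic optimizer applied to the negative log-likelihood is one common practical route. The sketch below shows one way to do this with SciPy, for an exponential model with simulated data; the model, starting value, and bounds are illustrative choices, and the result matches the closed-form estimate $1/\bar{x}$.

```python
# Sketch: numerical maximum likelihood via scipy.optimize.minimize.
# Exponential model with simulated data; all specific values are illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.exponential(scale=1 / 1.5, size=500)   # simulated data, true rate 1.5

def neg_log_likelihood(params):
    lam = params[0]
    # Exponential log-likelihood: n * log(lam) - lam * sum(x)
    return -(x.size * np.log(lam) - lam * x.sum())

# Minimizing the negative log-likelihood is equivalent to maximizing the likelihood.
result = minimize(neg_log_likelihood, x0=[1.0], bounds=[(1e-9, None)])

print(result.x[0])      # numerical MLE of the rate
print(1.0 / x.mean())   # closed-form MLE for comparison
```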
Checking the second-order conditions
- After finding the maximum likelihood estimates, it is important to verify that the obtained values correspond to a maximum and not a minimum or a saddle point
- This is done by checking the second-order conditions, which involve evaluating the second partial derivatives of the log-likelihood function at the estimated parameter values
- The Hessian matrix, containing the second partial derivatives, should be negative definite at the maximum likelihood estimates to confirm a local maximum
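The check can also be done numerically when analytic second derivatives are tedious. The sketch below (simulated data; the finite-difference step size is an arbitrary choice) approximates the Hessian of a normal log-likelihood at the closed-form estimates and confirms that all its eigenvalues are negative.

```python
# Sketch: numerical second-order check at the MLE of a normal model.
# Simulated data; the step size eps is an arbitrary choice for the finite differences.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=400)

def log_likelihood(theta):
    mu, sigma = theta
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2))

# Closed-form MLEs for a normal: sample mean and (maximum likelihood) standard deviation.
theta_hat = np.array([x.mean(), x.std()])

def numerical_hessian(f, theta, eps=1e-4):
    k = theta.size
    hess = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei = np.zeros(k); ei[i] = eps
            ej = np.zeros(k); ej[j] = eps
            # Central-difference approximation of the second partial derivative.
            hess[i, j] = (f(theta + ei + ej) - f(theta + ei - ej)
                          - f(theta - ei + ej) + f(theta - ei - ej)) / (4 * eps**2)
    return hess

H = numerical_hessian(log_likelihood, theta_hat)
print(np.linalg.eigvalsh(H))  # all eigenvalues negative => negative definite => local maximum
```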
Examples of maximum likelihood estimation
Maximum likelihood estimation for normal distribution
- Consider a random sample $X_1, X_2, \ldots, X_n$ from a normal distribution with unknown mean $\mu$ and known variance $\sigma^2$
- The likelihood function is given by $L(\mu|x) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$
- The maximum likelihood estimate of $\mu$ is the sample mean $\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i$
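- Derivation sketch: the log-likelihood is $\ell(\mu|x) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2$; setting $\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu) = 0$ gives $\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i$, and $\frac{\partial^2 \ell}{\partial \mu^2} = -\frac{n}{\sigma^2} < 0$ confirms a maximum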
Maximum likelihood estimation for binomial distribution
- Consider a random sample of $n$ independent Bernoulli trials with unknown success probability $p$, resulting in $k$ successes
- The likelihood function is given by $L(p|k) = \binom{n}{k} p^k (1-p)^{n-k}$
- The maximum likelihood estimate of $p$ is the sample proportion $\hat{p} = \frac{k}{n}$
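- Derivation sketch: the log-likelihood is $\ell(p|k) = \log\binom{n}{k} + k\log p + (n-k)\log(1-p)$; setting $\frac{\partial \ell}{\partial p} = \frac{k}{p} - \frac{n-k}{1-p} = 0$ gives $\hat{p} = \frac{k}{n}$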
Maximum likelihood estimation for Poisson distribution
- Consider a random sample $X_1, X_2, \ldots, X_n$ from a Poisson distribution with unknown rate parameter $\lambda$
- The likelihood function is given by $L(\lambda|x) = \prod_{i=1}^n \frac{\lambda^{x_i} e^{-\lambda}}{x_i!}$
- The maximum likelihood estimate of $\lambda$ is the sample mean $\hat{\lambda} = \frac{1}{n} \sum_{i=1}^n x_i$
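- Derivation sketch: the log-likelihood is $\ell(\lambda|x) = \sum_{i=1}^n \left(x_i\log\lambda - \lambda - \log x_i!\right)$; setting $\frac{\partial \ell}{\partial \lambda} = \frac{1}{\lambda}\sum_{i=1}^n x_i - n = 0$ gives $\hat{\lambda} = \frac{1}{n}\sum_{i=1}^n x_i$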
Advantages of maximum likelihood estimation
Asymptotic properties of maximum likelihood estimators
- Maximum likelihood estimators possess desirable asymptotic properties, such as consistency, asymptotic normality, and asymptotic efficiency
- These properties ensure that the estimators converge to the true parameter values, are approximately normally distributed, and attain the lowest achievable variance as the sample size increases
- Asymptotic properties provide a theoretical justification for the use of maximum likelihood estimators in large samples
Invariance property of maximum likelihood estimators
- The invariance property states that if $\hat{\theta}$ is the maximum likelihood estimator of $\theta$, then for any function $g(\theta)$, the maximum likelihood estimator of $g(\theta)$ is $g(\hat{\theta})$
- This property allows for the estimation of functions of parameters without the need to rederive the maximum likelihood estimator
- The invariance property simplifies the estimation process and ensures consistency in the estimation of related quantities
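- Worked example: if $\hat{\lambda}$ is the maximum likelihood estimate of a Poisson rate, then by invariance the maximum likelihood estimate of $P(X = 0) = e^{-\lambda}$ is simply $e^{-\hat{\lambda}}$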
Handling missing data with maximum likelihood estimation
- Maximum likelihood estimation can handle missing data through techniques such as the Expectation-Maximization (EM) algorithm
- In the E-step, the algorithm computes the expected contribution of the missing data given the observed data and the current parameter estimates; in the M-step, it updates the parameter estimates by maximizing the resulting expected complete-data log-likelihood
- MLE with the EM algorithm provides a principled approach to dealing with missing data, making efficient use of the available information
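To make the E-step/M-step loop concrete, here is a minimal illustrative sketch of EM for a two-component Gaussian mixture, where the unobserved component labels play the role of the missing data; the simulated data, the data-driven starting values, and the fixed iteration count are all arbitrary choices rather than anything prescribed above.

```python
# Illustrative EM sketch: two-component Gaussian mixture in one dimension.
# The unobserved component labels act as the "missing data".
# Simulated data, starting values, and iteration count are arbitrary choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.5, 200)])

# Initial guesses: mixing weight, component means, component standard deviations.
w = 0.5
mu = np.array([x.min(), x.max()])
sigma = np.array([x.std(), x.std()])

for _ in range(200):
    # E-step: responsibility of component 1 for each observation,
    # given the current parameter estimates.
    d0 = (1 - w) * stats.norm.pdf(x, mu[0], sigma[0])
    d1 = w * stats.norm.pdf(x, mu[1], sigma[1])
    gamma = d1 / (d0 + d1)

    # M-step: update the parameters using the responsibilities as weights.
    w = gamma.mean()
    mu = np.array([np.average(x, weights=1 - gamma),
                   np.average(x, weights=gamma)])
    sigma = np.sqrt(np.array([np.average((x - mu[0]) ** 2, weights=1 - gamma),
                              np.average((x - mu[1]) ** 2, weights=gamma)]))

print(w, mu, sigma)  # estimated mixing weight, means, and standard deviations
```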
Limitations of maximum likelihood estimation
Sensitivity to model misspecification
- Maximum likelihood estimation relies on the assumption that the chosen statistical model correctly describes the data generating process
- If the model is misspecified, meaning that the assumed probability distribution does not match the true distribution of the data, the maximum likelihood estimates may be biased or inconsistent
- Model misspecification can lead to incorrect inferences and suboptimal performance of the estimators
Computational complexity in high-dimensional problems
- In high-dimensional problems, where the number of parameters is large relative to the sample size, maximum likelihood estimation can become computationally challenging
- The optimization process may require the evaluation of complex likelihood functions and the computation of high-dimensional gradients and Hessian matrices
- Computational complexity can limit the scalability of maximum likelihood estimation in large-scale applications
Existence and uniqueness of maximum likelihood estimators
- In some cases, the maximum likelihood estimator may not exist or may not be unique
- Non-existence can occur when the likelihood function does not attain a maximum within the parameter space, for example when the likelihood is unbounded (as in a Gaussian mixture whose likelihood diverges when one component's variance shrinks toward zero)
- Non-uniqueness can arise when multiple parameter values yield the same maximized likelihood, which is often a symptom of identifiability problems in the model
- Existence and uniqueness problems can complicate the interpretation and reliability of the maximum likelihood estimates
Comparison of maximum likelihood estimation with other estimation methods
Maximum likelihood estimation vs method of moments
- The method of moments (MOM) is another estimation technique that equates sample moments to population moments to estimate parameters
- MOM is often simpler to compute than MLE, as it does not require the specification of a likelihood function
- However, MLE is generally more efficient than MOM, especially in large samples, and can handle more complex models and missing data
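- A standard illustration of the efficiency gap: for a random sample from a Uniform$(0, \theta)$ distribution, the method of moments estimator is $\tilde{\theta} = 2\bar{x}$ (equating the sample mean to $\theta/2$), while the maximum likelihood estimator is $\hat{\theta} = \max_i x_i$, whose sampling error shrinks at rate $1/n$ rather than $1/\sqrt{n}$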
Maximum likelihood estimation vs Bayesian estimation
- Bayesian estimation incorporates prior knowledge about the parameters through a prior distribution and updates the estimates based on the observed data
- Bayesian estimation provides a posterior distribution of the parameters, allowing for uncertainty quantification and credible intervals
- MLE coincides with the Bayesian maximum a posteriori (MAP) estimate under a flat (uniform) prior, since it focuses solely on the likelihood of the data
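- In symbols, Bayesian estimation works with the posterior $p(\theta|x) \propto L(\theta|x)\,p(\theta)$; under a flat prior $p(\theta) \propto 1$ the posterior is proportional to the likelihood, so its mode coincides with the maximum likelihood estimate $\hat{\theta}$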
Maximum likelihood estimation vs least squares estimation
- Least squares estimation minimizes the sum of squared differences between the observed data and the predicted values from a model
- Least squares estimation is commonly used in regression analysis and is computationally simpler than MLE
- MLE is more general and can be applied to a wider range of statistical models, including those with non-normal errors or complex likelihood functions
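- Worked connection: under a linear model with independent normal errors, the log-likelihood is $\ell(\beta, \sigma^2|y) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i^\top \beta)^2$, so maximizing over $\beta$ is exactly minimizing the sum of squared residuals; ordinary least squares is therefore the maximum likelihood estimator in that special case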