Stochastic Processes

Hidden Markov models are probabilistic tools used to analyze sequences of observations linked to hidden states. They're crucial in fields like speech recognition and bioinformatics, helping us understand complex systems with underlying, unseen processes.

HMMs consist of state and observation spaces, initial state distribution, and transition and observation probabilities. They assume the Markov property for states and conditional independence for observations, enabling efficient computation and inference in various applications.

Definition of hidden Markov models

  • Hidden Markov models (HMMs) are a class of probabilistic models used to analyze and predict sequences of observations that depend on underlying hidden states
  • HMMs are widely used in various domains of stochastic processes, including speech recognition, bioinformatics, and financial modeling
  • The key idea behind HMMs is that the observed data is generated by a hidden process that transitions between different states according to certain probabilities

Components of HMMs

State space

  • The state space is the set of possible hidden states in the model
  • Each state represents a distinct configuration or mode of the system being modeled
  • The states are not directly observable but influence the generation of the observed data
  • Examples of state spaces include different phonemes in speech recognition or different market regimes in financial modeling

Observation space

  • The observation space is the set of possible observable outputs or emissions from each state
  • Each state can emit observations according to a probability distribution associated with that state
  • The observations provide indirect information about the underlying hidden states
  • Examples of observation spaces include acoustic features in speech recognition or stock price movements in financial modeling

Initial state distribution

  • The initial state distribution specifies the probabilities of starting in each state at the beginning of the sequence
  • It represents the prior knowledge or belief about the initial state of the system
  • The initial state distribution is typically denoted as a vector $\pi$ where $\pi_i$ represents the probability of starting in state $i$

State transition probabilities

  • The state transition probabilities define the likelihood of transitioning from one state to another over time
  • They capture the temporal dynamics and dependencies between the hidden states
  • The transition probabilities are typically represented as a matrix $A$ where $a_{ij}$ denotes the probability of transitioning from state $i$ to state $j$
  • The transition probabilities satisfy the properties of a stochastic matrix, where each row sums to 1

Observation probabilities

  • The observation probabilities specify the likelihood of emitting each observation given a particular hidden state
  • They capture the relationship between the hidden states and the observable data
  • The observation probabilities are typically represented as a matrix $B$ where $b_j(k)$ denotes the probability of emitting the $k$-th observation symbol given state $j$
  • The observation probabilities can be discrete probability distributions or continuous probability density functions depending on the nature of the observations
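Putting the components together, a discrete HMM is fully specified by $\lambda = (\pi, A, B)$. The following minimal sketch defines a toy two-state model in plain Python; the states, symbols, and all numbers are hypothetical and chosen only for illustration:

```python
# Toy two-state HMM: hidden states {0, 1} (e.g. two hypothetical market
# regimes) and observation symbols {0, 1}. All probabilities are illustrative.
pi = [0.6, 0.4]          # initial state distribution: pi[i] = P(q_1 = i)
A = [[0.7, 0.3],         # transition matrix: A[i][j] = P(q_{t+1} = j | q_t = i)
     [0.4, 0.6]]
B = [[0.9, 0.1],         # emission matrix: B[i][k] = P(o_t = k | q_t = i)
     [0.2, 0.8]]

# pi and each row of A and B must be probability distributions (sum to 1),
# reflecting the stochastic-matrix property noted above.
assert abs(sum(pi) - 1.0) < 1e-12
for row in A + B:
    assert abs(sum(row) - 1.0) < 1e-12
```

The same parameter layout is assumed by the algorithm sketches later in these notes.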

Assumptions in HMMs

Markov property of states

  • The Markov property assumes that the current state depends only on the previous state, not on the entire history of states
  • This assumption simplifies the model and enables efficient computation of probabilities
  • Mathematically, $P(q_t | q_{t-1}, q_{t-2}, ..., q_1) = P(q_t | q_{t-1})$, where $q_t$ represents the state at time $t$
  • The Markov property allows for tractable inference and learning in HMMs

Conditional independence of observations

  • The conditional independence assumption states that the current observation depends only on the current state, not on the previous states or observations
  • Given the current state, the observations are assumed to be independent of each other
  • Mathematically, $P(o_t | q_t, q_{t-1}, ..., q_1, o_{t-1}, ..., o_1) = P(o_t | q_t)$, where $o_t$ represents the observation at time $t$
  • This assumption simplifies the computation of observation probabilities and enables efficient inference algorithms
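Together, the two assumptions let the joint probability of a state path and observation sequence factor as $P(Q, O \mid \lambda) = \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(o_t)$. A minimal sketch, reusing the toy parameters from above (all numbers hypothetical):

```python
from itertools import product

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]

def joint_prob(states, obs):
    """P(Q, O | lambda): the factorization implied by the Markov property
    (states) and conditional independence (observations)."""
    p = pi[states[0]] * B[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
    return p

obs = [0, 0, 1]
# Summing the joint over all state paths marginalizes out Q, giving P(O | lambda)
# by the law of total probability (brute force; the forward algorithm below
# computes the same quantity efficiently).
total = sum(joint_prob(list(q), obs) for q in product(range(2), repeat=len(obs)))
```

This brute-force marginalization is exponential in $T$, which is exactly the cost the forward algorithm avoids.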

Types of HMM problems

Evaluation problem

  • The evaluation problem aims to compute the likelihood of observing a given sequence of observations given the parameters of the HMM
  • It involves calculating $P(O | \lambda)$, where $O$ is the observation sequence and $\lambda$ represents the HMM parameters
  • The evaluation problem is solved using the forward algorithm, which efficiently computes the likelihood by exploiting the recursive structure of the HMM

Decoding problem

  • The decoding problem seeks to find the most likely sequence of hidden states that generated a given observation sequence
  • It involves finding $\arg\max_Q P(Q | O, \lambda)$, where $Q$ is a sequence of states and $O$ is the observation sequence
  • The decoding problem is solved using the Viterbi algorithm, which efficiently computes the most probable state sequence using dynamic programming

Learning problem

  • The learning problem involves estimating the parameters of an HMM given a set of observation sequences
  • It aims to find the parameters $\lambda$ that maximize the likelihood of the observed data
  • The learning problem is typically solved using the Baum-Welch algorithm, which is an instance of the Expectation-Maximization (EM) algorithm
  • The Baum-Welch algorithm iteratively updates the HMM parameters to improve the likelihood of the observed sequences

Forward algorithm for evaluation

Forward variables

  • The forward variables, denoted as $\alpha_t(i)$, represent the probability of being in state $i$ at time $t$ and observing the partial sequence $o_1, o_2, ..., o_t$
  • Mathematically, $\alpha_t(i) = P(o_1, o_2, ..., o_t, q_t = i | \lambda)$
  • The forward variables capture the likelihood of the partial observation sequence up to time $t$ and ending in state $i$

Recursive computation

  • The forward variables are computed recursively using the following steps:
    1. Initialization: $\alpha_1(i) = \pi_i b_i(o_1)$ for all states $i$
    2. Recursion: $\alpha_{t+1}(j) = \left[ \sum_{i=1}^N \alpha_t(i) a_{ij} \right] b_j(o_{t+1})$ for $t = 1, 2, ..., T-1$ and all states $j$
    3. Termination: $P(O | \lambda) = \sum_{i=1}^N \alpha_T(i)$
  • The recursive computation efficiently computes the forward variables by exploiting the Markov property and the conditional independence of observations

Termination and likelihood

  • The termination step computes the likelihood of the entire observation sequence by summing the forward variables at the final time step
  • Mathematically, $P(O | \lambda) = \sum_{i=1}^N \alpha_T(i)$
  • The likelihood provides a measure of how well the HMM fits the observed data and can be used for model selection or comparison
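The three steps above translate directly into code. A minimal sketch of the forward algorithm for a discrete HMM, using the toy parameters from earlier (all numbers hypothetical):

```python
def forward(obs, pi, A, B):
    """Forward algorithm: returns P(O | lambda) in O(N^2 T) time."""
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    # Recursion: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(o_{t+1})
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha[T - 1])

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
likelihood = forward([0, 0, 1], pi, A, B)
```

In practice the $\alpha_t(i)$ values underflow for long sequences, so implementations typically rescale each time step or work in log space.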

Viterbi algorithm for decoding

Viterbi variables

  • The Viterbi variables, denoted as $\delta_t(i)$, represent the probability of the most likely state sequence ending in state $i$ at time $t$ and observing the partial sequence $o_1, o_2, ..., o_t$
  • Mathematically, $\delta_t(i) = \max_{q_1, q_2, ..., q_{t-1}} P(q_1, q_2, ..., q_{t-1}, q_t = i, o_1, o_2, ..., o_t | \lambda)$
  • The Viterbi variables keep track of the most probable path leading to each state at each time step

Recursive computation

  • The Viterbi variables are computed recursively using the following steps:
    1. Initialization: $\delta_1(i) = \pi_i b_i(o_1)$ for all states $i$
    2. Recursion: $\delta_{t+1}(j) = \max_{i=1}^N [\delta_t(i) a_{ij}] b_j(o_{t+1})$ for $t = 1, 2, ..., T-1$ and all states $j$
    3. Termination: $P^* = \max_{i=1}^N \delta_T(i)$ and $q_T^* = \arg\max_{i=1}^N \delta_T(i)$
  • The recursive computation efficiently computes the Viterbi variables by keeping track of the most probable path leading to each state at each time step

Termination and optimal state sequence

  • The termination step computes the probability of the most likely state sequence and the corresponding final state
  • The optimal state sequence is obtained by backtracking from the final state using the argument that maximized the Viterbi variables at each time step
  • The Viterbi algorithm provides the most likely explanation of the observed data in terms of the hidden states
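The Viterbi recursion mirrors the forward algorithm, with the sum replaced by a max and a backpointer table for the backtracking step. A minimal sketch on the same toy model (all numbers hypothetical):

```python
def viterbi(obs, pi, A, B):
    """Viterbi algorithm: returns (most likely state path, its probability P*)."""
    N, T = len(pi), len(obs)
    delta = [[0.0] * N for _ in range(T)]
    psi = [[0] * N for _ in range(T)]   # backpointers: best predecessor state
    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    for i in range(N):
        delta[0][i] = pi[i] * B[i][obs[0]]
    # Recursion: delta_{t+1}(j) = max_i [delta_t(i) a_ij] * b_j(o_{t+1})
    for t in range(1, T):
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            delta[t][j] = delta[t - 1][best_i] * A[best_i][j] * B[j][obs[t]]
            psi[t][j] = best_i
    # Termination and backtracking
    path = [max(range(N), key=lambda i: delta[T - 1][i])]
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    return list(reversed(path)), max(delta[T - 1])

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
best_path, p_star = viterbi([0, 0, 1], pi, A, B)
```

Note the only structural differences from the forward pass: `max` instead of `sum`, plus the `psi` table that makes backtracking possible.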

Baum-Welch algorithm for learning

Forward and backward variables

  • The Baum-Welch algorithm uses both forward variables $\alpha_t(i)$ and backward variables $\beta_t(i)$
  • The backward variables $\beta_t(i)$ represent the probability of observing the partial sequence $o_{t+1}, o_{t+2}, ..., o_T$ given being in state $i$ at time $t$
  • Mathematically, $\beta_t(i) = P(o_{t+1}, o_{t+2}, ..., o_T | q_t = i, \lambda)$
  • The backward variables are computed recursively in a similar manner to the forward variables, but in the reverse direction

Expectation step

  • The expectation step computes the expected number of transitions between states and the expected number of emissions of each observation from each state
  • It uses the forward and backward variables to compute the posterior probabilities of being in each state at each time step and transitioning between states
  • The expected counts are computed using the formulas:
    • $\gamma_t(i) = \frac{\alpha_t(i) \beta_t(i)}{\sum_{j=1}^N \alpha_t(j) \beta_t(j)}$
    • $\xi_t(i,j) = \frac{\alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)}{\sum_{i=1}^N \sum_{j=1}^N \alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)}$
  • The expected counts provide the sufficient statistics for updating the HMM parameters in the maximization step

Maximization step

  • The maximization step updates the HMM parameters based on the expected counts computed in the expectation step
  • The updated parameters are computed using the following formulas:
    • $\pi_i = \gamma_1(i)$
    • $a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$
    • $b_j(k) = \frac{\sum_{t=1}^T \gamma_t(j) \, \mathbf{1}\{o_t = v_k\}}{\sum_{t=1}^T \gamma_t(j)}$, where $v_k$ is the $k$-th observation symbol and $\mathbf{1}\{\cdot\}$ is the indicator function
  • The updated parameters maximize the likelihood of the observed data given the current estimates of the hidden states and transitions

Convergence and learned parameters

  • The Baum-Welch algorithm iterates between the expectation and maximization steps until convergence
  • Convergence is typically determined by monitoring the improvement in the likelihood or by setting a maximum number of iterations
  • The learned parameters represent the best estimate of the HMM that explains the observed data
  • The learned parameters can be used for various tasks such as evaluation, decoding, or generating new sequences
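A single EM iteration (E-step plus M-step) for one discrete observation sequence can be sketched as follows. This is an unscaled textbook version, assuming the toy parameter layout from earlier; production implementations rescale or use log space to avoid underflow:

```python
def baum_welch_step(obs, pi, A, B):
    """One Baum-Welch (EM) iteration for a single discrete sequence.
    Returns updated (pi, A, B) and the likelihood under the *input* parameters."""
    N, T, M = len(pi), len(obs), len(B[0])
    # E-step: forward and backward variables
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
        beta[T - 1][i] = 1.0
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
    likelihood = sum(alpha[T - 1])
    # Posterior state and transition probabilities (gamma and xi formulas above);
    # the normalizer sum_j alpha_t(j) beta_t(j) equals P(O | lambda) for every t.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # M-step: re-estimate parameters from the expected counts
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return new_pi, new_A, new_B, likelihood

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 0, 1, 1, 0]
pi1, A1, B1, lik1 = baum_welch_step(obs, pi, A, B)
```

Iterating this step never decreases the likelihood (the EM monotonicity guarantee), which is what makes the simple convergence checks described above sufficient in practice.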

Extensions of HMMs

Higher-order Markov models

  • Higher-order Markov models relax the assumption of the first-order Markov property
  • They allow the current state to depend on multiple previous states instead of just the immediate predecessor
  • Higher-order models can capture more complex temporal dependencies but at the cost of increased computational complexity and parameter estimation challenges

Input-output HMMs

  • Input-output HMMs (IOHMMs) extend the basic HMM by incorporating additional input variables that influence the state transitions and observations
  • IOHMMs allow the modeling of systems where the hidden states and observations depend on external factors or control inputs
  • Examples of IOHMMs include speech recognition systems that adapt to speaker characteristics or financial models that incorporate macroeconomic variables

Factorial HMMs

  • Factorial HMMs (FHMMs) represent the hidden state space as a Cartesian product of multiple independent state variables
  • Each state variable evolves independently according to its own Markov chain, and the observations depend on the combined state of all variables
  • FHMMs can model complex systems with multiple interacting processes or factors
  • Examples of FHMMs include modeling multiple speech sources in audio separation or analyzing multiple interacting genetic regulatory networks

Applications of HMMs

Speech recognition

  • HMMs are widely used in speech recognition systems to model the acoustic and linguistic properties of speech
  • The hidden states represent different phonemes or sub-word units, and the observations are acoustic features extracted from the speech signal
  • HMMs enable the recognition of spoken words or phrases by finding the most likely sequence of states given the observed acoustic features

Bioinformatics and sequence analysis

  • HMMs are applied in bioinformatics to analyze and predict biological sequences such as DNA, RNA, or protein sequences
  • The hidden states can represent different functional or structural regions in the sequences, and the observations are the individual nucleotides or amino acids
  • HMMs are used for tasks such as gene prediction, protein family classification, or identifying conserved patterns in sequences

Financial modeling and regime switching

  • HMMs are used in financial modeling to capture the dynamics of market regimes or hidden economic states
  • The hidden states represent different market conditions or economic regimes, and the observations are financial time series data such as stock prices or economic indicators
  • HMMs enable the identification of regime shifts, volatility clustering, or long-term dependencies in financial markets

Limitations and challenges of HMMs

Model selection and complexity

  • Choosing the appropriate number of hidden states and the structure of the HMM is a challenging task
  • Model selection techniques such as cross-validation, information criteria, or Bayesian methods can be used to compare different HMM architectures
  • Overly complex models may lead to overfitting and poor generalization, while overly simplistic models may not capture the underlying dynamics adequately

Identifiability issues

  • HMMs may suffer from identifiability issues, where different parameter settings can lead to the same observable behavior
  • Identifiability problems can arise due to the presence of symmetries or redundancies in the model structure
  • Identifiability issues can complicate parameter estimation and interpretation of the learned model

Computational complexity and scalability

  • The computational complexity of HMM algorithms grows with the number of states and the length of the observation sequences
  • The forward, Viterbi, and Baum-Welch algorithms have a time complexity of $O(N^2T)$, where $N$ is the number of states and $T$ is the sequence length
  • Scaling HMMs to large state spaces or long sequences can be computationally challenging and may require approximation techniques or parallel computing approaches
  • Techniques such as beam search, pruning, or variational inference can be used to mitigate the computational burden in large-scale applications