Stochastic Processes

Hidden Markov models are probabilistic tools used to analyze sequences of observations linked to hidden states. They're crucial in fields like speech recognition and bioinformatics, helping us understand complex systems with underlying, unseen processes.

HMMs consist of state and observation spaces, initial state distribution, and transition and observation probabilities. They assume the Markov property for states and conditional independence for observations, enabling efficient computation and inference in various applications.

Definition of hidden Markov models

  • Hidden Markov models (HMMs) are a class of probabilistic models used to analyze and predict sequences of observations that depend on underlying hidden states
  • HMMs are widely used in various domains of stochastic processes, including speech recognition, bioinformatics, and financial modeling
  • The key idea behind HMMs is that the observed data is generated by a hidden process that transitions between different states according to certain probabilities

Components of HMMs

State space

  • The state space is the set of possible hidden states in the model
  • Each state represents a distinct configuration or mode of the system being modeled
  • The states are not directly observable but influence the generation of the observed data
  • Examples of state spaces include different phonemes in speech recognition or different market regimes in financial modeling

Observation space

  • The observation space is the set of possible observable outputs or emissions from each state
  • Each state can emit observations according to a probability distribution associated with that state
  • The observations provide indirect information about the underlying hidden states
  • Examples of observation spaces include acoustic features in speech recognition or stock price movements in financial modeling

Initial state distribution

  • The initial state distribution specifies the probabilities of starting in each state at the beginning of the sequence
  • It represents the prior knowledge or belief about the initial state of the system
  • The initial state distribution is typically denoted as a vector $\pi$ where $\pi_i$ represents the probability of starting in state $i$

State transition probabilities

  • The state transition probabilities define the likelihood of transitioning from one state to another over time
  • They capture the temporal dynamics and dependencies between the hidden states
  • The transition probabilities are typically represented as a matrix $A$ where $a_{ij}$ denotes the probability of transitioning from state $i$ to state $j$
  • The transition probabilities satisfy the properties of a stochastic matrix, where each row sums to 1

Observation probabilities

  • The observation probabilities specify the likelihood of emitting each observation given a particular hidden state
  • They capture the relationship between the hidden states and the observable data
  • The observation probabilities are typically represented as a matrix $B$ where $b_j(k)$ denotes the probability of emitting the $k$-th observation symbol given state $j$
  • The observation probabilities can be discrete probability distributions or continuous probability density functions depending on the nature of the observations
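Putting the components together, a discrete HMM is fully specified by $\lambda = (\pi, A, B)$. The following minimal sketch defines a toy two-state model in plain Python; the states, symbols, and all numbers are hypothetical and chosen only for illustration:

```python
# Toy two-state HMM: hidden states {0, 1} (e.g. two hypothetical market
# regimes) and observation symbols {0, 1}. All probabilities are illustrative.
pi = [0.6, 0.4]          # initial state distribution: pi[i] = P(q_1 = i)
A = [[0.7, 0.3],         # transition matrix: A[i][j] = P(q_{t+1} = j | q_t = i)
     [0.4, 0.6]]
B = [[0.9, 0.1],         # emission matrix: B[i][k] = P(o_t = k | q_t = i)
     [0.2, 0.8]]

# pi and each row of A and B must be probability distributions (sum to 1),
# reflecting the stochastic-matrix property noted above.
assert abs(sum(pi) - 1.0) < 1e-12
for row in A + B:
    assert abs(sum(row) - 1.0) < 1e-12
```

The same parameter layout is assumed by the algorithm sketches later in these notes.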

Assumptions in HMMs

Markov property of states

  • The Markov property assumes that the current state depends only on the previous state, not on the entire history of states
  • This assumption simplifies the model and enables efficient computation of probabilities
  • Mathematically, $P(q_t | q_{t-1}, q_{t-2}, ..., q_1) = P(q_t | q_{t-1})$, where $q_t$ represents the state at time $t$
  • The Markov property allows for tractable inference and learning in HMMs

Conditional independence of observations

  • The conditional independence assumption states that the current observation depends only on the current state, not on the previous states or observations
  • Given the current state, the observations are assumed to be independent of each other
  • Mathematically, $P(o_t | q_t, q_{t-1}, ..., q_1, o_{t-1}, ..., o_1) = P(o_t | q_t)$, where $o_t$ represents the observation at time $t$
  • This assumption simplifies the computation of observation probabilities and enables efficient inference algorithms
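Together, the two assumptions let the joint probability of a state path and observation sequence factor as $P(Q, O \mid \lambda) = \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(o_t)$. A minimal sketch, reusing the toy parameters from above (all numbers hypothetical):

```python
from itertools import product

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]

def joint_prob(states, obs):
    """P(Q, O | lambda): the factorization implied by the Markov property
    (states) and conditional independence (observations)."""
    p = pi[states[0]] * B[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
    return p

obs = [0, 0, 1]
# Summing the joint over all state paths marginalizes out Q, giving P(O | lambda)
# by the law of total probability (brute force; the forward algorithm below
# computes the same quantity efficiently).
total = sum(joint_prob(list(q), obs) for q in product(range(2), repeat=len(obs)))
```

This brute-force marginalization is exponential in $T$, which is exactly the cost the forward algorithm avoids.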

Types of HMM problems

Evaluation problem

  • The evaluation problem aims to compute the likelihood of observing a given sequence of observations given the parameters of the HMM
  • It involves calculating $P(O | \lambda)$, where $O$ is the observation sequence and $\lambda$ represents the HMM parameters
  • The evaluation problem is solved using the forward algorithm, which efficiently computes the likelihood by exploiting the recursive structure of the HMM

Decoding problem

  • The decoding problem seeks to find the most likely sequence of hidden states that generated a given observation sequence
  • It involves finding $\arg\max_Q P(Q | O, \lambda)$, where $Q$ is a sequence of states and $O$ is the observation sequence
  • The decoding problem is solved using the Viterbi algorithm, which efficiently computes the most probable state sequence using dynamic programming

Learning problem

  • The learning problem involves estimating the parameters of an HMM given a set of observation sequences
  • It aims to find the parameters $\lambda$ that maximize the likelihood of the observed data
  • The learning problem is typically solved using the Baum-Welch algorithm, which is an instance of the Expectation-Maximization (EM) algorithm
  • The Baum-Welch algorithm iteratively updates the HMM parameters to improve the likelihood of the observed sequences

Forward algorithm for evaluation

Forward variables

  • The forward variables, denoted as $\alpha_t(i)$, represent the probability of being in state $i$ at time $t$ and observing the partial sequence $o_1, o_2, ..., o_t$
  • Mathematically, $\alpha_t(i) = P(o_1, o_2, ..., o_t, q_t = i | \lambda)$
  • The forward variables capture the likelihood of the partial observation sequence up to time $t$ and ending in state $i$

Recursive computation

  • The forward variables are computed recursively using the following steps:
    1. Initialization: $\alpha_1(i) = \pi_i b_i(o_1)$ for all states $i$
    2. Recursion: $\alpha_{t+1}(j) = \left[ \sum_{i=1}^N \alpha_t(i) a_{ij} \right] b_j(o_{t+1})$ for $t = 1, 2, ..., T-1$ and all states $j$
    3. Termination: $P(O | \lambda) = \sum_{i=1}^N \alpha_T(i)$
  • The recursive computation efficiently computes the forward variables by exploiting the Markov property and the conditional independence of observations

Termination and likelihood

  • The termination step computes the likelihood of the entire observation sequence by summing the forward variables at the final time step
  • Mathematically, $P(O | \lambda) = \sum_{i=1}^N \alpha_T(i)$
  • The likelihood provides a measure of how well the HMM fits the observed data and can be used for model selection or comparison
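The three steps above translate directly into code. A minimal sketch of the forward algorithm for a discrete HMM, using the toy parameters from earlier (all numbers hypothetical):

```python
def forward(obs, pi, A, B):
    """Forward algorithm: returns P(O | lambda) in O(N^2 T) time."""
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    # Recursion: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(o_{t+1})
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha[T - 1])

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
likelihood = forward([0, 0, 1], pi, A, B)
```

In practice the $\alpha_t(i)$ values underflow for long sequences, so implementations typically rescale each time step or work in log space.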

Viterbi algorithm for decoding

Viterbi variables

  • The Viterbi variables, denoted as $\delta_t(i)$, represent the probability of the most likely state sequence ending in state $i$ at time $t$ and observing the partial sequence $o_1, o_2, ..., o_t$
  • Mathematically, $\delta_t(i) = \max_{q_1, q_2, ..., q_{t-1}} P(q_1, q_2, ..., q_{t-1}, q_t = i, o_1, o_2, ..., o_t | \lambda)$
  • The Viterbi variables keep track of the most probable path leading to each state at each time step

Recursive computation

  • The Viterbi variables are computed recursively using the following steps:
    1. Initialization: $\delta_1(i) = \pi_i b_i(o_1)$ for all states $i$
    2. Recursion: $\delta_{t+1}(j) = \max_{i=1}^N [\delta_t(i) a_{ij}] b_j(o_{t+1})$ for $t = 1, 2, ..., T-1$ and all states $j$
    3. Termination: $P^* = \max_{i=1}^N \delta_T(i)$ and $q_T^* = \arg\max_{i=1}^N \delta_T(i)$
  • The recursive computation efficiently computes the Viterbi variables by keeping track of the most probable path leading to each state at each time step

Termination and optimal state sequence

  • The termination step computes the probability of the most likely state sequence and the corresponding final state
  • The optimal state sequence is obtained by backtracking from the final state using the argument that maximized the Viterbi variables at each time step
  • The Viterbi algorithm provides the most likely explanation of the observed data in terms of the hidden states
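The Viterbi recursion mirrors the forward algorithm, with the sum replaced by a max and a backpointer table for the backtracking step. A minimal sketch on the same toy model (all numbers hypothetical):

```python
def viterbi(obs, pi, A, B):
    """Viterbi algorithm: returns (most likely state path, its probability P*)."""
    N, T = len(pi), len(obs)
    delta = [[0.0] * N for _ in range(T)]
    psi = [[0] * N for _ in range(T)]   # backpointers: best predecessor state
    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    for i in range(N):
        delta[0][i] = pi[i] * B[i][obs[0]]
    # Recursion: delta_{t+1}(j) = max_i [delta_t(i) a_ij] * b_j(o_{t+1})
    for t in range(1, T):
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            delta[t][j] = delta[t - 1][best_i] * A[best_i][j] * B[j][obs[t]]
            psi[t][j] = best_i
    # Termination and backtracking
    path = [max(range(N), key=lambda i: delta[T - 1][i])]
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    return list(reversed(path)), max(delta[T - 1])

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
best_path, p_star = viterbi([0, 0, 1], pi, A, B)
```

Note the only structural differences from the forward pass: `max` instead of `sum`, plus the `psi` table that makes backtracking possible.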

Baum-Welch algorithm for learning

Forward and backward variables

  • The Baum-Welch algorithm uses both forward variables $\alpha_t(i)$ and backward variables $\beta_t(i)$
  • The backward variables $\beta_t(i)$ represent the probability of observing the partial sequence $o_{t+1}, o_{t+2}, ..., o_T$ given being in state $i$ at time $t$
  • Mathematically, $\beta_t(i) = P(o_{t+1}, o_{t+2}, ..., o_T | q_t = i, \lambda)$
  • The backward variables are computed recursively in a similar manner to the forward variables, but in the reverse direction

Expectation step

  • The expectation step computes the expected number of transitions between states and the expected number of emissions of each observation from each state
  • It uses the forward and backward variables to compute the posterior probabilities of being in each state at each time step and transitioning between states
  • The expected counts are computed using the formulas:
    • $\gamma_t(i) = \frac{\alpha_t(i) \beta_t(i)}{\sum_{j=1}^N \alpha_t(j) \beta_t(j)}$
    • $\xi_t(i,j) = \frac{\alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)}{\sum_{i=1}^N \sum_{j=1}^N \alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)}$
  • The expected counts provide the sufficient statistics for updating the HMM parameters in the maximization step

Maximization step

  • The maximization step updates the HMM parameters based on the expected counts computed in the expectation step
  • The updated parameters are computed using the following formulas:
    • $\pi_i = \gamma_1(i)$
    • $a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$
    • $b_j(k) = \frac{\sum_{t=1}^T \gamma_t(j) \, \mathbf{1}\{o_t = v_k\}}{\sum_{t=1}^T \gamma_t(j)}$, where $v_k$ is the $k$-th observation symbol and $\mathbf{1}\{\cdot\}$ is the indicator function
  • The updated parameters maximize the likelihood of the observed data given the current estimates of the hidden states and transitions

Convergence and learned parameters

  • The Baum-Welch algorithm iterates between the expectation and maximization steps until convergence
  • Convergence is typically determined by monitoring the improvement in the likelihood or by setting a maximum number of iterations
  • The learned parameters represent the best estimate of the HMM that explains the observed data
  • The learned parameters can be used for various tasks such as evaluation, decoding, or generating new sequences
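A single EM iteration (E-step plus M-step) for one discrete observation sequence can be sketched as follows. This is an unscaled textbook version, assuming the toy parameter layout from earlier; production implementations rescale or use log space to avoid underflow:

```python
def baum_welch_step(obs, pi, A, B):
    """One Baum-Welch (EM) iteration for a single discrete sequence.
    Returns updated (pi, A, B) and the likelihood under the *input* parameters."""
    N, T, M = len(pi), len(obs), len(B[0])
    # E-step: forward and backward variables
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
        beta[T - 1][i] = 1.0
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
    likelihood = sum(alpha[T - 1])
    # Posterior state and transition probabilities (gamma and xi formulas above);
    # the normalizer sum_j alpha_t(j) beta_t(j) equals P(O | lambda) for every t.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # M-step: re-estimate parameters from the expected counts
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return new_pi, new_A, new_B, likelihood

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 0, 1, 1, 0]
pi1, A1, B1, lik1 = baum_welch_step(obs, pi, A, B)
```

Iterating this step never decreases the likelihood (the EM monotonicity guarantee), which is what makes the simple convergence checks described above sufficient in practice.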

Extensions of HMMs

Higher-order Markov models

  • Higher-order Markov models relax the assumption of the first-order Markov property
  • They allow the current state to depend on multiple previous states instead of just the immediate predecessor
  • Higher-order models can capture more complex temporal dependencies but at the cost of increased computational complexity and parameter estimation challenges

Input-output HMMs

  • Input-output HMMs (IOHMMs) extend the basic HMM by incorporating additional input variables that influence the state transitions and observations
  • IOHMMs allow the modeling of systems where the hidden states and observations depend on external factors or control inputs
  • Examples of IOHMMs include speech recognition systems that adapt to speaker characteristics or financial models that incorporate macroeconomic variables

Factorial HMMs

  • Factorial HMMs (FHMMs) represent the hidden state space as a Cartesian product of multiple independent state variables
  • Each state variable evolves independently according to its own Markov chain, and the observations depend on the combined state of all variables
  • FHMMs can model complex systems with multiple interacting processes or factors
  • Examples of FHMMs include modeling multiple speech sources in audio separation or analyzing multiple interacting genetic regulatory networks

Applications of HMMs

Speech recognition

  • HMMs are widely used in speech recognition systems to model the acoustic and linguistic properties of speech
  • The hidden states represent different phonemes or sub-word units, and the observations are acoustic features extracted from the speech signal
  • HMMs enable the recognition of spoken words or phrases by finding the most likely sequence of states given the observed acoustic features

Bioinformatics and sequence analysis

  • HMMs are applied in bioinformatics to analyze and predict biological sequences such as DNA, RNA, or protein sequences
  • The hidden states can represent different functional or structural regions in the sequences, and the observations are the individual nucleotides or amino acids
  • HMMs are used for tasks such as gene prediction, protein family classification, or identifying conserved patterns in sequences

Financial modeling and regime switching

  • HMMs are used in financial modeling to capture the dynamics of market regimes or hidden economic states
  • The hidden states represent different market conditions or economic regimes, and the observations are financial time series data such as stock prices or economic indicators
  • HMMs enable the identification of regime shifts, volatility clustering, or long-term dependencies in financial markets

Limitations and challenges of HMMs

Model selection and complexity

  • Choosing the appropriate number of hidden states and the structure of the HMM is a challenging task
  • Model selection techniques such as cross-validation, information criteria, or Bayesian methods can be used to compare different HMM architectures
  • Overly complex models may lead to overfitting and poor generalization, while overly simplistic models may not capture the underlying dynamics adequately

Identifiability issues

  • HMMs may suffer from identifiability issues, where different parameter settings can lead to the same observable behavior
  • Identifiability problems can arise due to the presence of symmetries or redundancies in the model structure
  • Identifiability issues can complicate parameter estimation and interpretation of the learned model

Computational complexity and scalability

  • The computational complexity of HMM algorithms grows with the number of states and the length of the observation sequences
  • The forward, Viterbi, and Baum-Welch algorithms have a time complexity of $O(N^2T)$, where $N$ is the number of states and $T$ is the sequence length
  • Scaling HMMs to large state spaces or long sequences can be computationally challenging and may require approximation techniques or parallel computing approaches
  • Techniques such as beam search, pruning, or variational inference can be used to mitigate the computational burden in large-scale applications