Data Analytics Blog

Data Analytics Case Studies, WhyTos, HowTos, Interviews, News, Events, Jobs and more...# Introduction to Hidden Markov Models using Python

A powerful statistical tool for modeling time series data. It is used for analyzing a generative observable sequence that is characterized by some underlying unobservable sequences. Though the basic theory of Markov Chains is devised in the early 20^{th} century and a full grown Hidden Markov Model(HMM) is developed in the 1960s, its potential is recognized in the last decade only. Its application ranges across the domains like Signal Processing in Electronics, Brownian motions in Chemistry, Random Walks in Statistics (Time Series), Regime Detection in Quantitative Finance and Speech processing tasks such as part-of-speech tagging, phrase chunking and extracting information from provided documents in Artificial Intelligence.

Let us delve into this concept by looking through an example.

My colleague, who lives in a different part of the country, has three unique outfits, Outfit 1, 2 & 3 as O1, O2 & O3 respectively. I am looking to predict his outfit for the next day. So, under the assumption that I possess the probabilities of his outfits and I am aware of his outfit pattern for the last 5 days, O2 O3 O2 O1 O2. Assuming these probabilities are 0.25,0.4,0.35, from the basic probability lectures we went through we can predict the outfit of the next day to be O1 is 0.4*0.35*0.4*0.25*0.4*0.25 = 0.0014. Do you think this is the probability of the outfit O1?? Then it is a big NO. The underlying assumption of this calculation is that his outfit is dependent on the outfit of the preceding day. What if it not. What if it is dependent on some other factors and it is totally independent of the outfit of the preceding day. Then we are clueless. Don’t worry, we will go a bit deeper.

Let us assume that he wears his outfits based on the type of the season on that day. Think there are only two seasons, S1 & S2 exists over his place. I am totally unaware about this season dependence, but I want to predict his outfit, may not be just for one day but for one week or the reason for his outfit on a single given day. Here comes Hidden Markov Model(HMM) for our rescue.

Hoping that you understood the problem statement and the conditions apply HMM, lets define them:

**What is Hidden Markov Model?**

A Hidden Markov Model is a statistical Markov Model (chain) in which the system being modeled is assumed to be a Markov Process with hidden states (or unobserved) states.

It is a bit confusing with full of jargons and only word Markov, I know that feeling. Let’s see it step by step.

**What is a Markov Property?**

A stochastic process (or a random process that is a collection of random variables which changes through time) if the probability of future states of the process depends only upon the present state, not on the sequence of states preceding it.

It is commonly referred as memoryless property.

Any random process that satisfies the Markov Property is known as **Markov Process**.

**What is Markov Model?**

A statistical model that follows the Markov process is referred as Markov Model. There are four common Markov models used in different situations, depending on the whether every sequential state is observable or not and whether the system is to be adjusted based on the observation made:

State | System state is fully observable | System State is partially observable |

A system is autonomous | Markov Chain | Hidden Markov Model |

A system is controlled | Markov Decision Process | Partially observable Markov Decision process |

We will be going through the HMM, as we will be using only this in Artificial Intelligence and Machine Learning.

In our case, under an assumption that his outfit preference is independent of the outfit of the preceding day. So, it follows Markov property. Here, seasons are the hidden states and his outfits are observable sequences. Hence, our example follows Markov property and we can predict his outfits using HMM.

**Difference between Markov Model & Hidden Markov Model**

After going through these definitions, there is a good reason to find the difference between Markov Model and Hidden Markov Model. Our example contains 3 outfits that can be observed, O1, O2 & O3, and 2 seasons, S1 & S2.

Considering the problem statement of our example is about predicting the sequence of seasons, then it is a Markov Model. Besides, our requirement is to predict the outfits that depend on the seasons. In case of initial requirement, we don’t possess any hidden states, the observable states are seasons while in the other, we have both the states, hidden(season) and observable(Outfits) making it a Hidden Markov Model.

So, in other words, we can define HMM as a sequence model. A sequence model or sequence classifier is a model whose job is to assign a label or class to each unit in a sequence, thus mapping a sequence of observations to a sequence of labels. An HMM is a probabilistic sequence model, given a sequence of units, they compute a probability distribution over a possible sequence of labels and choose the best label sequence.

We will use a type of dynamic programming named Viterbi algorithm to solve our HMM problem.

**Notations**

In the above experiment, as explained before, three Outfits are the Observation States and two Seasons are the Hidden States. In general, consider there is N number of hidden states and M number of observation states, we now define the notations of our model:

N = number of states in the model i.e. seasons

M = total number of distinct observations i.e. outfits

T = length of observation sequence i.e. the number of outfits observed

i_{t} represents the state, i, in which we are, at time t

V = {V_{1}, ……, V_{M}} discrete set of possible observation symbols

π = probability of being in a state i at the beginning of experiment as **STATE INITIALIZATION PROBABILITY**

A = {a_{ij}} where a_{ij }is the probability of being in state j at a time t+1, given we are at stage i at a time, known as **STATE TRANSITION PROBABILITY**

B = the probability of observing the symbol v_{k }given that we are in state j known as **OBSERVATION PROBABILITY**

O_{t }denotes the observation symbol observed at time t

λ = (A, B, π) a compact notation to denote HMM.

Using this model, we can generate an observation sequence i.e. O_{1}, O_{2},_{ }O_{3},_{ }O_{4 …………… }O_{N}. In our experiment, the set of probabilities defined above are the initial state probabilities or π. We need to define a set of state transition probabilities. This tells us that the probability of moving from one state to the other state.

**Figure 1** depicts the initial state probabilities. We can visualize A or transition state probabilities as in **Figure 2**. It shows the Markov model of our experiment, as it has only one observable layer.

The extension of this is **Figure 3 **which contains two layers, one is hidden layer i.e. seasons and the other layer is observable i.e. outfits that depict the Hidden Markov Model.

All the numbers on the curves are the probabilities that define the transition from one state to another state. Using these set of probabilities, we need to predict (or) determine the sequence of observable states given the set of observed sequence of states.

Now we have seen the structure of an HMM, we will see the algorithms to compute things with them. There are four algorithms to solve the problems characterized by HMM. They are Forward-Backward Algorithm, Viterbi Algorithm, Segmental K-Means Algorithm & Baum-Welch re-Estimation Algorithm.

Using Viterbi, we can compute the possible sequence of hidden states given the observable states. We will see what Viterbi algorithm is.

#### Viterbi Algorithm

This algorithm finds the maximum probability of any path to arrive at the state, i , at time t that also has the correct observations for the sequence up to time t.

The idea is to propose multiple hidden state sequence to available observed state sequences. Then we would calculate the maximum likelihood estimate using the probabilities at each state that drive to the final state. In another word, it finds the best path of hidden states being confined to the constraint of observed states that leads us to the final state of the observed sequence

**Summary**

In this post, we understood the below points:

- What is meant by Markov Model?
- What is meant by Hidden Markov Models?
- Conditions to apply HMM with an example.
- Difference between Markov Model & Hidden Markov Model
- Notations of HMM
- Viterbi Algorithm

**References**

- http://www.blackarbs.com/blog/introduction-hidden-markov-models-python-networkx-sklearn/2/9/2017
- https://en.wikipedia.org/wiki/Hidden_Markov_model
- http://www.iitg.ac.in/samudravijaya/tutorials/hmmTutorialDugadIITB96.pdf

import numpy as np import pandas as pd import networkx.drawing.nx_pydot as gl import networkx as nx import matplotlib.pyplot as plt from pprint import pprint ##matplotlib inline # create state space and initial state probabilities states = ['O1', 'O2', 'O3'] pi = [0.25, 0.4, 0.35] state_space = pd.Series(pi, index=states, name='states') print(state_space) print(state_space.sum()) # create transition matrix # equals transition probability matrix of changing states given a state # matrix is size (M x M) where M is number of states q_df = pd.DataFrame(columns=states, index=states) q_df.loc[states[0]] = [0.4, 0.2, 0.4] q_df.loc[states[1]] = [0.45, 0.45, 0.1] q_df.loc[states[2]] = [0.45, 0.25, .3] print(q_df) q = q_df.values print('\n') print(q, q.shape) print('\n') print(q_df.sum(axis=1)) from pprint import pprint # create a function that maps transition probability dataframe # to markov edges and weights def _get_markov_edges(Q): edges = {} for col in Q.columns: for idx in Q.index: edges[(idx,col)] = Q.loc[idx,col] return edges edges_wts = _get_markov_edges(q_df) pprint(edges_wts) # create graph object G = nx.MultiDiGraph() # nodes correspond to states G.add_nodes_from(states) print('Nodes:\n') print(G.nodes()) print('\n') # edges represent transition probabilities for k, v in edges_wts.items(): tmp_origin, tmp_destination = k[0], k[1] G.add_edge(tmp_origin, tmp_destination, weight=v, label=v) print('Edges:') pprint(G.edges(data=True)) pos = nx.drawing.nx_pydot.graphviz_layout(G, prog='dot') nx.draw_networkx(G, pos) # create edge labels for jupyter plot but is not necessary edge_labels = {(n1,n2):d['label'] for n1,n2,d in G.edges(data=True)} nx.draw_networkx_edge_labels(G , pos, edge_labels=edge_labels) nx.drawing.nx_pydot.write_dot(G, 'markov.dot') plt.show() # create state space and initial state probabilities hidden_states = ['S1', 'S2'] pi = [0.5, 0.5] print('\n') state_space = pd.Series(pi, index=hidden_states, name='states') print(state_space) print('\n') print(state_space.sum()) # create hidden transition matrix # a or alpha # = transition probability matrix of changing states given a state # matrix is size (M x M) where M is number of states a_df = pd.DataFrame(columns=hidden_states, index=hidden_states) a_df.loc[hidden_states[0]] = [0.7, 0.3] a_df.loc[hidden_states[1]] = [0.4, 0.6] print(a_df) a = a_df.values print('\n') print(a) print(a.shape) print('\n') print(a_df.sum(axis=1)) # create matrix of observation (emission) probabilities # b or beta = observation probabilities given state # matrix is size (M x O) where M is number of states # and O is number of different possible observations observable_states = states b_df = pd.DataFrame(columns=observable_states, index=hidden_states) b_df.loc[hidden_states[0]] = [0.2, 0.6, 0.2] b_df.loc[hidden_states[1]] = [0.4, 0.1, 0.5] print(b_df) b = b_df.values print('\n') print(b) print(b.shape) print('\n') print(b_df.sum(axis=1)) # create graph edges and weights hide_edges_wts = _get_markov_edges(a_df) pprint(hide_edges_wts) emit_edges_wts = _get_markov_edges(b_df) pprint(emit_edges_wts) # create graph object G = nx.MultiDiGraph() # nodes correspond to states G.add_nodes_from(hidden_states) print('Nodes:\n') print(G.nodes()) print('\n') # edges represent hidden probabilities for k, v in hide_edges_wts.items(): tmp_origin, tmp_destination = k[0], k[1] G.add_edge(tmp_origin, tmp_destination, weight=v, label=v) # edges represent emission probabilities for k, v in emit_edges_wts.items(): tmp_origin, tmp_destination = k[0], k[1] G.add_edge(tmp_origin, tmp_destination, weight=v, label=v) print('Edges:') pprint(G.edges(data=True)) pos = nx.drawing.nx_pydot.graphviz_layout(G, prog='neato') nx.draw_networkx(G, pos) plt.show() # create edge labels emit_edge_labels = {(n1,n2):d['label'] for n1,n2,d in G.edges(data=True)} nx.draw_networkx_edge_labels(G , pos, edge_labels=emit_edge_labels) #plt.show() nx.drawing.nx_pydot.write_dot(G, 'hidden_markov.dot') # observation sequence of dog's behaviors # observations are encoded numerically obs_map = {'O1':0, 'O2':1, 'O3':2} obs = np.array([1,1,2,1,0,1,2,1,0,2,2,0,1,0,1]) inv_obs_map = dict((v,k) for k, v in obs_map.items()) obs_seq = [inv_obs_map[v] for v in list(obs)] print( pd.DataFrame(np.column_stack([obs, obs_seq]), columns=['Obs_code', 'Obs_seq']) ) def viterbi(pi,a,b,obs): nStates = np.shape(b)[0] T = np.shape(obs)[0] path = np.zeros(T) delta = np.zeros((nStates,T)) phi = np.zeros((nStates,T)) delta[:,0] = pi * b[:,obs[0]] phi[:,0] = 0 for t in range(1,T): for s in range(nStates): delta[s,t] = np.max(delta[:,t-1]*a[:,s])*b[s,obs[t]] phi[s,t] = np.argmax(delta[:,t-1]*a[:,s]) path[T-1] = np.argmax(delta[:,T-1]) for t in range(T-2,-1,-1): #path[t] = phi[int(path[t+1]): int(t+1) , int(t+1)] path[t] = phi[int(path[t+1]) , int(t+1)] return path,delta, phi path, delta, phi = viterbi(pi, a, b, obs) print('\n') print('single best state path: ', path) print('delta:\n', delta) print('phi:\n', phi) state_map = {0:'S1', 1:'S2'} state_path = [state_map[v] for v in path] result = (pd.DataFrame() .assign(Observation=obs_seq) .assign(Best_Path=state_path)) print(result)

Deepak is a Big Data technology-driven professional and blogger in open source Data Engineering, Machine Learning, and Data Science. He extensively works in Data gathering, modeling, analysis, validation and architecture/solution design to build next-generation analytics platform.