Implementing Hidden Markov Models(HMM) for Information Extraction in Perl


Introduction

HMM are very often used for Information Extraction (IE). There are a lot of interesting articles, tutorials on the Internet about theory behind HMM and applications HMM to different areas like IE, speech recognition and some other.Some of resources can be found at [1],[2],[3],[4]. Focus of this tutorial is not theory or algorithms of HMM. We will investigate how toimplement HMM in Perl. The end goal is to implement solution to typical web mining (IE) task like populating database with extracted words from the text.

IE with HMM has following steps : the extraction of needed fields from corpus of data, the training (or learning) HMM parameters,and finally learning the structure of HMM (can be skipped if fixed structure is used). To make our job easy we will develop source code for simple numerical example and then move to text mining.

The decoding problem

The first problem is decoding problem. For given observation we need to detect what states have the highest probability to produce this observations. One of the state that has needed information is target state. The Viterbi algorithm can be used for this task. Description, theory of algorithm can be found at [4],[5].[4] has also source code in Pyphon. The example considered in this article has 2 states (Rainy, Sunny) and 3 observations (walk, shop, clean).Here is the Perl solution to the same example. Script1The function calculates and prints total, viterbi path, probability of viterbi path.In the future with text mining problem we will be specifically interested at this viterbi path. It can tell what state it was for given observation and if this state is what we need we can extract and save the info.

Leaning of HMM parameters with gradient-based method

But let's still consider this numerical example. Before using HMM for decoding we need to know HMM parameters,transmission probabilities, emission probabilities, starting probabilities. There are several types of data that can be used [5] and several algorithms.We will consider Gradient based method described in [1]Here is the source code in Perl to implement learning.Script2The script has following steps
  1. Generate training data based on given probabilities.
  2. Initialize transmissions, emissions probabilities to some values.
  3. Calculate forward variables.
  4. Calculate backward variables.
  5. Calculate gradient transition.
  6. Calculate gradient emission.
  7. Update variables(transmissions, emissions probabilities).
  8. Repeat steps 2-7 until done.
In real problem the first step is not necessary but labeled data for training should be obtained. Calculating forward variables is the same as in decoding problem, but now we are saving for all observations and states in 2 dimensional array.In similar way we are calculating backward variables but starting from the end observation.

Learning structure for HMM

The above script for learning is assuming fixed HMM structure. However sometimes we don't know what structure should be used. But as shown in papers [3], [5],[6] it can be learned from data also. One of the way to learn structure is using stochastic optimization. We will use Hill climbing algorithm. In two words this algorithm evaluate possible paths and choose which is better and repeats this process.Consider again the simple example with 4 states.Here is the perl source for script for finding 4 states structure for given training data using hill climbing algorithm.Script3The square matrix data structure is used for representing HMM structure. The size of matrix is the number of states.If there is a link from state 2 to state 3 then element of matrix (2,3) has value 1 otherwise it equal to 0.To create new structure from the current matrix we need just change any element of matrix from 0 to 1.Some of structures will not valid but it's not checked here. To evaluate matrix we calculate the number of different positions in this matrix and all training data. Smaller difference should indicate that the matrix has better fit.

Using HMM with the text data

The source code that we used above for numerical examples can be used also for text data.The simplest structure that is used in many papers is 4 state structure with states called target, background, prefix, and suffix.We are interesting in target state. The observations are now the text itself.The question is how do we separate text or what is one observation?Thus text should be preprocessed. It should be broken at tokens, where each token correspond one observation.For small domain some tokens can be predefined.Here is the algorithm that is checking if given word or words are in this list. Script4It converts text to the sequence of numbers for further processing with HMM. So now we can use methods described above.There are a lot of things that should be taken care of. For example plural, suffix, prefix, compound words (MS Access).We don't consider this in our algorithm.Thus we saw how HMM can be used for information extraction. Namely we used gradient based method for learning parameters, viterbi algorithm for extraction and stochastic optimization for learning structure. We implemented the source code in Perl for these methods.And this source code can be used for IE after the preprocessing of the text.

References

1. Hidden Markov Models by Narada Warakagoda.
2. Andrew McCallum, Kamal Nigam, Jason Rennie, Kristie Seymore, A Machine Learning Approach to Building Domain-Specific Search Engines, The Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99).
3. Nancy R. Zang, Hidden Markov Models for Information Extraction June,2001.
4. Viterbi algorithm, From Wikipedia, the free encyclopedia.
5. Kristie Seymore, Andrew McCallum, Ronald Rosenfeld, Learning Hidden Markov Model Structure for Information Extraction, AAAI 99 Workshop on Machine Learning for Information Extraction.
6. Freitag D., & McCallum, A., Information extraction with HMM structures learned by stochastic optimization.
In Proceedings of the Eighteenth Conference on Artificial Intelligence (AAAI-2000).