Subscribe to DSC Newsletter

Generating Text Using a Markov Model

A Markov Chain is a random process, where we assume the previous state(s) hold sufficient predictive power in predicting the next state. Unlike flipping a coin, these events are dependent. It’s easier to understand through an example.

Imagine the weather can only be rainy or sunny. That is, the state space is rainy or sunny. We can represent our Markov model as a transition matrix, with each row being a state, and each column being the probability it moves to another.


However, it’s easier to understand with this state transition diagram.


In other words, given today is sunny, there is a .9 probability that tomorrow will be sunny, and a .1 probability that tomorrow will be rainy.

Text Generator

One cool application of this is a language model, in which we predict the next word based on the current word(s). If we just predict based on the last word, it is a first-order Markov model. If we use the last two words, it’s a second-order Markov model.

In my example I trained the model using Walden by Henry Thoreau. I also included files of Thus Spoke Zarathustra by Nietszche , and some speeches by Obama to make it easy to experiment. The cool thing about this is that whatever text you train it on, the model spits out really similar text.

First we have to import NLTK, the best NLP library in Python. I would say the natural language processing we’re doing here to be pretty mild, but their built in functions save me a lot of code. We then turn the string (taken from the text file) into an array using the split() function.

import nltk import random  
file = open('Text/Walden.txt', 'r')
walden =
walden = walden.split()

These two functions are the meat of the code. The Conditional Frequency Dictionary from NLTK that we are going to eventually use has to take the array as pairs. So the phrase “Hi my name is Alex” would becomes [(“Hi”, “my”), (“my, “name”), (“name”, “is”), (“is”, “Alex”)]. The makePairs function takes in an array (a string split by word) and outputs an array in the above format.

The generate method takes in a conditional frequency distribution. Think – how many times did each word appear after ‘farm’? That is what a conditional frequency distribution outputs (for all words, not just ‘farm’). The rest of the generate function does is output text based on the distribution observed in the training data. I did this by making an array with each word that appeared after the current word. The array also has the correct counts, so then I can just randomly pick a word from the array and the distribution will be obeyed.

def makePairs(arr): pairs = []
for i in range(len(arr)):
if i < len(arr)-1:
temp = (arr[i], arr[i+1])
return pairs

def generate(cfd, word = 'the', num = 50):
for i in range(num):
arr = [] # an array with the words shown by count
for j in cfd[word]:
for k in range(cfd[word][j]):

print(word, end=' ')
word = arr[int((len(arr))*random.random())]

With these final three lines, we output some text, that looks a lot like Walden.

pairs = makePairs(walden) 
cfd = nltk.ConditionalFreqDist(pairs)


the question, perhaps from the leisure fruitful. "But," says pertinently that sound asleep thus. Yet his gentle rain while to choose. If he was nearly level, and wisely; as he sprang upon the entrance of goldenrod (Solidago stricta) grows a half-hour's nap after practising various animals of some cheerful it

I encourage you to look at the iPython notebook on my github, since I continued to make a method where all you have to do is enter a file name, and it will output generated text. The Obama example is also pretty cool.

If you want to try this yourself, just make a text file and put it in the correct directory.

P.S. There is an awesome conversation simulator called megaHAL that uses Markov Models!


View the original post, and others from the author here.

Views: 2989


You need to be a member of AnalyticBridge to add comments!

Join AnalyticBridge

Follow Us

On Data Science Central

On DataViz

On Hadoop

© 2017 is a subsidiary and dedicated channel of Data Science Central LLC   Powered by

Badges  |  Report an Issue  |  Terms of Service