Build a Deep Learning Text Generator Project with Markov Chains

Nov 03, 2020 - 10 min read
Ryan Thelin

Natural language processing (NLP) and deep learning are growing in popularity for their use in ML technologies like self-driving cars and speech recognition software.

As more companies begin to implement deep learning components and other machine learning practices, the demand for software developers and data scientists with proficiency in deep learning is skyrocketing.

Today, we will introduce you to a popular deep learning project, the Text Generator, to familiarize you with important, industry-standard NLP concepts, including Markov chains.

By the end of this article, you’ll understand how to build a Text Generator component for search engine systems and know how to implement Markov chains for faster predictive models.

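Here’s what we’ll cover today:

  • Introduction to the Text Generator Project
  • What are Markov Chains?
  • Text Generation Project Implementation
  • What to learn next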




Introduction to the Text Generator Project

Text generation is used across many industries, particularly in mobile apps and data science. Even journalism uses text generation to aid the writing process.

You’ve probably encountered text generation technology in your day-to-day life. iMessage text completion, Google search, and Google’s Smart Compose on Gmail are just a few examples. These skills are valuable for any aspiring data scientist.

Today, we are going to build a text generator using Markov chains. This will be a character-based model that takes the previous characters of the chain and generates the next letter in the sequence.

By training our program with sample words, our text generator will learn common patterns in character order. The text generator will then apply these patterns to the input, an incomplete word, and output the character with the highest probability to complete that word.

Let’s suppose we have a string, monke. We need to find the character that is best suited after the character e in the word monke based on our training corpus.

Our text generator would determine that y sometimes follows e in the training corpus and that adding it forms a completed word. In other words, we are going to generate the next character for that given string.
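To make the idea concrete, here is a minimal sketch that counts which characters follow e in a handful of sample words and picks the most frequent one. The word list and the helper name followers are assumptions made up for this illustration; they are not part of the project code.

```python
from collections import Counter

def followers(words, ch):
    """Count the characters that appear directly after `ch` in the given words."""
    counts = Counter()
    for word in words:
        # zip(word, word[1:]) yields every pair of neighbouring characters.
        for current, following in zip(word, word[1:]):
            if current == ch:
                counts[following] += 1
    return counts

# Hypothetical training words, chosen only for this example.
words = ["monkey", "donkey", "turkey", "keen"]
counts = followers(words, "e")

# The most frequent follower of "e" is the model's best guess.
best_guess = counts.most_common(1)[0][0]
print("monke" + best_guess)
```

With these four words, y follows e three times while e and n follow it once each, so the sketch completes monke to monkey.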


The text generator project relies on text generation, a subdivision of natural language processing that predicts and generates next characters based on previously observed patterns in language.

Without NLP, we’d have to create a table of all words in the English language and match the passed string to an existing word. There are two problems with this approach.

  • It would be very slow to search thousands of words.
  • The generator could only complete words that it had seen before.

NLP allows us to dramatically cut runtime and increase versatility because the generator can complete words it hasn’t even encountered before. NLP can be expanded to predict words, phrases, or sentences if needed!

For this project, we will specifically be using Markov chains to complete our text. Markov processes are the basis for many NLP projects involving written language and simulating samples from complex distributions.

Markov processes are so powerful that they can be used to generate superficially real-looking text with only a sample document.


What are Markov Chains?

A Markov chain is a stochastic process that models a sequence of events in which the probability of each event depends on the state of the previous event. The model requires a finite set of states with fixed conditional probabilities of moving from one state to another.

The probability of each shift depends only on the previous state of the model, not the entire history of events.

For example, imagine you wanted to build a Markov chain model to predict weather conditions.

We have two states in this model: sunny or rainy. There is a higher probability (70%) that it’ll be sunny tomorrow if it is sunny today. The same is true for rainy: if it is rainy today, it will most likely rain again tomorrow.

However, it’s possible (30%) that the weather will shift states, so we also include that in our Markov chain model.

Example of Markov chain states
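The weather chain can be sketched in a few lines of Python. This is only an illustration: the names transitions, simulate_weather, and two_step_probability are assumptions, and the 70% chance of staying rainy is inferred from the statement that "the same is true for rainy".

```python
import random

# transitions[today][tomorrow] is the probability of moving between states.
transitions = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.3, "rainy": 0.7},
}

def simulate_weather(start, days, rng):
    """Walk the chain for `days` steps; each step looks only at the current state."""
    path = [start]
    for _ in range(days):
        options = transitions[path[-1]]
        next_state = rng.choices(list(options), weights=list(options.values()))[0]
        path.append(next_state)
    return path

def two_step_probability(start, end):
    """P(state in two days = end | today = start), summed over tomorrow's state."""
    return sum(transitions[start][mid] * transitions[mid][end] for mid in transitions)

rng = random.Random(0)  # seeded so repeated runs give the same sequence
print(simulate_weather("sunny", 7, rng))
print(two_step_probability("sunny", "sunny"))
```

The two-step probability of sunny to sunny works out to 0.7 × 0.7 + 0.3 × 0.3 = 0.58, which shows how longer-range forecasts follow from the one-step probabilities alone.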

The Markov chain is a good fit for our text generator because our model predicts the next character from the current state alone, which here is the last K characters. The advantage of using a Markov chain is that it’s reasonably accurate, light on memory (it stores only the current state rather than the full history), and fast to execute.


Text Generation Project Implementation

We’ll complete our text generator project in 6 steps:

  1. Generate the lookup table: Create a table to record how often each character follows each K-character sequence
  2. Convert frequency to probability: Convert our findings to a usable form
  3. Load the dataset: Load and utilize a training set
  4. Build the Markov chains: Use the probabilities to create chains for each word and character
  5. Sample our data: Create a function to sample individual sections of the corpus
  6. Generate text: Test our model

1. Generate the lookup table

First, we’ll create a table that records the occurrences of each character state within our training corpus. For every position in the corpus, we take the last K characters as the state (X) and the character that follows them, the (K+1)th character, as the output (Y), and save the pair in a lookup table.

For example, imagine our training corpus contained, “the man was, they, then, the, the”. Then the number of occurrences by word would be:

  • “the” - 3
  • “then” - 1
  • “they” - 1
  • “man” - 1

Here’s what such a lookup table could look like (the counts are illustrative):

X       Y       Frequency
"the"   " "     3
"the"   "n"     2
"the"   "y"     1
"the"   "i"     1
"man"   " "     1

In the example above, we have taken K = 3. Therefore, we’ll consider 3 characters at a time and take the next character (K+1) as our output character.

In the above lookup table, the first row has the string (X) "the" and the output character (Y) as a single space (" "). We have also counted how many times this sequence occurs in our dataset, 3 in this case.

We’ll collect this data for every K-character window in the corpus to generate all possible pairs of X and Y within the dataset.

Here’s how we’d generate a lookup table in code:

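def generateTable(data,k=4):
    # T[X][Y] counts how often character Y follows the k-character string X.
    T = {}
    for i in range(len(data)-k):
        X = data[i:i+k]
        Y = data[i+k]
        # X is the current k-character window; Y is the character right after it.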
        
        if T.get(X) is None:
            T[X] = {}
            T[X][Y] = 1
        else:
            if T[X].get(Y) is None:
                T[X][Y] = 1
            else:
                T[X][Y] += 1
    
    return T

T = generateTable("hello hello helli")
print(T)

Explanation

  • The dictionary T stores each X together with its corresponding Y values and their frequencies. Try running the above code and see the output.

  • The if/else block inside the loop checks whether we have already seen X and the pair of X and Y. If the pair is new, we record it with a count of 1; if it is already in our lookup dictionary, we just increment its count by 1.
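The same counting logic can also be written with the standard library's defaultdict and Counter, which removes the nested existence checks. The sketch below is an alternative formulation rather than the article's code, and the name build_table is an assumption. It runs on the sample corpus from the start of this step with K = 3.

```python
from collections import Counter, defaultdict

def build_table(data, k):
    """Count, for every k-character window X, which character Y follows it."""
    table = defaultdict(Counter)
    for i in range(len(data) - k):
        # data[i:i + k] is X, data[i + k] is Y; missing keys start at zero.
        table[data[i:i + k]][data[i + k]] += 1
    return table

corpus = "the man was, they, then, the, the"
table = build_table(corpus, k=3)

print(dict(table["the"]))  # followers of "the" and their counts
print(dict(table["man"]))  # followers of "man" and their counts
```

On this short string, "the" is followed once each by a space, y, n, and a comma. The final "the" has no character after it, so it is not counted. These real counts are smaller than the illustrative ones in the table above.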


2. Convert frequencies to probabilities

Once we have this table and the occurrences, we’ll compute the probability that a given Y appears after a given X. Our equation for this will be:

$$P(Y \mid X) = \frac{\text{Frequency of } Y \text{ with } X}{\text{Sum of all frequencies for } X}$$

For example, if X = the and Y = n, our equation would look like this:

  • Frequency that Y = n when X = the: 2
  • Total frequency of all Y values for X = the: 3 + 2 + 1 + 1 = 7
  • Therefore: P = 2/7 ≈ 0.286 = 28.6%

Here’s how we’d apply this equation to convert our lookup table to probabilities usable with Markov chains:

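def convertFreqIntoProb(T):
    # For each X, divide every count by the total count for that X,
    # so the values in T[X] become probabilities that sum to 1.
    for kx in T.keys():
        s = float(sum(T[kx].values()))
        for k in T[kx].keys():
            T[kx][k] = T[kx][k]/s

    return T

# Note: T is modified in place; the returned table is the same object.
T = convertFreqIntoProb(T)
print(T)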

Explanation

  • We summed up the frequency values for a particular key and then divided each frequency value of that key by that summed value to get our probabilities. Simple logic!
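As a quick check of that logic, the sketch below normalizes the illustrative counts from step 1 and confirms that the probabilities for each X sum to 1. The helper name to_probabilities is an assumption; unlike convertFreqIntoProb(), it returns a new table and leaves the counts untouched.

```python
def to_probabilities(table):
    """Return a new table where each context's counts are scaled to sum to 1."""
    probs = {}
    for context, counts in table.items():
        total = sum(counts.values())  # total for this context only
        probs[context] = {ch: n / total for ch, n in counts.items()}
    return probs

# Illustrative counts from the lookup table in step 1.
counts = {
    "the": {" ": 3, "n": 2, "y": 1, "i": 1},
    "man": {" ": 1},
}
probs = to_probabilities(counts)

print(round(probs["the"]["n"], 3))   # 2 out of the 7 occurrences of "the"
print(sum(probs["the"].values()))    # should be 1 up to floating-point rounding
```

Dividing by the per-context total (7 for "the") rather than by the size of the whole table is what makes each row a valid probability distribution, which the sampling step later relies on.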

3. Load the dataset

Next, we’ll load our real training corpus. You can use any long text (.txt) document you want.

We’ll use a political speech to provide enough words to teach our model.

text_path = "train_corpus.txt"
def load_text(filename):
    with open(filename,encoding='utf8') as f:
        return f.read().lower()
    
text = load_text(text_path)
print('Loaded the dataset.')

This dataset will give our generator enough occurrences to make reasonably accurate predictions. As with most machine learning models, a larger training corpus generally results in more accurate predictions.


4. Build the Markov chains

Now let’s construct our Markov chains and associate the probabilities with each character. We’ll use the generateTable() and convertFreqIntoProb() functions created in step 1 and step 2 to build the Markov models.

train_corpus.txt
My dear countrymen,

Many of you wish many-many good wishes of the holy festival of independence.

Today the country is full of confidence. The country is crossing the new heights by plowing the resolve of dreams with hard work. Today's sunrise has brought a new consciousness, new excitement, new excitement, new energy.

Our lovely countrymen, once in 12 years, flowers of Nilakurinya grow in our country. This year, on the hills of Nilgiris in the south, it is like our Nilkurinji flower like the Ashok Chakra of the Tricolor flag, in the festival of freedom of the country.

My dear countrymen, we are celebrating this festival of independence, when our daughters Uttarakhand, Himachal, Manipur, Telangana, Andhra Pradesh - our daughters of these states crossed seven seas and coloring the seven seas with a color of tricolor Came back

My dear countrymen, we are celebrating the festival of independence at that time, when Everest triumphs were so many, many of our heroes, many of our daughters went to the Everest and hoisted the Tricolor flag. But in the celebration of this freedom, I will remember that the tribal children living in far-off jungles have increased the glory of the Tricolor flag by hoisting the Tricolor flag on Everest.


My dear countrymen, the sessions of the Lok Sabha and Rajya Sabha have just been fulfilled. You must have seen that the House ran very well and in a sense this session of Parliament was entirely devoted to social justice. To protect their rights, our Parliament made social justice more forceful with sensitivity and awareness, to be oppressed, oppressed, exploited, deprived, women, to protect their rights.

The OBC Commission was demanding for a constitutional place for years. This time Parliament has tried to protect their rights by giving a constitutional order to backward, backward, by giving constitutional status to that commission.

We are celebrating the festival of independence at that time, when those news in our country brought new consciousness to the country, with whom every Indian who is not in any corner of the world, today is proud of the fact that India Has registered its name in the world's sixth largest economy. In such a positive environment, among the series of positive events we are celebrating the festival of independence today.

In order to give freedom to the country, millions of people spent their lives in the Jubilee prisons under the leadership of Pujya Bapu. Many revolutionary great men hanged on the hanging frames and kissed the hanging for the country's independence. I heartily greet these brave fighters of independence from the countrymen today, I bow my heart to the eternal glory of the tricolor, inspiring us to live and die, the tricolor of For the sake of the army of the army of the country, our soldiers sacrifice their lives, our paramilitary forces spend life, the soldiers of our police force, in the service of the country day and night to protect the common man. Live gay

I bow down to the ranks of the Red Fort in the evidence of the Tricolor flag today for all the soldiers of the army, the paramilitary forces, the police personnel, for their great service, for their sacrifice and happiness, for their power and happiness. I am very happy and give them a lot of luck.

These days, reports of good rainfall are coming from different corners of the country, along with flood reports are coming along. Those families who have lost their loved ones due to overcrowding and floods, who have suffered difficulties, have been standing in their help with the full power of the country and those who have lost their lives, I am involved in their misery.

My dear countrymen, the next Baisakhi is going to be a hundred years of massacres of our Jalianwala Bagh. How ordinary people of the country had betrayed life for the country's independence and how long had the boundaries of oppression passed? Jalianwala Bagh gives the message of sacrifice and sacrifice of those heroes of our country. I heartily respect all those heroes.

Explanation

  • We create a method that generates the Markov model. It accepts the text corpus and the value of K, which tells the Markov model to consider K characters when predicting the next character.

  • Inside it, we generate our lookup table by passing the text corpus and K to generateTable(), which we created in step 1.

  • We then convert the frequencies into probabilities with convertFreqIntoProb(), which we created in step 2. The result is stored in the variable model, which steps 5 and 6 use.
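The method described above can be sketched as follows. This is a reconstruction from the explanation, not the original listing: the function name build_markov_model is an assumption, and compact versions of generateTable() and convertFreqIntoProb() are repeated only so the sketch runs on its own.

```python
def generateTable(data, k=4):
    """Step 1: count how often each character follows each k-character window."""
    T = {}
    for i in range(len(data) - k):
        X, Y = data[i:i + k], data[i + k]
        T.setdefault(X, {})
        T[X][Y] = T[X].get(Y, 0) + 1
    return T

def convertFreqIntoProb(T):
    """Step 2: turn the counts for each window into probabilities (in place)."""
    for kx in T:
        s = float(sum(T[kx].values()))
        for ch in T[kx]:
            T[kx][ch] = T[kx][ch] / s
    return T

def build_markov_model(text, k=4):  # hypothetical name
    """Build the lookup table, then convert it into transition probabilities."""
    T = generateTable(text, k)
    T = convertFreqIntoProb(T)
    return T

# In the article's flow you would pass the corpus loaded in step 3 instead.
model = build_markov_model("hello hello helli", k=4)
print(model["hell"])
```

In this tiny example, "hell" is followed by o twice and by i once, so the model assigns them probabilities of 2/3 and 1/3.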


5. Sample the text

Now, we’ll create a sampling function that takes the unfinished word (ctx), the Markov chain model from step 4 (model), and the number of characters used to form the word’s base (k).

We’ll use this function to look up the passed context and return a likely next character, drawn at random in proportion to its probability.

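import numpy as np

def sample_next(ctx,model,k):
    # Only the last k characters form the state the model looks up.
    ctx = ctx[-k:]
    # Unseen context: fall back to a space so generation can continue.
    if model.get(ctx) is None:
        return " "
    possible_Chars = list(model[ctx].keys())
    possible_values = list(model[ctx].values())

    # Show the candidate characters and their probabilities.
    print(possible_Chars)
    print(possible_values)

    # Draw one character at random, weighted by its probability.
    return np.random.choice(possible_Chars,p=possible_values)

print(sample_next("commo",model,4))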

Explanation

  • The function sample_next(ctx,model,k) accepts three parameters: the context, the model, and the value of K.

  • The ctx is the text used as the starting point for generation. Only its last K characters are used by the model to predict the next character in the sequence.

  • For example, we passed commo as the context with K = 4, so the model looks only at the last four characters, ommo, because a Markov model considers only the most recent state. You can confirm this by printing ctx inside the function. If the context never appeared in the corpus, the function returns a space.

  • The two print() calls show the possible next characters and their probabilities, as stored in our model. We got n as the only candidate, with a probability of 1.0. This makes sense because commo is most likely to become common once the next character is added.

  • Finally, np.random.choice() returns one character sampled according to those probabilities.
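Because sample_next() draws at random, two runs can return different characters for the same context. If you want the single most probable character instead, as described in the introduction, a deterministic variant could look like the sketch below. The name most_likely_next and the hand-written toy_model are assumptions for illustration only.

```python
def most_likely_next(ctx, model, k):
    """Return the highest-probability next character, or None for an unseen context."""
    options = model.get(ctx[-k:])
    if not options:
        return None
    # max() with key=options.get picks the character with the largest probability.
    return max(options, key=options.get)

# Hypothetical model fragment in the same shape as the one built in step 4.
toy_model = {
    "ommo": {"n": 1.0},
    "hell": {"o": 2 / 3, "i": 1 / 3},
}

print(most_likely_next("commo", toy_model, 4))
print(most_likely_next("hell", toy_model, 4))
print(most_likely_next("zzzz", toy_model, 4))
```

Always taking the top choice makes word completion repeatable, but for generating long passages it tends to loop over the same phrases, which is one reason the article's function samples instead.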


6. Generate text

Finally, we’ll combine all the above functions to generate some text.

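def generateText(starting_sent,k=4,maxLen=1000):

    sentence = starting_sent
    ctx = starting_sent[-k:]

    # Append one sampled character at a time, then slide the context window.
    for ix in range(maxLen):
        next_prediction = sample_next(ctx,model,k)
        sentence += next_prediction
        ctx = sentence[-k:]
    return sentence

print("Function Created Successfully!")

# Use a new name so the corpus stored in text (step 3) is not overwritten.
generated_text = generateText("dear",k=4,maxLen=2000)
print(generated_text)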

Explanation

  • The above function takes three parameters: the starting text from which you want to generate, the value of K, and the maximum number of characters to generate.
  • If you run the code, you’ll get a speech that starts with “dear” followed by 2000 generated characters.

While the speech likely doesn’t make much sense, the words are all fully formed and generally mimic familiar patterns in words.
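To see the generation loop in isolation, here is a small self-contained sketch that uses only the standard library. It is a variation on the approach above, not the article's code: the name generate, the hand-written toy_model, and the choice to stop at an unseen context (instead of appending a space) are all assumptions.

```python
import random

def generate(start, model, k, max_len, rng):
    """Extend `start` by up to `max_len` characters sampled from the model."""
    sentence = start
    for _ in range(max_len):
        options = model.get(sentence[-k:])
        if not options:
            break  # unseen context: stop here rather than pad with spaces
        chars = list(options)
        weights = list(options.values())
        sentence += rng.choices(chars, weights=weights)[0]
    return sentence

# Hypothetical model with K = 2 in which every context has one certain successor.
toy_model = {
    "ab": {"c": 1.0},
    "bc": {"a": 1.0},
    "ca": {"b": 1.0},
}

print(generate("ab", toy_model, k=2, max_len=4, rng=random.Random(0)))
```

Since every transition here has probability 1.0, the output is always abcabc: the two starting characters plus four generated ones. With a model trained on a real corpus, the same loop produces varied text because several next characters usually compete for each context.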


What to learn next

Congratulations on completing this text generation project. You now have hands-on experience with Natural Language Processing and Markov chain models to use as you continue your deep learning journey.

Your next steps are to adapt the project to produce more understandable output, learn a tool like GPT-3, or try some more awesome machine learning projects like:

  • Pokemon classification system
  • Emoji predictor using NLP
  • Text decryption using recurrent neural network

To walk you through these projects and more, Educative has created Building Advanced Deep Learning and NLP Projects. This course gives you the chance to practice advanced deep learning concepts as you complete interesting and unique projects like the one we did today. By the end, you’ll have the experience to use any of the top deep learning algorithms on your own projects.

Happy learning!

