Project Creation: Part Two

In this lesson, we will perform some preprocessing on our dataset.

Padding

In the previous lesson, we preprocessed our data and created a numeric representation of the test sentences. We will use the same functions to work with our original dataset.

First, we will create the padding functionality.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def pad(x, length=None):
    # Pad every sequence in x with trailing zeros up to `length`
    # (by default, the length of the longest sequence).
    if length is None:
        length = max([len(sentence) for sentence in x])
    return pad_sequences(x, maxlen=length, padding='post')

test_pad = pad(text_tokenized)
for sample_i, (token_sent, pad_sent) in enumerate(zip(text_tokenized, test_pad)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(np.array(token_sent)))
    print('  Output: {}'.format(pad_sent))

Explanation:

  • First, we imported the required packages.

  • Next, we defined the pad() function. If no length is provided, it finds the length of the longest sequence in the batch. It then calls pad_sequences() with the padding='post' parameter, which appends extra 0's to the end of each sequence until it reaches that maximum length.

  • We then called the pad() function on the sequences that we created in the previous lesson.

  • Finally, we printed each sequence without padding and again after padding. Take a look at the output for one of the sequences below.

    Sequence 1 in x
    Input:  [ 4  7  2  1 16 10  5 11 17  1 18  8  3 19 12  1 20  3 21  1 22 10 23 14
    6  1  3 24  2  8  1  4  7  2  1 25 13 26  9  1 27  3 28  1 15]
    Output: [ 4  7  2  1 16 10  5 11 17  1 18  8  3 19 12  1 20  3 21  1 22 10 23 14
    6  1  3 24  2  8  1  4  7  2  1 25 13 26  9  1 27  3 28  1 15  0  0  0
    0  0  0  0  0  0]
    

    You can see that ...
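To make the padding behavior concrete without pulling in TensorFlow, here is a small pure-NumPy sketch of what pad_sequences(..., padding='post') does. The function name pad_post and the toy sequences are our own for illustration; note that, like Keras, longer sequences are truncated from the front by default (truncating='pre').

```python
import numpy as np

def pad_post(sequences, length=None):
    # Mimic pad_sequences(..., padding='post'): append 0's to the
    # end of each sequence until it reaches `length`.
    if length is None:
        length = max(len(seq) for seq in sequences)
    padded = np.zeros((len(sequences), length), dtype=int)
    for i, seq in enumerate(sequences):
        trunc = seq[-length:]  # Keras truncates from the front by default
        padded[i, :len(trunc)] = trunc
    return padded

toy = [[4, 7, 2], [1, 16], [10, 5, 11, 17]]
print(pad_post(toy))
# Every row now has the length of the longest sequence,
# with zeros filling the gap at the end.
```

Because the zeros always go at the end, the original token order is preserved, which is what the model will expect when it reads each sequence from left to right.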