Named Entity Recognition with RNNs: Preparing Data

Learn how to use RNNs to identify various entities mentioned in a text corpus.

We'll cover the following:

Understanding the data
Processing data

Now, let’s look at our first task: using an RNN to identify named entities in a text corpus. This task is known as named entity recognition (NER). We’ll be using a modified version of the well-known Conference on Computational Natural Language Learning 2003 (CoNLL 2003) dataset for NER.

CoNLL 2003 is available for multiple languages, and the English data was generated from a Reuters corpus that contains news stories published between August 1996 and August 1997. The dataset we’ll be using is found on the website and is called CoNLLPP. It’s a more carefully curated version of the original CoNLL 2003 data, which contains labeling errors caused by misreading the context of a word. For example, in the phrase “Chicago won ...”, Chicago was labeled as a location, whereas it’s actually an organization.

Understanding the data

We have defined a function called download_data(), which can be used to download the data. We won’t go into the details of it because it simply downloads several files and places them in a data folder. Once the download finishes, we’ll have three files:

data\conllpp_train.txt: A training set that contains 14,041 sentences.
data\conllpp_dev.txt: A validation set that contains 3,250 sentences.
data\conllpp_test.txt: A test set that contains 3,452 sentences.
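
As a quick sanity check on the sentence counts above, the files can be split into sentences and counted. The following is a minimal sketch, not the course's own loader. It assumes the CoNLLPP files follow the usual CoNLL 2003 layout: one token per line with whitespace-separated columns, the NER tag in the last column, a blank line between sentences, and -DOCSTART- lines marking document boundaries. The helper name read_conll_sentences is hypothetical.

```python
from pathlib import Path


def read_conll_sentences(path):
    """Return a list of sentences, each a list of (token, ner_tag) pairs.

    Assumes the standard CoNLL 2003 layout (an assumption, not confirmed
    by the lesson): blank lines separate sentences and -DOCSTART- lines
    mark document boundaries.
    """
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # A blank line or a document marker closes the current sentence.
            if not line or line.startswith("-DOCSTART-"):
                if current:
                    sentences.append(current)
                    current = []
                continue
            fields = line.split()
            # First column is the token; last column is the NER tag.
            current.append((fields[0], fields[-1]))
    # Keep the final sentence if the file does not end with a blank line.
    if current:
        sentences.append(current)
    return sentences


# Only runs if the data has already been downloaded into the data folder.
train_path = Path("data") / "conllpp_train.txt"
if train_path.exists():
    print(len(read_conll_sentences(train_path)), "training sentences")
```

If the layout assumption holds, the printed count should match the number listed for the training set. A mismatch would suggest the files are structured differently from what this sketch expects.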

Next up, we’ll read the data and convert it into a specific format that suits our model. But before that, we need to see what our data looks like originally:
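
One simple way to inspect a raw file is to print its first few lines. This is a minimal sketch that assumes the files are plain UTF-8 text stored in the data folder described above. The helper name peek_lines is hypothetical and not part of the lesson's code.

```python
from itertools import islice
from pathlib import Path


def peek_lines(path, n=10):
    """Return the first n lines of a text file without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        # islice stops after n lines, so large files are not fully read.
        return [line.rstrip("\n") for line in islice(f, n)]


# Only runs if the data has already been downloaded into the data folder.
train_path = Path("data") / "conllpp_train.txt"
if train_path.exists():
    for line in peek_lines(train_path, n=15):
        print(line)
```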
