Named Entity Recognition with RNNs: Preparing Data
Learn how to use RNNs to identify various entities mentioned in a text corpus.
Now, let’s look at our first task: using an RNN to identify named entities in a text corpus. This task is known as named entity recognition (NER). We’ll be using a modified version of the well-known Conference on Computational Natural Language Learning 2003 (CoNLL 2003) dataset for NER.
CoNLL 2003 is available for multiple languages; the English data was generated from a Reuters corpus of news stories published between August 1996 and August 1997. The dataset we’ll be using is found on the website and is called CoNLLPP. It’s a more carefully curated version of the original CoNLL 2003 dataset, which contains labeling errors caused by misreading the context of a word. For example, in the phrase “Chicago won ...”, Chicago was labeled as a location, whereas it actually refers to an organization.
Understanding the data
We have defined a function called download_data(), which can be used to download the data. We won’t go into its details because it simply downloads several files and places them in a data folder. Once the download finishes, we’ll have three files:
data\conllpp_train.txt: A training set that contains 14,041 sentences.
data\conllpp_dev.txt: A validation set that contains 3,250 sentences.
data\conllpp_test.txt: A test set that contains 3,452 sentences.
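As a quick sanity check on the sentence counts listed above, the sketch below counts the sentences in each file. It is not part of the lesson's own code. It assumes the files follow the usual CoNLL layout: one token per line, a blank line between sentences, and -DOCSTART- lines marking document boundaries. The helper name count_sentences is hypothetical.

```python
from pathlib import Path
from typing import Iterable


def count_sentences(lines: Iterable[str]) -> int:
    """Count sentences in CoNLL-style text, where a blank line ends a sentence."""
    count = 0
    in_sentence = False
    for line in lines:
        line = line.strip()
        # Assumption: -DOCSTART- lines mark document boundaries and are
        # not sentences, so they are skipped.
        if line.startswith("-DOCSTART-"):
            continue
        if not line:
            if in_sentence:
                count += 1
            in_sentence = False
        else:
            in_sentence = True
    # Count a final sentence that is not followed by a blank line.
    return count + (1 if in_sentence else 0)


if __name__ == "__main__":
    for name in ("conllpp_train.txt", "conllpp_dev.txt", "conllpp_test.txt"):
        path = Path("data") / name
        if path.exists():
            with path.open(encoding="utf-8") as f:
                print(name, count_sentences(f))
        else:
            print(name, "not found; run download_data() first")
```

If the layout assumption holds, the printed counts should match the figures above. A mismatch would suggest that the files use a different separator or marker convention.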
Next up, we’ll read the data and convert it into a specific format that suits our model. But before that, we need to see what our data looks like originally:
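One minimal way to take that first look is to print the first few raw lines of the training file. The sketch below is an illustration rather than the lesson's own code: the helper name head is hypothetical, and the path assumes download_data() has already placed the files in the data folder.

```python
from itertools import islice
from pathlib import Path


def head(lines, n=10):
    """Return the first n lines from an iterable of lines, without trailing newlines."""
    return [line.rstrip("\n") for line in islice(lines, n)]


if __name__ == "__main__":
    # Assumed location of the training file after the download step.
    train_path = Path("data") / "conllpp_train.txt"
    if train_path.exists():
        with train_path.open(encoding="utf-8") as f:
            for line in head(f):
                # repr() makes blank separator lines visible as ''.
                print(repr(line))
    else:
        print("Training file not found; run download_data() first")
```

Because head reads only the first n lines, it avoids loading the whole file just to inspect its layout.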