Extracting Named Entities

Let's see how we can extract named entities.

In many NLP applications, including semantic parsing, we start looking for meaning in a text by examining the entity types and placing an entity extraction component into our NLP pipelines. Named entities play a key role in understanding the meaning of user text.

We'll also start a semantic parsing pipeline by extracting the named entities from our corpus. To understand what sort of entities we want to extract, first, we'll get to know the ATIS dataset.

Getting to know the ATIS dataset

Throughout this chapter, we'll work with the ATIS corpus. ATIS is a well-known dataset; it's one of the standard benchmark datasets for intent classification. The dataset consists of utterances from customers who want to book a flight or get information about flights, including flight costs, flight destinations, and timetables.

No matter what the NLP task is, you should always read through your corpus with your own eyes. We want to get to know our corpus so that we can build our observations of it into our code. While viewing our text data, we usually keep an eye on the following:

  • What kind of utterances are there? Is it a short text corpus, or does the corpus consist of long documents or medium-length paragraphs?

  • What sort of entities does the corpus include? People's names, city names, country names, organization names, and so on. Which ones do we want to extract?

  • How is punctuation used? Is the text correctly punctuated, or is no punctuation used at all?

  • How closely are the grammatical rules followed? Is the capitalization correct? Are there misspelled words?
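The checks in the list above can be roughed out in a few lines of Python before any real processing. The following is a minimal sketch, assuming the utterances are already available as a list of strings; the function name corpus_overview and the sample sentences are hypothetical and are not taken from the ATIS files. It only uses the standard library and whitespace tokenization, so the numbers are rough indicators rather than precise statistics.

```python
import string
from statistics import mean


def corpus_overview(utterances):
    """Return a few rough statistics for a list of utterance strings."""
    # Whitespace tokenization is crude, but enough for a first look at length.
    token_counts = [len(u.split()) for u in utterances]
    # Does the utterance contain any ASCII punctuation character?
    has_punct = [any(ch in string.punctuation for ch in u) for u in utterances]
    # Does the utterance contain any uppercase letter?
    has_upper = [any(ch.isupper() for ch in u) for u in utterances]
    return {
        "num_utterances": len(utterances),
        "mean_tokens": mean(token_counts),
        "max_tokens": max(token_counts),
        "share_with_punctuation": sum(has_punct) / len(utterances),
        "share_with_uppercase": sum(has_upper) / len(utterances),
    }


# Hypothetical sample utterances, for illustration only.
sample = [
    "i want to fly from boston to denver",
    "what flights are available from pittsburgh to baltimore on thursday morning",
    "Show me the cheapest fare, please.",
]

overview = corpus_overview(sample)
print(overview)
```

A low share of punctuated or capitalized utterances suggests that any component relying on casing or sentence boundaries may need extra care on this corpus.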

Before starting any processing, we'll examine our corpus, so let's go ahead and load the dataset.

The dataset is a two-column CSV file. First, we'll get some insights into the dataset statistics with pandas. pandas is a popular data manipulation library that is frequently used by data scientists:

  1. Let's begin by reading the CSV file into Python. We'll use the read_csv function of pandas:
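As a hedged illustration of this step, the sketch below reads a small in-memory stand-in for the two-column file. The column names (text, intent), the two sample rows, and the file path mentioned in the comment are assumptions made for this example; the real file may have no header row or may use different names, in which case the arguments to read_csv would need adjusting.

```python
import io

import pandas as pd

# Stand-in for the real CSV file: one column of utterance text, one of labels.
# Column names and rows are assumptions for illustration only.
csv_text = """text,intent
i want to fly from boston to denver,atis_flight
what is the cheapest fare from dallas to atlanta,atis_airfare
"""

# read_csv accepts a file path or any file-like object. With the real data
# you would pass the path instead, for example
# pd.read_csv("data/atis_intents.csv")  (hypothetical path).
dataset = pd.read_csv(io.StringIO(csv_text))

# First look: size, leading rows, and label distribution.
print(dataset.shape)
print(dataset.head())
print(dataset["intent"].value_counts())
```

Checking the shape and the label counts right after loading is a quick way to confirm that the file was parsed into the expected two columns.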
