Extracting Named Entities
Explore how to extract named entities from text using spaCy’s Matcher to improve NLP applications. This lesson guides you through analyzing the ATIS dataset, identifying key entities such as locations, organizations, dates, times, and abbreviations, and applying custom patterns for accurate extraction. Understand how named entities contribute to semantic parsing and enhance NLP pipeline development.
In many NLP applications, including semantic parsing, we start looking for meaning in a text by examining the entity types and placing an entity extraction component into our NLP pipelines. Named entities play a key role in understanding the meaning of user text.
We'll start our semantic parsing pipeline by extracting the named entities from our corpus. To understand which entities we want to extract, we'll first get to know the ATIS dataset.
Getting to know the ATIS dataset
Throughout this chapter, we'll work with the ATIS (Airline Travel Information System) corpus. ATIS is a well-known dataset and one of the standard benchmarks for intent classification. It consists of utterances from customers who want to book a flight or get information about flights, including costs, destinations, and timetables.
No matter what the NLP task is, you should always read through your corpus yourself before writing any code. Getting to know the corpus lets us build our observations into the code. While viewing our text data, we usually keep an eye on the following:
What kind of utterances are there? Is it a short text corpus, or does the corpus consist of long documents or medium-length paragraphs?
What sort of entities does the corpus include? People's names, city names, country names, organization names, and so on. Which ones do we want to extract?
How is punctuation used? Is the text correctly punctuated, or is no punctuation used at all?
How closely are grammatical rules followed? Is the capitalization correct? Are there misspelled words?
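Some of these questions can get a rough first answer programmatically before reading the corpus in detail. The following is a minimal sketch using only the Python standard library. The sample utterances are made up in the style of a flight-booking corpus (they are not real ATIS rows), and the summarize_corpus helper is a hypothetical name introduced here for illustration.

```python
import string
from statistics import mean

# Made-up utterances in the style of a flight-booking corpus (not real ATIS rows).
utterances = [
    "i want to fly from boston to denver on monday morning",
    "what is the cheapest fare from dallas to atlanta",
    "Show me flights from Pittsburgh to San Francisco.",
]


def summarize_corpus(texts):
    """Return rough length, punctuation, and capitalization statistics."""
    # Whitespace tokenization is crude but enough for a first look.
    token_counts = [len(text.split()) for text in texts]
    # Count utterances containing at least one punctuation character.
    punctuated = sum(any(ch in string.punctuation for ch in text) for text in texts)
    # Count utterances containing at least one uppercase letter.
    capitalized = sum(any(ch.isupper() for ch in text) for text in texts)
    return {
        "num_utterances": len(texts),
        "avg_tokens": mean(token_counts),
        "max_tokens": max(token_counts),
        "share_with_punctuation": punctuated / len(texts),
        "share_with_uppercase": capitalized / len(texts),
    }


stats = summarize_corpus(utterances)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Numbers like these only hint at what the corpus looks like. Questions such as which entity types to extract still require reading the utterances themselves.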
Before starting any processing, we'll examine our corpus. Let's go ahead and load the dataset:
The dataset is a two-column CSV file. First, we'll get some insights into the dataset statistics with pandas. pandas is a popular data manipulation library that is frequently used by data scientists:
Let's begin by reading the CSV file into Python. We'll use the read_csv method of ...
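As a rough illustration of this step, the sketch below reads a two-column CSV with pandas and prints a few basic statistics. The lesson's actual file name and column names are not shown here, so the example reads from an in-memory string instead. The column names utterance and intent and the three rows are assumptions invented for illustration, not real ATIS data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: column names and rows are invented.
SAMPLE_CSV = """utterance,intent
i want to fly from boston to denver on monday morning,flight
what is the cheapest fare from dallas to atlanta,airfare
show me flights from pittsburgh to san francisco,flight
"""

# read_csv accepts a file path or any file-like object, so the StringIO
# wrapper can be swapped for the path to the actual CSV file.
dataset = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Number of rows and columns.
print(dataset.shape)

# First few rows, to eyeball the utterances.
print(dataset.head())

# How many utterances fall under each label.
print(dataset["intent"].value_counts())
```

With the real dataset, the same three calls (shape, head, and value_counts) give a quick view of the corpus size, the look of the utterances, and how the labels are distributed.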