
Overview of spaCy Conventions

Explore spaCy's conventions to understand its text processing pipeline and core components like tokens, Doc, and Vocab. Learn how spaCy handles tokenization, tagging, parsing, and entities through an efficient pipeline to simplify NLP development.


Overview of spaCy

Every NLP application consists of several text-processing steps. As we saw previously, we have always created instances called nlp and doc. But what exactly did we do?

When we call nlp on our text, spaCy applies some processing steps. The first step is tokenization to produce a Doc object. The Doc object is then processed further with a tagger, a parser, and an entity recognizer. This way of processing the text is called a language processing pipeline. Each pipeline component returns the processed Doc and then passes it to the next component:

A high-level overview of the processing pipeline

A spaCy pipeline object is created when we load a language model. We load an English model and initialize a pipeline in the following code segment:

import spacy

# Load the medium-sized English model and build its processing pipeline
nlp = spacy.load("en_core_web_md")
# Run the pipeline on a sample sentence to get a Doc object
doc = nlp("I went there")

What happened exactly in the preceding code is as follows:

  • We started by importing spaCy.

  • In the second line, spacy.load() returned a Language class instance, nlp. The Language class is the text processing pipeline.

  • After that, we applied nlp to the sample sentence I went there and got a Doc class instance, doc. The quick check below confirms these types.
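
If we want to verify these types ourselves, a short sketch like the following works. The exact class name printed for nlp depends on the loaded language; English models print a Language subclass:

import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("I went there")

# nlp is an instance of a Language subclass specialized for English
print(type(nlp))   # e.g., <class 'spacy.lang.en.English'>
# doc is the container holding the processed tokens
print(type(doc))   # <class 'spacy.tokens.doc.Doc'>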

The Language class applies all of the preceding pipeline steps to our input sentence behind the scenes. After applying nlp to the sentence, the Doc object contains tokens that are tagged, lemmatized, and marked as entities if the token is an entity (we will go into detail about what those are and how it's done later). Each pipeline component has a well-defined task, as seen in the table below:

| Name | Component | Creates | Description |
| --- | --- | --- | --- |
| tokenizer | Tokenizer | Doc | Segment text into tokens. |
| tagger | Tagger | Doc[i].tag | Assign part-of-speech tags. |
| parser | DependencyParser | Doc[i].head, Doc[i].dep, Doc.sents, Doc.noun_chunks | Assign dependency labels. |
| ner | EntityRecognizer | Doc.ents, Doc[i].ent_iob, Doc[i].ent_type | Detect and label named entities. |
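
We can inspect these components, and the annotations they create, directly on the nlp and doc objects. The following sketch assumes the en_core_web_md model from the earlier example; the exact component names printed by pipe_names vary across spaCy versions (v3 pipelines also list pipes such as tok2vec and lemmatizer):

import spacy

nlp = spacy.load("en_core_web_md")
# The pipeline components, in the order they run after tokenization
print(nlp.pipe_names)

doc = nlp("I went there")
for token in doc:
    # tag_ comes from the tagger; dep_ and head come from the parser
    print(token.text, token.tag_, token.dep_, token.head.text)
# Entities found by the ner component (a short sentence may have none)
print(doc.ents)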

The spaCy language processing pipeline always depends on the statistical model and its capabilities. This is why we always load a language model with spacy.load() as the first step in our code.

Each component corresponds to a spaCy class. The spaCy classes have self-explanatory names such as Language, Doc, and Vocab. We already used Language and Doc classes—let's see all of the processing pipeline classes and their duties:

Processing Pipeline

| Type | Description |
| --- | --- |
| Language | A text processing pipeline. Usually, we load this once per process as nlp and pass the instance around our application. |
| Tokenizer | Segment text and create Doc objects with the discovered segment boundaries. |
| Lemmatizer | Determine the base form of words. |
| Morphology | Assign linguistic features like lemmas, noun case, verb tense, etc., based on the word and its part-of-speech tag. |
| Tagger | Annotate part-of-speech tags on Doc objects. |
| DependencyParser | Annotate syntactic dependencies on Doc objects. |
| EntityRecognizer | Annotate named entities, e.g., persons or products, on Doc objects. |
| Matcher | Match sequences of tokens based on pattern rules, similar to regular expressions. |
| PhraseMatcher | Match sequences of tokens based on phrases. |
| EntityRuler | Add entity spans to the Doc using token-based rules or exact phrase matches. |
| Sentencizer | Implement custom sentence boundary detection logic that doesn't require the dependency parse. |
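
To get a feel for the rule-based classes, here is a minimal Matcher sketch. It uses the spaCy v3 calling convention matcher.add(name, [pattern]); in spaCy v2 the equivalent call is matcher.add(name, None, pattern):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_md")
matcher = Matcher(nlp.vocab)

# Match the token "went" followed by "there", case-insensitively
pattern = [{"LOWER": "went"}, {"LOWER": "there"}]
matcher.add("WENT_THERE", [pattern])

doc = nlp("I went there")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # prints "went there"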

We shouldn't be intimidated by the number of classes; each class has unique features to help us process our text better.

There are more data structures to represent text data and language data. Container classes such as Doc hold information about sentences, words, and text. There are also container classes other than Doc:

Container objects

| Name | Description |
| --- | --- |
| Doc | A container for accessing linguistic annotations. |
| Span | A slice from a Doc object. |
| Token | An individual token, i.e., a word, punctuation symbol, whitespace, etc. |
| Lexeme | An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse, etc. |
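
All four containers are easy to see side by side. A short sketch, again assuming the model loaded earlier:

import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("I went there")

span = doc[0:2]             # Span: a slice of the Doc ("I went")
token = doc[1]              # Token: the single word "went"
lexeme = nlp.vocab["went"]  # Lexeme: a context-free vocabulary entry

print(span.text, "|", token.text, "|", lexeme.text)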

Finally, spaCy provides helper classes for vectors, language vocabulary, and annotations. We'll see the Vocab class often in this course; it represents a language's vocabulary and contains all the words of the language model we loaded:

Other classes

| Name | Description |
| --- | --- |
| Vocab | A lookup table for the vocabulary that allows us to access Lexeme objects. |
| StringStore | Map strings to and from hash values. |
| Vectors | Container class for vector data keyed by string. |
| GoldParse | Collection for training annotations. |
| GoldCorpus | An annotated corpus using the JSON file format. Manages annotations for tagging, dependency parsing, and NER. |
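
A small sketch shows how Vocab and StringStore work together. Processing a text first ensures its strings are in the store; the word "coffee" here is just an arbitrary example:

import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("I love coffee")  # processing the text adds its strings to the store

# The StringStore maps strings to 64-bit hashes and back
coffee_hash = nlp.vocab.strings["coffee"]
print(coffee_hash)                     # a large integer
print(nlp.vocab.strings[coffee_hash])  # "coffee"

# Lexemes live in the Vocab and carry context-free attributes
lexeme = nlp.vocab["coffee"]
print(lexeme.is_alpha, lexeme.lower_)  # True coffee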

The spaCy library's backbone data structures are Doc and Vocab. The Doc object abstracts the text by owning the sequence of tokens and all their properties. The Vocab object provides a centralized set of strings and lexical attributes to all the other classes. This way, spaCy avoids storing multiple copies of linguistic data:
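
A quick illustration of this sharing: every Doc points back to the pipeline's Vocab rather than owning a copy, and tokens store hashes that are resolved through the shared string store:

import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("I went there")

# The Doc shares the pipeline's Vocab instead of duplicating it
print(doc.vocab is nlp.vocab)  # True

# orth is the stored hash; orth_ resolves it through the shared Vocab
print(doc[1].orth)   # integer hash for "went"
print(doc[1].orth_)  # "went"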

spaCy architecture

We can divide the objects composing the preceding spaCy architecture into two groups: containers and processing pipeline components. We'll first learn about two basic components, Tokenizer and Lemmatizer, and then we'll explore the container objects further.

spaCy does all these operations for us behind the scenes, allowing us to concentrate on our own application's development. With this level of abstraction, it's no coincidence that spaCy is a popular choice for NLP application development. Let's start with the Tokenizer class and see what it offers us; then, we will explore all the container classes one by one.