
Overview of spaCy Conventions

Explore spaCy's conventions to understand its text processing pipeline and core components like tokens, Doc, and Vocab. Learn how spaCy handles tokenization, tagging, parsing, and entities through an efficient pipeline to simplify NLP development.


Overview of spaCy

Every NLP application consists of several text-processing steps. As we saw previously, we have always created instances called nlp and doc. But what exactly did we do?

When we call nlp on our text, spaCy applies some processing steps. The first step is tokenization to produce a Doc object. The Doc object is then processed further with a tagger, a parser, and an entity recognizer. This way of processing the text is called a language processing pipeline. Each pipeline component returns the processed Doc and then passes it to the next component:

A high-level overview of the processing pipeline

A spaCy pipeline object is created when we load a language model. We load an English model and initialize a pipeline in the following code segment:

import spacy

# Load the medium-sized English model and build its processing pipeline
nlp = spacy.load("en_core_web_md")
# Run the pipeline on a sample sentence to get a Doc object
doc = nlp("I went there")

What happened exactly in the preceding code is as follows:

  • We started by importing spaCy.

  • In the second line, spacy.load() returned a Language class instance, nlp. The Language class is the text processing pipeline.

  • After that, we applied nlp to the sample sentence I went there and got a Doc class instance, doc. The quick check below confirms these types.
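
If we want to verify these types ourselves, a short sketch like the following works. The exact class name printed for nlp depends on the loaded language; English models print a Language subclass:

import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("I went there")

# nlp is an instance of a Language subclass specialized for English
print(type(nlp))   # e.g., <class 'spacy.lang.en.English'>
# doc is the container holding the processed tokens
print(type(doc))   # <class 'spacy.tokens.doc.Doc'>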

The Language class applies all of the preceding pipeline steps to our input sentence behind the scenes. After applying nlp to the sentence, the Doc object contains tokens that are tagged, lemmatized, and marked as entities if the token is an entity (we will go into detail about what those are and how it's done later). Each pipeline component has a well-defined task, as seen in the table below:

| Name | Component | Creates | Description |
| --- | --- | --- | --- |
| tokenizer | Tokenizer | Doc | Segment text into tokens. |
| tagger | Tagger | Doc[i].tag | Assign part-of-speech tags. |
| parser | DependencyParser | Doc[i].head, Doc[i].dep, Doc.sents, Doc.noun_chunks | Assign dependency labels. |
| ner | EntityRecognizer | Doc.ents, Doc[i].ent_iob, Doc[i].ent_type | Detect and label named entities. |
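
We can inspect these components, and the annotations they create, directly on the nlp and doc objects. The following sketch assumes the en_core_web_md model from the earlier example; the exact component names printed by pipe_names vary across spaCy versions (v3 pipelines also list pipes such as tok2vec and lemmatizer):

import spacy

nlp = spacy.load("en_core_web_md")
# The pipeline components, in the order they run after tokenization
print(nlp.pipe_names)

doc = nlp("I went there")
for token in doc:
    # tag_ comes from the tagger; dep_ and head come from the parser
    print(token.text, token.tag_, token.dep_, token.head.text)
# Entities found by the ner component (a short sentence may have none)
print(doc.ents)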

The spaCy language processing pipeline always depends on the statistical model and its capabilities. This is why we always load a language model with spacy.load() as the first step in our code.

Each component corresponds to a spaCy class. The spaCy classes have self-explanatory names such as Language, Doc, and Vocab. We already used Language and Doc classes—let's see all of the processing pipeline classes and their duties:

Processing Pipeline

| Type | Description |
| --- | --- |
| Language | A text processing pipeline. Usually, we load this once per process as nlp and pass the instance around our application. |
| Tokenizer | Segment text and create Doc objects with the discovered segment boundaries. |
| Lemmatizer | Determine the base form of words. |
| Morphology | Assign linguistic features like lemmas, noun case, verb tense, etc., based on the word and its part-of-speech tag. |
| Tagger | Annotate part-of-speech tags on Doc objects. |
| DependencyParser | Annotate syntactic dependencies on Doc objects. |
| EntityRecognizer | Annotate named entities, e.g., persons or products, on Doc objects. |
| Matcher | Match sequences of tokens based on pattern rules, similar to regular expressions. |
| PhraseMatcher | Match sequences of tokens based on phrases. |
| EntityRuler | Add entity spans to the Doc using token-based rules or exact phrase matches. |
| Sentencizer | Implement custom sentence boundary detection logic that doesn't require the dependency parse. |
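
To get a feel for the rule-based classes, here is a minimal Matcher sketch. It uses the spaCy v3 calling convention matcher.add(name, [pattern]); in spaCy v2 the equivalent call is matcher.add(name, None, pattern):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_md")
matcher = Matcher(nlp.vocab)

# Match the token "went" followed by "there", case-insensitively
pattern = [{"LOWER": "went"}, {"LOWER": "there"}]
matcher.add("WENT_THERE", [pattern])

doc = nlp("I went there")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # prints "went there"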

We shouldn't be intimidated by the number of classes; each class has unique features to help us process our text better.

There are more data structures to represent text data and language data. Container classes such as Doc hold information about sentences, words, and text. There are also container classes other than Doc:

Container objects

| Name | Description |
| --- | --- |
| Doc | A container for accessing linguistic annotations. |
| Span | A slice from a Doc object. |
| Token | An individual token, i.e., a word, punctuation symbol, whitespace, etc. |
| Lexeme | An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse, etc. |
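
All four containers are easy to see side by side. A short sketch, again assuming the model loaded earlier:

import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("I went there")

span = doc[0:2]             # Span: a slice of the Doc ("I went")
token = doc[1]              # Token: the single word "went"
lexeme = nlp.vocab["went"]  # Lexeme: a context-free vocabulary entry

print(span.text, "|", token.text, "|", lexeme.text)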

Finally, spaCy provides helper classes for vectors, language vocabulary, and annotations. We'll see the Vocab class often in this course; it represents a language's vocabulary and contains all the words of the language model we loaded:

Other classes

| Name | Description |
| --- | --- |
| Vocab | A lookup table for the vocabulary that allows us to access Lexeme objects. |
| StringStore | Map strings to and from hash values. |
| Vectors | Container class for vector data keyed by string. |
| GoldParse | Collection for training annotations. |
| GoldCorpus | An annotated corpus using the JSON file format. Manages annotations for tagging, dependency parsing, and NER. |
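
A small sketch shows how Vocab and StringStore work together. Processing a text first ensures its strings are in the store; the word "coffee" here is just an arbitrary example:

import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("I love coffee")  # processing the text adds its strings to the store

# The StringStore maps strings to 64-bit hashes and back
coffee_hash = nlp.vocab.strings["coffee"]
print(coffee_hash)                     # a large integer
print(nlp.vocab.strings[coffee_hash])  # "coffee"

# Lexemes live in the Vocab and carry context-free attributes
lexeme = nlp.vocab["coffee"]
print(lexeme.is_alpha, lexeme.lower_)  # True coffee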

The spaCy library's backbone data structures are Doc and Vocab. The Doc object abstracts the text by owning the sequence of tokens and all their properties. The Vocab object provides a centralized set of strings and lexical attributes to all the other classes. This way, spaCy avoids storing multiple copies of linguistic data:
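
A quick illustration of this sharing: every Doc points back to the pipeline's Vocab rather than owning a copy, and tokens store hashes that are resolved through the shared string store:

import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("I went there")

# The Doc shares the pipeline's Vocab instead of duplicating it
print(doc.vocab is nlp.vocab)  # True

# orth is the stored hash; orth_ resolves it through the shared Vocab
print(doc[1].orth)   # integer hash for "went"
print(doc[1].orth_)  # "went"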

spaCy architecture

We can divide the objects composing the preceding spaCy architecture into two groups: containers and processing pipeline components. We'll first learn about two basic components, Tokenizer and Lemmatizer, and then we'll explore the container objects further.

spaCy does all these operations for us behind the scenes, allowing us to concentrate on our own application's development. With this level of abstraction, it's no coincidence that spaCy is a popular choice for NLP application development. Let's start with the Tokenizer class and see what it offers us; then, we will explore all the container classes one by one.