
High-level Overview of the spaCy Library

Explore spaCy, an industrial-strength Python library designed for natural language processing tasks. Understand its features including efficient tokenization, named entity recognition, part-of-speech tagging, and how it compares to other NLP libraries. This lesson prepares you to use spaCy for practical NLP applications with pre-trained models and seamless integration.


What is spaCy?

spaCy is an open-source Python library for modern NLP. Its creators describe it as industrial-strength NLP. spaCy supports 60+ languages and ships pre-trained language models and word vectors for a subset of them.
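As a minimal sketch of what language support looks like in practice, the snippet below creates two pipelines with `spacy.blank()`, which contain only the language-specific tokenizer and therefore need no model download. The example sentences are made up for illustration. A trained pipeline such as `en_core_web_sm` is installed separately (for example with `python -m spacy download en_core_web_sm`) and loaded with `spacy.load()`; that step is left out here so the sketch runs on a bare spaCy install.

```python
import spacy

# spacy.blank() builds a pipeline with only the language-specific tokenizer,
# so this sketch does not require downloading a trained model.
nlp_en = spacy.blank("en")
nlp_de = spacy.blank("de")

# Illustrative sentences (not from the course material).
doc_en = nlp_en("spaCy doesn't need a trained model to tokenize text.")
doc_de = nlp_de("spaCy unterstützt auch Deutsch.")

tokens_en = [token.text for token in doc_en]
tokens_de = [token.text for token in doc_de]

print(tokens_en)
print(tokens_de)
print(nlp_en.pipe_names)  # a blank pipeline has no components such as a tagger or parser
```

Because the blank pipelines have no tagger, parser, or entity recognizer, attributes such as part-of-speech tags stay empty until a trained pipeline is loaded.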

spaCy focuses on production and shipping code, unlike its more academic predecessors. The best-known and most frequently used Python predecessor is NLTK. NLTK's main goal was to give students and researchers an introduction to language processing. It never claimed to be efficient, to offer highly accurate models, or to be an industrial-strength library. spaCy, by contrast, has aimed to provide production-ready code from the first day. You can expect its models to perform well on real-world data, its code to be efficient, and large amounts of text to be processed in a reasonable time. The following table is an efficiency comparison from the spaCy documentation.

SYSTEM     TOKENIZE   TAG      PARSE
spaCy      0.2ms      1ms      19ms
CoreNLP    0.18ms     10ms     49ms
ZPar       1ms        8ms      850ms
NLTK       4ms        443ms    n/a
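To illustrate how large amounts of text are typically fed through spaCy, here is a minimal sketch that streams several texts through one pipeline with `nlp.pipe()`. It assumes a blank English pipeline (tokenizer only), and the sample texts and the `batch_size` value are illustrative choices. It does not reproduce the timings above.

```python
import spacy

# Tokenizer-only pipeline; a trained pipeline would be used the same way.
nlp = spacy.blank("en")

# Illustrative input texts.
texts = [
    "spaCy is built for production use.",
    "It can stream many documents through one pipeline.",
    "Batching reduces overhead.",
]

# nlp.pipe() yields Doc objects lazily and processes the texts in batches,
# which is usually faster than calling nlp(text) once per text.
token_counts = [len(doc) for doc in nlp.pipe(texts, batch_size=2)]

print(token_counts)
```

Since `nlp.pipe()` returns a generator, the documents can be consumed one at a time without holding the whole corpus in memory.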

The spaCy code is also professionally maintained: issues are sorted by label, and new releases cover as many fixes as possible. Anyone can raise an issue on the spaCy GitHub repo.

Another predecessor is CoreNLP (also known as Stanford CoreNLP), which is implemented in Java. Although CoreNLP is competitive in terms of efficiency, Python makes prototyping easier, and spaCy is more polished as a software package. The code is well maintained, issues are tracked on GitHub, and every issue is marked with labels (such as bug, feature, or new project). Installing the library and its models is also easy. Together with backward compatibility, this makes spaCy a professional software project. The table below compares spaCy with the other NLP libraries.

Feature comparison


Feature                    spaCy     NLTK      CoreNLP
Programming language       Python    Python    Java/Python
Neural network models      Yes       No        Yes
Integrated word vectors    Yes       No        No
Multi-language support     Yes       Yes       Yes
Tokenization               Yes       Yes       Yes
Part-of-speech tagging     Yes       Yes       Yes
Sentence segmentation      Yes       Yes       Yes
Dependency parsing         Yes       No        Yes
Entity recognition         Yes       Yes       Yes
Entity linking             Yes       Yes       No
Coreference resolution     No        No        Yes

Throughout this course, we will be using spaCy's latest release (the version used at the time of writing this course) for all our computational linguistics and ML purposes. The following are the features in the latest release:

  • Non-destructive tokenization that preserves the original text.

  • Statistical sentence segmentation.

  • Named entity recognition.

  • Part-of-speech (POS) tagging.

  • Dependency parsing.

  • Pre-trained word vectors.

  • Easy integration with popular deep learning libraries. spaCy's ML library, Thinc, provides thin wrappers around PyTorch, TensorFlow, and MXNet, and the spacy-transformers library provides wrappers for Hugging Face Transformers.

  • Industrial-level speed.

  • A built-in visualizer, displaCy.

  • Support for 60+ languages.

  • 46 state-of-the-art statistical models for 16 languages.

  • Space-efficient string data structures.

  • Efficient serialization.

  • Easy model packaging and usage.

  • Large community support.
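A few of the features listed above can be tried without a trained model. The following is a minimal sketch, assuming a blank English pipeline and a made-up input string, that touches on tokenization that preserves the original text, the hash-based string store, and serialization of a processed document.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Illustrative text with irregular spacing.
text = "Hello,   world!  spaCy keeps  whitespace."
doc = nlp(text)

# Tokenization preserves the input: joining every token with its trailing
# whitespace reproduces the original string exactly.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)

# Space-efficient strings: each string is stored once in the vocabulary's
# string store and referenced elsewhere by an integer hash.
hello_hash = nlp.vocab.strings["Hello"]
hello_text = nlp.vocab.strings[hello_hash]
print(hello_hash, hello_text)

# Serialization: a Doc can be written to bytes and restored later.
data = doc.to_bytes()
restored = Doc(nlp.vocab).from_bytes(data)
print(restored.text == text)
```

Features such as named entity recognition, POS tagging, and dependency parsing need a trained pipeline and are therefore not shown in this sketch.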

We took a quick look at spaCy as an NLP library and as a software package. We will see what spaCy offers in detail throughout the course.