
Customizing the Tokenizer and Sentence Segmentation

Explore how to customize spaCy's tokenizer by adding special case rules for domain-specific terms and understand the complexity of sentence segmentation. Learn to debug tokenization processes and use spaCy's dependency parser for accurate sentence boundary detection, preparing you for effective token-level text processing.

When we work with a specific domain, such as medicine, insurance, or finance, we often come across words, abbreviations, and entities that the default tokenizer doesn't handle well, so they need custom tokenization rules. Here's how to add a special case rule to an existing Tokenizer instance:

Python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_md")

# With the default rules, "lemme" is kept as a single token
doc = nlp("lemme that")
print([w.text for w in doc])  # ['lemme', 'that']

# Define a special case that splits "lemme" into "lem" + "me"
special_case = [{ORTH: "lem"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("lemme", special_case)
print([w.text for w in nlp("lemme that")])  # ['lem', 'me', 'that']

Here is what we did:

  • We again started by importing spacy.

  • Then, we imported the ORTH symbol, which stands for orthography, that is, the verbatim text of the token.

  • We continued ...
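
The same mechanism carries over to the domain text mentioned earlier. Here is a minimal sketch; the dosage abbreviation b.i.d. and the sample sentence are our own illustration, not part of the walkthrough above:

Python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_md")

# Hypothetical domain rule: keep the medical dosage abbreviation
# "b.i.d." ("twice a day") as one token, so the suffix rules don't
# peel off its trailing period.
nlp.tokenizer.add_special_case("b.i.d.", [{ORTH: "b.i.d."}])

print([t.text for t in nlp("Take 10 mg b.i.d. with meals.")])

Note that the ORTH values of a special case must concatenate back to the original string, and that special cases take precedence over the punctuation-splitting rules, which is why the trailing period survives here.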