Merging and Splitting Tokens

Explore how to merge multiword named entities into single tokens and how to split tokens to correct typos or customize tokenization in spaCy. Understand how to use doc.retokenize to adjust token spans while maintaining linguistic attributes and updating dependency trees, enhancing NLP accuracy and flexibility.

Overview

We extracted the named entities in the previous section, but what if we want to merge or split multiword named entities? And what if the tokenizer performed poorly on some exotic tokens and we want to split them by hand? In this lesson, we'll cover a very practical remedy for multiword expressions, multiword named entities, and typos.

doc.retokenize is the correct tool for merging and splitting spans. Let's see an example of retokenization by merging a multiword named entity:

Python 3.5
import spacy
nlp = spacy.load("en_core_web_md")
doc = nlp("She lived in New Hampshire.")
print(doc.ents)
print([(token.text, token.i) for token in doc])
print(len(doc))
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:5], attrs={"LEMMA": "new hampshire"})
print(doc.ents)
print([(token.text, token.i) for token in doc])

This is what we did in the preceding code:

  • Line 3: We created a doc object from the sample sentence.

  • Line 4: We printed its entities with doc.ents, and the result was New Hampshire, as expected.

  • Line ...
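
The preceding example handles merging; splitting goes through retokenizer.split. Below is a minimal sketch of the split direction (our own illustration with a made-up typo, NewHampshire, not part of the lesson's example): orths lists the surface forms of the new tokens, and heads tells spaCy where to attach each subtoken in the dependency tree.

Python 3.5
import spacy
nlp = spacy.load("en_core_web_md")
doc = nlp("She lived in NewHampshire.")
print([(token.text, token.i) for token in doc])
with doc.retokenize() as retokenizer:
    # "New" attaches to the new "Hampshire" subtoken (index 1),
    # while "Hampshire" takes over the original token's head, "in" (doc[2]).
    retokenizer.split(doc[3], ["New", "Hampshire"], heads=[(doc[3], 1), doc[2]])
print([(token.text, token.i) for token in doc])

After the split, the token indices shift just as they do after a merge, and the two new tokens sit in the dependency tree under the heads we supplied.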