Sentence segmentation in different languages using spaCy
Sentence segmentation is the process of dividing a chunk of text or a paragraph into individual sentences. This task requires us to identify the boundaries that separate one sentence from another. It is a fundamental task in natural language processing (NLP) and is often an essential preprocessing step for NLP applications as it makes parsing and analysis easier.
Sentence segmentation in spaCy
The spaCy library makes sentence segmentation straightforward. We can iterate over the sents property of the Doc class, which yields one span per sentence. In its trained pipelines, spaCy sets sentence boundaries using the dependency parser by default, which tends to be more robust than splitting on punctuation alone. spaCy also allows us to perform sentence segmentation in different languages by loading the corresponding language models.
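For comparison with the parser-based approach, here is a minimal sketch of spaCy's rule-based sentencizer component, which splits on sentence-final punctuation and needs no trained model. It assumes spaCy v3 (where add_pipe takes the component name as a string); the English sample text is made up for illustration.

```python
import spacy

# Blank English pipeline: tokenizer only, so no model download is required.
nlp = spacy.blank("en")
# Rule-based component that marks sentence starts after ., ! and ? (among others).
nlp.add_pipe("sentencizer")

text = "This is the first sentence. Is this the second one? Yes, it is."
doc = nlp(text)

# Each item in doc.sents is a Span; .text gives the sentence string.
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```

The sentencizer is fast but relies on punctuation, so it is a reasonable fallback rather than a replacement for the parser when text is poorly punctuated.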
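For our example, we will be using the Spanish and French language models. Each model must be installed before it can be loaded (for example, with python -m spacy download es_core_news_sm). Let's start with the Spanish example.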
import spacy
nlp = spacy.load("es_core_news_sm")

text = "¿Querías saber cuánto durará esto? Hasta la muerte"
doc = nlp(text)

for sent in doc.sents:
    print(sent.text)
Let's go over the code:
Line 1: We import the spacy library.
Line 2: We load the Spanish language model.
Lines 4–5: We store the Spanish text in a variable called text and pass it to nlp to create a doc object.
Lines 7–8: We use the sents property of the doc object to loop through the sentences and print each one.
Now let's look at the French example.
import spacy
nlp = spacy.load("fr_core_news_sm")

text = "Frère Jacques Frère Jacques Dormez vous? Dormez vous? Sonnez les matines Sonnez les matines Ding ding dong Ding ding dong"
doc = nlp(text)

for sent in doc.sents:
    print(sent.text)
The code is largely the same except for two differences:
Line 2: We load a French language model.
Line 4: We provide the French text that we want to segment.
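Since the two examples differ only in the model name and the input text, the pattern can be wrapped in a small helper. The sketch below is one possible way to do this, not part of spaCy itself: MODEL_NAMES, load_pipeline, and split_sentences are hypothetical names, the sample sentence is a punctuated variation on the lyric above, and falling back to a blank pipeline with a sentencizer when the model is not installed is an assumption made so the code runs without downloads (spaCy v3 assumed).

```python
import spacy

# Hypothetical mapping from language code to the models used in this article.
MODEL_NAMES = {"es": "es_core_news_sm", "fr": "fr_core_news_sm"}


def load_pipeline(lang_code):
    """Load the trained model if installed, else a blank pipeline with a sentencizer."""
    try:
        return spacy.load(MODEL_NAMES[lang_code])
    except OSError:
        # spacy.load raises OSError when the model package cannot be found.
        nlp = spacy.blank(lang_code)
        nlp.add_pipe("sentencizer")
        return nlp


def split_sentences(text, lang_code):
    """Return the sentences of text as a list of strings."""
    doc = load_pipeline(lang_code)(text)
    return [sent.text for sent in doc.sents]


french_text = "Frère Jacques, dormez-vous? Sonnez les matines."
french_sentences = split_sentences(french_text, "fr")
for sentence in french_sentences:
    print(sentence)
```

Switching languages then only requires a different language code, for example split_sentences(text, "es") for Spanish. Note that the trained model and the fallback sentencizer may place boundaries differently on the same text.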