Previously, we saw how to make the most of spaCy's pre-trained statistical models (including the POS tagger, NER, and dependency parser) in our applications. We will now see how to customize these models for our own domain and data.

spaCy models are very successful for general NLP purposes, such as understanding a sentence's syntax, splitting a paragraph into sentences, and extracting some entities. However, we sometimes work on very specific domains whose text spaCy's models did not see during training.

For example, Twitter text contains many irregular tokens, such as hashtags, emoticons, and mentions. Tweets are also often just phrases rather than full sentences. It is entirely reasonable, then, that spaCy's POS tagger performs poorly on such text, because it was trained on full, grammatically correct English sentences.

Another example is the medical domain, which contains many entities such as drug, disease, and chemical compound names. We cannot expect spaCy's NER model to recognize these entities because it has no disease or drug entity labels; the model knows nothing about the medical domain.
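A quick way to see this gap is to compare the labels a model can predict with the labels your domain needs. The following is a minimal sketch, not a definitive recipe: the label tuple is hard-coded from the OntoNotes 5 scheme used by spaCy's general English pipelines so that the snippet runs without downloading a model, while the medical label names and the missing_labels helper are hypothetical and exist only for illustration.

```python
# Entity labels of spaCy's general English pipelines (OntoNotes 5 scheme).
# Hard-coded here as an assumption so the sketch runs without a model.
# With spaCy and a model installed, the same information is exposed by:
#   nlp = spacy.load("en_core_web_sm")
#   nlp.get_pipe("ner").labels
GENERAL_NER_LABELS = (
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE", "LAW", "LOC",
    "MONEY", "NORP", "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT",
    "QUANTITY", "TIME", "WORK_OF_ART",
)


def missing_labels(model_labels, domain_labels):
    """Return the domain labels the model cannot predict, in input order."""
    known = set(model_labels)
    return [label for label in domain_labels if label not in known]


# Hypothetical label set for a medical project.
medical_labels = ["DRUG", "DISEASE", "CHEMICAL", "PERSON"]
print(missing_labels(GENERAL_NER_LABELS, medical_labels))
# prints ['DRUG', 'DISEASE', 'CHEMICAL']
```

Any label that shows up in the result is one the general-purpose model can never output, however well it performs on everything else.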

Training custom models requires time and effort, so before starting you should decide whether training is really necessary. To find out, ask yourself the following questions:

  • Do spaCy models perform well enough on your data?

  • Does your domain include many labels that are absent in spaCy models?

  • Is there already a pre-trained model or application on GitHub or elsewhere? (We wouldn't want to reinvent the wheel.)

Let's discuss these questions in detail.

Do spaCy models perform well enough on our data?
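One way to answer this question is to hand-label a small sample of your own data and measure how often the model agrees with it. The sketch below is only an illustration under stated assumptions: tag_accuracy is a hypothetical helper, the sample tokens and gold tags are invented, and the predicted list stands in for tags you would read from a processed spaCy Doc (for example, [token.pos_ for token in doc]).

```python
def tag_accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one token to score")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical tags for the four tokens of "omg new phone #blessed".
gold = ["INTJ", "ADJ", "NOUN", "X"]
# Stand-in for the model's output, e.g. [token.pos_ for token in doc].
predicted = ["PROPN", "ADJ", "NOUN", "NOUN"]

THRESHOLD = 0.75  # the cut-off used as a rule of thumb in this section
score = tag_accuracy(predicted, gold)
print(f"accuracy={score:.2f}, good enough: {score > THRESHOLD}")
# prints accuracy=0.50, good enough: False
```

A real evaluation would use far more than four tokens, but the comparison against the threshold works the same way.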

If the model performs well enough (accuracy above 0.75), we can customize its output by means of another spaCy component instead of training a new model. For example, let's say we work on the navigation domain and we have utterances such as the following:
