Training a Pipeline Component From Scratch
Explore how to create and train a new named entity recognition (NER) pipeline component in spaCy tailored for the medical domain. This lesson teaches data preparation, annotation, model initialization, and training on both small and real-world medical datasets, giving you hands-on experience in building a custom NER model to accurately recognize medical entities like diseases and drugs.
Previously, we saw how to update the existing NER component with our own data. In this lesson, we will create a brand-new NER component for the medical domain.
Let's start with a small dataset to understand the training procedure. Then we'll experiment with a real medical NLP dataset. The following sentences belong to the medical domain and include medical entities such as drug and disease names:
Methylphenidate/DRUG is effectively used in treating children with epilepsy/DISEASE and ADHD/DISEASE.
Patients were followed up for 6 months.
Antichlamydial/DRUG antibiotics/DRUG may be useful for curing coronary-artery/DISEASE disease/DISEASE.
The following code shows how to train an NER component from scratch. As we mentioned before, it's better to create our own NER component than to update spaCy's default NER model, because the default component does not recognize medical entities at all. Let's walk through the code step by step and compare it with the code from the previous lesson:
In the first three lines, we made the necessary imports. We imported spacy and spacy.training.Example. We also imported random to shuffle our dataset:
import random
import spacy
from spacy.training import Example
We defined our training set of three examples. For each example, we included a sentence and its annotated entities:
# Each item is (text, {"entities": [(start_char, end_char, label), ...]})
train_set = [
    ("Methylphenidate is effectively used in treating children with epilepsy and ADHD.",
     {"entities": [(0, 15, "DRUG"), (62, 70, "DIS"), (75, 79, "DIS")]}),
    ("Patients were followed up for 6 months.",
     {"entities": []}),
    ("Antichlamydial antibiotics may be useful for curing coronary-artery disease.",
     {"entities": [(0, 26, "DRUG"), (52, 75, "DIS")]}),
]
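Character offsets are easy to get wrong when annotating by hand, so it can help to look at the text each span actually covers before training. The following is a minimal sketch in plain Python, applied to the third example; the helper name show_spans is hypothetical and not part of spaCy.

```python
# Hypothetical helper (not part of spaCy): slice each annotated span out of
# its sentence so the character offsets can be checked by eye.
def show_spans(text, annotations):
    return [(text[start:end], label) for start, end, label in annotations["entities"]]

text = "Antichlamydial antibiotics may be useful for curing coronary-artery disease."
annotations = {"entities": [(0, 26, "DRUG"), (52, 75, "DIS")]}

for span_text, label in show_spans(text, annotations):
    print(f"{label}: {span_text!r}")
```

A span that starts or ends mid-word, or that includes a stray space, shows up immediately in this output.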
We also listed the set of entity labels we want to recognize, DIS for disease names and DRUG for drug names:
entities = ["DIS", "DRUG"]
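Because the labels are typed by hand in two places (the annotations and the label list), a quick consistency check can catch a label that is used in the annotations but never declared. This is only a sketch in plain Python: undeclared_labels is a hypothetical helper, and the two-sentence sample is made up for illustration, with one label deliberately written differently.

```python
# Hypothetical helper (not part of spaCy): return the labels used in the
# annotations that are missing from the declared label list.
def undeclared_labels(train_set, entities):
    used = {label for _, ann in train_set for _, _, label in ann["entities"]}
    return sorted(used - set(entities))

# Illustrative sample only: the last label does not match the declared list.
sample = [
    ("Aspirin relieves headache.", {"entities": [(0, 7, "DRUG"), (17, 25, "DIS")]}),
    ("Ibuprofen reduces fever.", {"entities": [(0, 9, "DRUG"), (18, 23, "DISEASE")]}),
]
entities = ["DIS", "DRUG"]

print(undeclared_labels(sample, entities))
```

An empty list means every label used in the annotations is also declared.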
We created a blank model. This differs from the previous section, where we used spaCy's pre-trained English language pipeline. Here, we start from an empty English pipeline:
nlp = spacy.blank("en")