BioBERT

Learn about the BioBERT domain-specific BERT model and how to pre-train and fine-tune it for NER and question-answering tasks.

As the name suggests, BioBERT is a biomedical domain-specific BERT model pre-trained on a large biomedical corpus. Because it learns biomedical domain-specific representations during pre-training, BioBERT outperforms the vanilla BERT model on biomedical texts. BioBERT uses the same architecture as the vanilla BERT model. After pre-training, we can fine-tune BioBERT for many biomedical domain-specific downstream tasks, such as biomedical question answering, biomedical named entity recognition, and more.
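
To make the fine-tuning step concrete, here is a minimal sketch of loading a pre-trained BioBERT checkpoint for a token-classification (NER) task with the Hugging Face Transformers library. The checkpoint name `dmis-lab/biobert-base-cased-v1.1`, the three-label disease tag set, and the example sentence are illustrative assumptions, not details from the original text.

```python
# Minimal sketch: load a BioBERT checkpoint for NER fine-tuning.
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed BioBERT checkpoint on the Hugging Face Hub.
model_name = "dmis-lab/biobert-base-cased-v1.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=3,  # e.g., O, B-Disease, I-Disease for a disease-NER task (assumed label set)
)

# Tokenize a biomedical sentence and run a forward pass.
text = "The patient was diagnosed with chronic lymphocytic leukemia."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)  # per-token label IDs
print(predictions)
```

From here, the model would be fine-tuned on a labeled NER dataset in the usual way; the sketch only shows how the pre-trained BioBERT weights plug into a standard token-classification head.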

Pre-training the BioBERT model

BioBERT is pre-trained using biomedical domain-specific texts. We use the biomedical datasets from the following two sources:

  • PubMed: This is a citation database. It includes more than 30 million citations for biomedical literature from life science journals, online books, and MEDLINE (the National Library of Medicine's index of biomedical journal literature).

  • PubMed Central (PMC): This is a free online repository that includes articles that have been published in biomedical and life sciences journals.

BioBERT is pre-trained using PubMed abstracts and PMC full-text articles. The PubMed corpus consists of about 4.5 billion words, and the PMC corpus consists of about 13.5 billion words. We know that the general BERT model is pre-trained on a general domain corpus made up of the English Wikipedia and Toronto BookCorpus datasets. So, instead of pre-training BioBERT from scratch, we first initialize its weights with the general BERT model and then pre-train it on the biomedical domain-specific corpora.
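
The following is a minimal sketch of this recipe using the Hugging Face Transformers library: initialize from the general-domain BERT weights and continue pre-training on biomedical text with the masked language modeling objective (the next-sentence-prediction objective of the original setup is omitted here for brevity). The file path `pubmed_abstracts.txt` and the training hyperparameters are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch: continue pre-training general BERT on a biomedical corpus.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# 1. Start from the general BERT weights (BioBERT keeps BERT's vocabulary).
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# 2. Load and tokenize the biomedical corpus (assumed file of PubMed/PMC text, one document per line).
corpus = load_dataset("text", data_files={"train": "pubmed_abstracts.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# 3. Continue pre-training with the masked language modeling objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="biobert-pretraining",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    learning_rate=1e-4,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

The key point the sketch illustrates is the initialization: the model starts from general BERT weights rather than random weights, and the biomedical corpora only continue the pre-training.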
