Training Data Generation

Let's generate training data for the entity linking problem.

There are two approaches you can adopt to gather training data for the entity linking problem.

  1. Open-source datasets
  2. Manual labeling

You can use one or both, depending on the particular task for which you have to perform entity linking.

Open-source datasets

If the task is not extremely domain-specific and does not require very specific tags, you can use open-source datasets as training data. For example, if you were asked to perform entity linking for a simple chatbot, you could utilize the general-purpose, open-source dataset CoNLL-2003 for named-entity recognition.

CoNLL-2003 is built on the Reuters Corpus, which contains 10,788 news documents totalling 1.3 million words. It provides train and test files for both English and German and follows the IOB tagging scheme.
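The CoNLL-2003 files store one token per line in four whitespace-separated columns (token, part-of-speech tag, syntactic chunk tag, NER tag), with blank lines separating sentences and `-DOCSTART-` lines separating documents. The sketch below parses that layout; the sample lines are illustrative, not copied from the dataset.

```python
def parse_conll(lines):
    """Parse CoNLL-2003-style lines into sentences of (token, ner_tag) pairs.

    Each non-blank line is expected to hold four whitespace-separated
    columns: token, POS tag, chunk tag, NER tag. Blank lines separate
    sentences; -DOCSTART- lines separate documents.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        token, _pos, _chunk, ner = line.split()
        current.append((token, ner))
    if current:
        sentences.append(current)
    return sentences

# Illustrative four-column lines (not actual dataset content)
sample = [
    "U.N. NNP I-NP I-ORG",
    "official NN I-NP O",
    "Ekeus NNP I-NP I-PER",
    "heads VBZ I-VP O",
    "for IN I-PP O",
    "Baghdad NNP I-NP I-LOC",
    ". . O O",
    "",
]
print(parse_conll(sample))
```

In a real pipeline you would read the train or test file line by line and feed it to a parser like this, keeping only the token and NER columns for an entity-recognition model.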

📝 IOB tagging scheme
I - An inner token of a multi-token entity
O - A non-entity token
B - The first token of a multi-token entity. The B-tag is used only when a tag is followed by a tag of the same type without “O” tokens between them. For example, if the text contains two consecutive locations (type LOC) that are not separated by a non-entity token, the first token of the second location is tagged B-LOC to mark the boundary between the two entities.
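Decoding these tags back into entity spans follows directly from the rules above: a new entity starts at a B-tag, or at an I-tag whose type differs from the previous token's entity. A minimal sketch (the tag sequences used here are made up for illustration):

```python
def iob_to_spans(tags):
    """Group a sequence of IOB tags into (entity_type, start, end) spans.

    `end` is exclusive. A new span starts at a B- tag, or at an I- tag
    that does not continue the previous token's entity type.
    """
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        # Close the open span on "O", on a B- tag, or on a type change.
        if prefix == "O" or prefix == "B" or label != etype:
            if etype is not None:
                spans.append((etype, start, i))
                start, etype = None, None
        # Open a new span for any entity tag when none is open.
        if prefix in ("B", "I") and etype is None:
            start, etype = i, label
    if etype is not None:
        spans.append((etype, start, len(tags)))
    return spans

# Two adjacent locations: the B-LOC marks the entity boundary.
print(iob_to_spans(["I-LOC", "B-LOC"]))
```

Without the B-tag, the two adjacent locations in the last example would be indistinguishable from a single two-token location.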

The following are some snippets from the train and test files of the CoNLL-2003 dataset.
