Training Data Generation
There are two approaches you can adopt to gather training data for the entity linking problem.
- Open-source datasets
- Manual labeling
You can use one or both, depending on the task for which you have to perform entity linking.
If the task is not extremely domain-specific and does not require very specific tags, you can use open-source datasets as training data. For example, if you were asked to perform entity linking for a simple chatbot, you could use the general-purpose, open-source CoNLL-2003 dataset for named-entity recognition.
CoNLL-2003 is built on news text; its English portion is drawn from the Reuters Corpus, a collection of Reuters news stories. It contains training, development, and test files for English and German and follows the IOB tagging scheme.
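To make the file layout concrete, here is a minimal sketch of a reader. It assumes the commonly distributed plain-text layout: one token per line, whitespace-separated columns with the token first and the entity tag last, blank lines between sentences, and `-DOCSTART-` lines marking document boundaries. The function name `read_conll` and the sample lines are illustrative, not part of any official tooling.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, entity tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences of (token, entity tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in lines:
        line = line.strip()
        # A blank line or a document marker closes the sentence in progress.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Assumed layout: first column is the token, last is the entity tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample written in the assumed four-column layout.
sample = """\
-DOCSTART- -X- -X- O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
"""

sentences = read_conll(sample.splitlines())
for sentence in sentences:
    print(sentence)
```

Keeping only the first and last columns is a deliberate simplification: the part-of-speech and chunk columns are ignored because only the entity tags matter for this task.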
📝 IOB tagging scheme
I-: A token inside an entity
O: A non-entity token
B-: The first token of an entity that immediately follows another entity of the same type. The B- tag is used only when two entities of the same type are adjacent, with no “O” tokens between them. For example, if the text has two consecutive locations (type LOC) that are not separated by a non-entity token, the first token of the second location is tagged B-LOC.
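The rule for the B- tag is easier to see in code. The sketch below converts entity spans into tags under the scheme described above: B- is used only when an entity starts exactly where the previous entity of the same type ended, so every other entity token, including the first token of an isolated entity, gets I-. The helper `spans_to_iob`, the span format, and the example sentence are hypothetical and chosen only for illustration; spans are assumed not to overlap.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def spans_to_iob(num_tokens: int, spans: List[Span]) -> List[str]:
    """Tag tokens so that B- appears only between adjacent same-type entities."""
    tags = ["O"] * num_tokens
    prev_end: Optional[int] = None
    prev_type: Optional[str] = None
    for start, end, etype in sorted(spans):
        # B- is needed only when this entity touches the previous one
        # and both have the same type; otherwise I- is unambiguous.
        adjacent_same_type = start == prev_end and etype == prev_type
        tags[start] = ("B-" if adjacent_same_type else "I-") + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
        prev_end, prev_type = end, etype
    return tags


# Two consecutive locations with no non-entity token between them.
tokens = ["Flights", "to", "New", "York", "Boston", "resume"]
spans = [(2, 4, "LOC"), (4, 5, "LOC")]
for token, tag in zip(tokens, spans_to_iob(len(tokens), spans)):
    print(f"{token}\t{tag}")
```

Here "New York" is tagged I-LOC I-LOC and "Boston" is tagged B-LOC, which is what lets a reader of the tags tell two adjacent locations apart from one three-token location.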
The following are some snippets from the train and test files of the CoNLL-2003 dataset.