Error Model

Understand the importance of an error model in a spellchecker.

We'll cover the following

What is an error model?

In a spellchecker, an error model is a statistical model that estimates the probability of different types of errors that may occur when a user misspells a word. The error model helps the spellchecker generate a list of candidate corrections that are likely to represent the intended word, despite the error.

The specific approach used to develop an error model can vary, but it generally involves analyzing a corpus of text to identify common patterns of errors. For example, an error model might consider the probability that a user will:

  • Misspell a word by adding, deleting, or substituting a letter.

  • Confuse homophones (words that sound the same but have different spellings).

  • Mistype adjacent keys on a keyboard.

  • Misunderstand word boundaries or capitalization.

By analyzing a large corpus of text and identifying patterns of errors, an error model can estimate the likelihood of different types of errors and use that information to generate a list of candidate corrections.

Error models, candidate models, and selection mechanisms (covered in the next section) are very intertwined, and therefore, the error model we choose is likely going to be custom-tailored to our candidate model. For example, if we choose a candidate model that utilizes n-grams, an error model may incorporate different contextualizations of the misspelled word. For example, a trigram that has a pronoun-verb-noun pattern may be more likely than a pronoun-verb-adverb pattern, so select corrections that match the more likely pattern.

In our error model, we will use a very simple model. We will say that all known words of edit distance 1 are infinitely more probable than known words of edit distance 2, and infinitely less probable than a known word of edit distance 0. So we start with a non-empty list of candidates in the order of the following priority:

  1. The original word, if it is found in the dictionary.

  2. If it exists, the list of known words at edit distance one.

  3. If it exists, the list of known words at edit distance two.

  4. The original word, even though it is not known.

Implementation

Below we will implement our error model, specifically a single function, candidates(word) in which we will utilize our existing solution along with this new function that should be implemented according to the above pseudocode. You will need the function known(words) which was implemented during the candidate model.

Get hands-on with 1400+ tech skills courses.