Alternative Approaches

Learn about alternative approaches to extracting information from documents and why they matter.

Bag of words

Tokenizing, or breaking a document into units, is simplest to understand when the tokens are just the words of the document. Counting those words without regard to their order is often called a “bag of words.” However, this method has problems: because word order is discarded, context is lost. It’s a simple way of looking at a document, but there are other, more sophisticated strategies.
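
As a minimal sketch of the idea, the following Python snippet counts words with the standard library's `Counter`. The helper name `bag_of_words` and the crude regular-expression tokenizer are assumptions made for illustration, not part of the lesson. It shows how two sentences with different meanings can produce identical bags.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count how often each lowercase word appears, ignoring word order."""
    # Crude tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


a = bag_of_words("The monster feared the man.")
b = bag_of_words("The man feared the monster.")

print(a["the"])  # 2
print(a == b)    # True: different sentences, identical bags
```

The second comparison is the lack of context in action: who feared whom is not recoverable from the counts alone.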

More sophisticated approaches

  • N-grams: Words have meanings that depend on context. For example, the word “bank” can mean different things: “Deposit money in the bank” or “A bird is nesting on the bank of the river.” For another example, is Mary Shelley discussing Victor Frankenstein or Frankenstein’s monster? Phrases or combinations of words are important, so an n-gram approach treats each run of n adjacent words as a single token.

  • Stemming: Words can indicate the same thing but carry different endings that cause them to be counted as different words. For example, “Complete,” “Completely,” and “Completing” can express similar ideas but will appear separately when tokenized as single words. Frankenstein’s monster can be “scaring,” “scary,” “scared,” “scariest,” or “scarier.” Stemming strips such endings so that the variants share one token.

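  • Lemmatization: Words can indicate the same thing but look completely different. Lemmatization maps each word to its dictionary form (its lemma), so “is,” “was,” and “were” all count as “be,” and “better” counts as “good.” A related step groups synonyms: “sphere,” “ball,” “globe,” and “orb” can all mean the same type of object but are entirely different words. In some cases, they should be counted as the same concept. Is Frankenstein’s monster “tall,” “huge,” or “massive”?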

  • Parts of speech: Words have meaning based on their grammatical role in a sentence. Consider the word “watch.” If we were to say, “He owns a watch,” then we’d be using “watch” as a noun. If we were to say, “She is going to watch a movie,” then we’d be using “watch” as a verb. It’s the same word but with different meanings, depending on how it is used.

  • TF-IDF (term frequency-inverse document frequency): Some words gain importance because of their scarcity within a collection of documents. The word “and” appears many times in almost every document and so has little value for identifying a particular document. The word “galvanism” rarely appears and so can be used as an indicator of its source document.
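
To make the n-gram idea concrete, here is a small sketch in plain Python. The function name `ngrams` and the simple regular-expression tokenizer are illustrative assumptions. It pairs each word with its neighbor, so “bank” is no longer counted in isolation.

```python
import re
from collections import Counter


def ngrams(text: str, n: int = 2) -> list[tuple[str, ...]]:
    """Return every run of n adjacent lowercase word tokens."""
    # Crude tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "A bird is nesting on the bank of the river"
bigrams = ngrams(sentence, 2)

print(bigrams[:3])  # [('a', 'bird'), ('bird', 'is'), ('is', 'nesting')]
print(Counter(bigrams)[("the", "bank")])  # 1
```

With n set to 3, the token ("bank", "of", "the") would separate the riverbank sense from a phrase such as “in the bank,” at the cost of a much larger vocabulary of tokens.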

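The tf-idf weighting can be sketched in a few lines as well. This version multiplies a term's relative frequency within a document by the logarithm of (number of documents / number of documents containing the term). That is only one of several common variants, and libraries often apply smoothing that changes the exact numbers. The function name `tf_idf` and the three toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Score each term in each tokenized document with tf * idf."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus (hypothetical): each document is already split into words.
docs = [
    "victor and the monster and galvanism".split(),
    "the bank and the river".split(),
    "she and he watch a movie".split(),
]
scores = tf_idf(docs)

print(scores[0]["and"])            # 0.0, since "and" is in every document
print(scores[0]["galvanism"] > 0)  # True, since only one document has it
```

Even though “and” occurs twice in the first document, its score is zero because it appears in all three documents, while the rarer “galvanism” receives a positive weight.
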