Document-Term Matrix

Learn how the document-term matrix serves as a commonly accepted data structure for natural language processing.

A document-term matrix (DTM) is fairly simple to understand. It is a matrix with rows and columns, arranged as follows:

  • Each row represents a document. In our case, there will be one row for Frankenstein and a second row for The Last Man.

  • Each column represents a term. In this case, terms are words, although they can be sentences, lines, paragraphs, or n-grams (more on these in a later lesson).

  • Each cell in the matrix contains the frequency of the term in the document.
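
The three points above can be sketched in a few lines of Python. This is a minimal illustration, not the lesson's own code: the two short strings are made-up stand-ins for the novels, the helper name build_dtm is hypothetical, and splitting on whitespace is an assumed, deliberately naive tokenizer.

```python
from collections import Counter

# Hypothetical toy documents standing in for the two novels (not the real texts).
docs = {
    "Frankenstein": "the creature saw the fire",
    "The Last Man": "the plague took the last man",
}

def build_dtm(docs):
    """Return (doc_names, vocabulary, matrix), where matrix[i][j] is the
    number of times vocabulary[j] occurs in document doc_names[i]."""
    # Count word frequencies per document (naive whitespace tokenization).
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    # One column per distinct term across all documents, in sorted order.
    vocabulary = sorted(set().union(*counts.values()))
    doc_names = list(docs)
    # One row per document; a term missing from a document counts as 0.
    matrix = [[counts[name][term] for term in vocabulary] for name in doc_names]
    return doc_names, vocabulary, matrix

doc_names, vocabulary, dtm = build_dtm(docs)

# Print the matrix with the terms as a header row.
print("terms:", vocabulary)
for name, row in zip(doc_names, dtm):
    print(name, row)
```

Each printed row corresponds to one document, and each position in the row lines up with one term in the vocabulary. Most real corpora produce many zero cells, because any single document uses only a small share of the full vocabulary.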

A term-document matrix (TDM), by contrast, is the transpose of a document-term matrix. It holds the same frequencies in rows and columns, but the two swap roles.

  • Each row in a TDM represents a term. For instance, if we have a set of terms like love, hate, joy, and so forth, each term will have its own row in the matrix.

  • Each column in the TDM represents a document. For example, if we have documents named Document 1, Document 2, and Document 3, these will be represented as separate columns in the matrix.

  • Each cell in the term-document matrix contains the frequency of the term in the corresponding document.

A TDM is often used to examine how specific terms occur across documents, giving researchers insight into how those terms are distributed throughout the collection.
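
The relationship between the two layouts can be sketched as a simple transpose. The counts below are invented for illustration, and the names transpose and term_distribution are hypothetical helpers rather than part of any library.

```python
# Hypothetical DTM: rows are documents, columns are terms (made-up counts).
doc_names = ["Document 1", "Document 2", "Document 3"]
terms = ["love", "hate", "joy"]
dtm = [
    [3, 0, 1],  # Document 1
    [0, 2, 0],  # Document 2
    [1, 1, 4],  # Document 3
]

def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]

# TDM: rows are terms, columns are documents.
tdm = transpose(dtm)

# Each TDM row gives one term's frequency across every document.
term_distribution = {
    term: dict(zip(doc_names, row)) for term, row in zip(terms, tdm)
}

for term, distribution in term_distribution.items():
    print(term, distribution)
```

Reading along a TDM row answers "where does this term occur?", while reading along a DTM row answers "which terms does this document contain?". Transposing the TDM again recovers the original DTM.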

Understanding the DTM

The easiest way to understand a document-term matrix is to look at one. Here’s the code to create a DTM from the corpus we’ve been working with.
