Frequency Analysis—Theory vs Practice

A study of theory versus practice

Before leaving the topic of monoalphabetic ciphers, it’s worth using letter frequency analysis of the simple substitution cipher to illustrate a point that we’ll keep returning to throughout our investigation of cryptography—the differences between theory and practice.

Theory: Uniqueness of the plaintext

We’ve just observed that the simple substitution cipher can provide reasonable protection for very short plaintexts. As an example, consider plaintexts consisting of just three letters. With only three ciphertext characters to work with, an attacker faces so many three-letter plaintexts that could have produced a given ciphertext that it’s probably fair to describe the simple substitution cipher as unbreakable in this setting.

To illustrate this, if we are given a three-letter ciphertext MFM, then letter frequency analysis is useless, but we do know the first and the third plaintext letter must be the same. The plaintext could be BOB, or POP, or MUM, or NUN, and so on.
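The only structure an attacker can exploit in MFM is its repetition pattern. A minimal sketch of that check follows. The helper names `letter_pattern` and `candidate_plaintexts` and the toy word list are hypothetical, chosen for illustration only; a real attack would use a full dictionary and find far more candidates.

```python
def letter_pattern(word):
    """Return the repetition pattern of a word, e.g. 'MFM' -> (0, 1, 0)."""
    first_seen = {}
    pattern = []
    for letter in word.upper():
        # Number each distinct letter by the order in which it first appears.
        if letter not in first_seen:
            first_seen[letter] = len(first_seen)
        pattern.append(first_seen[letter])
    return tuple(pattern)


def candidate_plaintexts(ciphertext, words):
    """Keep only the words whose repetition pattern matches the ciphertext's."""
    target = letter_pattern(ciphertext)
    return [word for word in words if letter_pattern(word) == target]


# Hypothetical toy word list, not a real dictionary.
SAMPLE_WORDS = ["BOB", "POP", "MUM", "NUN", "CAT", "DOG", "EVE", "SEE", "DAD", "THE"]

candidates = candidate_plaintexts("MFM", SAMPLE_WORDS)
print(candidates)
```

Because a simple substitution maps distinct plaintext letters to distinct ciphertext letters, every word with the same pattern remains a valid candidate, and nothing in three letters of ciphertext lets the attacker choose between them.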

However, given a reasonable length of ciphertext, we know letter frequency analysis becomes very effective. So how much ciphertext does it take for the apparently hard problem of decrypting a short ciphertext to transform into the easy problem of decrypting a longer ciphertext?

Although there is no simple answer to this question, an important observation is that as the number of ciphertext letters increases, the number of meaningful plaintexts that could have resulted in that ciphertext tends to decrease. At some point, this number will fall to the point where only one plaintext is possible. The obvious question is: how many letters do we need before only one plaintext is possible?

For the simple substitution cipher applied to English plaintexts, this number is usually regarded as being around 28 ciphertext letters. We can reasonably assume the following:

  1. If we have significantly fewer than 28 ciphertext letters, then there are probably many meaningful plaintexts that could have resulted in the ciphertext.

  2. As we approach 28 ciphertext letters, the number of possible meaningful plaintexts that could have resulted in the ciphertext steadily decreases.

  3. Once we have 28 ciphertext letters, we can be fairly sure there is only one meaningful plaintext that could have resulted in the ciphertext.

  4. If we have hundreds of ciphertext letters, then it’s virtually a certainty that there is only one meaningful plaintext that results in the ciphertext.
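The text does not say where the figure of 28 comes from. One common back-of-the-envelope route to a number in this range is Shannon's unicity distance: divide the uncertainty about the key by the redundancy of the language. The sketch below assumes a redundancy of about 3.2 bits per letter for English, which is a commonly quoted estimate rather than a value given in this lesson, so treat the result as indicative only.

```python
import math

# A simple substitution key is a permutation of the 26-letter alphabet.
KEY_COUNT = math.factorial(26)

# Uncertainty about the key, in bits, if every key is equally likely.
key_entropy_bits = math.log2(KEY_COUNT)

# Assumption: English carries roughly 3.2 bits of redundancy per letter.
REDUNDANCY_BITS_PER_LETTER = 3.2

# Estimated number of ciphertext letters after which, on average,
# only one meaningful plaintext remains.
unicity_distance = key_entropy_bits / REDUNDANCY_BITS_PER_LETTER

print(f"key entropy: {key_entropy_bits:.1f} bits")
print(f"estimated unicity distance: {unicity_distance:.1f} letters")
```

With this assumed redundancy the estimate lands just under 28 letters. A different redundancy estimate would shift the result, which is one reason the figure is quoted as "around 28" rather than as an exact threshold.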

Practice: Statistical information

Our previous discussion was all about what is possible in theory. It doesn’t necessarily tell us what can happen in practice. If we have 28 ciphertext characters generated by a simple substitution cipher with the underlying plaintext language of English, then there’s probably only one possible plaintext that could have resulted in this ciphertext. But can it be found in practice?

The answer is, frustratingly, probably not. The effectiveness of letter frequency analysis increases with the amount of ciphertext available, but 28 letters generally do not provide enough statistical information. In practice, some people suggest that for English plaintexts, at least 200 ciphertext letters are needed to be fairly confident that the letter frequency statistics are reliable enough for an effective letter frequency analysis, although it will often work with fewer letters than this.
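One way to see why 28 letters give unreliable statistics is to estimate how much a single letter count fluctuates. The sketch below makes two simplifying assumptions that are not from this lesson: the letter E occurs with probability about 0.127 in English, and letters occur independently, so that the count follows a binomial distribution. Real text does not satisfy independence, so this is a rough illustration only. The function name `count_spread` is hypothetical.

```python
import math

# Assumption: approximate relative frequency of the letter E in English text.
P_E = 0.127


def count_spread(n, p=P_E):
    """Return (expected count, standard deviation, relative spread)
    for a letter of probability p in n independent letters."""
    expected = n * p
    std_dev = math.sqrt(n * p * (1 - p))
    return expected, std_dev, std_dev / expected


for n in (28, 200):
    expected, std_dev, relative = count_spread(n)
    print(f"{n:>3} letters: expect {expected:.1f} E's, "
          f"std dev {std_dev:.1f}, relative spread {relative:.0%}")
```

Under these assumptions, the count of even the most common letter varies by about half of its expected value at 28 letters, but by less than a fifth at 200 letters. Rarer letters fluctuate even more in relative terms, so ranking ciphertext letters by frequency is unreliable for short texts.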

The gap between theory and practice

There is a significant ‘gap’ between theory and practice. If we have between 28 and 200 ciphertext characters, then there will almost certainly be only one meaningful plaintext that results in the target ciphertext, but it will probably be difficult to determine. The situation we have just discussed is summarized in the table below:
