Detour: Probabilities of Patterns in a String

Explore how probabilities of patterns in a string are calculated, how changing the pattern changes the probability, and what the overlapping words paradox is.

We mentioned that the probability that some 9-mer appears 3 or more times in a random DNA string of length 500 is approximately 1/1300. This figure does not come out of thin air. We can model a DNA strand as a random string by choosing the nucleotide at each position independently, with each of the four nucleotides having probability 1/4. The same construction generalizes to an arbitrary alphabet with A symbols, where each symbol is chosen with probability 1/A.
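The construction above can be sketched in a few lines of Python. This is only an illustration: the function name `random_string` and its parameters are our own, and it assumes that symbols at different positions are drawn independently.

```python
import random

def random_string(length, alphabet="ACGT", rng=None):
    """Return a random string of the given length.

    Each position is filled independently, and every symbol of the
    alphabet is equally likely (probability 1/A for A symbols).
    """
    rng = rng or random.Random()
    return "".join(rng.choice(alphabet) for _ in range(length))

# A random DNA string of length 500 (A = 4, each nucleotide has probability 1/4).
dna = random_string(500)

# The same construction over a binary alphabet (A = 2).
bits = random_string(10, alphabet="01")
```

Passing a seeded `random.Random` instance as `rng` makes the output reproducible, which is convenient when experimenting.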

Exercise Break: What is the probability that two randomly generated strings of length n in an A-letter alphabet are identical?
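One way to check an answer to this exercise for small cases is brute force: list every ordered pair of strings and count the pairs that match. The helper below, `identical_probability`, is a hypothetical name of our own, and the approach is only practical when A and n are small.

```python
from fractions import Fraction
from itertools import product

def identical_probability(alphabet_size, n):
    """Exact probability that two independent, uniformly random strings
    of length n over an alphabet of the given size are identical,
    computed by enumerating every ordered pair of strings."""
    strings = list(product(range(alphabet_size), repeat=n))
    matches = sum(1 for s in strings for t in strings if s == t)
    return Fraction(matches, len(strings) ** 2)

# Compare this against your own formula for A = 2, n = 3.
print(identical_probability(2, 3))
```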

Now consider a simple question: what is the probability that a specific k-mer Pattern appears at least once as a substring of a random string of length N? For example, say that we want to find the probability that “01” appears in a random binary string (A = 2) of length 4. There are 2^4 = 16 possible such strings, from 0000 to 1111, and each is equally likely.
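For an example this small, the question can be answered by listing every binary string of length 4 and counting those that contain the pattern. The sketch below does exactly that. The name `pattern_probability` is our own, and exhaustive enumeration is only feasible for short strings, since the number of strings grows as A^N.

```python
from fractions import Fraction
from itertools import product

def pattern_probability(pattern, length, alphabet="01"):
    """Exact probability that `pattern` occurs at least once in a
    uniformly random string of the given length, by enumeration."""
    total = 0
    hits = 0
    for symbols in product(alphabet, repeat=length):
        text = "".join(symbols)
        total += 1
        if pattern in text:
            hits += 1
    return Fraction(hits, total)

# List all 16 binary strings of length 4, marking those that contain "01".
for symbols in product("01", repeat=4):
    text = "".join(symbols)
    print(text, "<- contains 01" if "01" in text else "")

print(pattern_probability("01", 4))  # 11/16
```

Only the five strings 0000, 1000, 1100, 1110, and 1111 avoid “01”, which leaves 11 of the 16. Running the same function with the pattern “11” gives 1/2, so two patterns of the same length need not be equally likely to appear.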
