Counting Words

Learn about frequent substrings.

Identifying frequent words

Operating under the assumption that DNA is a language of its own, let’s borrow Legrand’s method and see if we can find any surprisingly frequent “words” within the ori of Vibrio cholerae. We’ve added reason to look for frequent words in the ori because for various biological processes, certain nucleotide strings appear surprisingly often in small regions of the genome. This is because certain proteins can only bind to DNA if a specific string of nucleotides is present, and if there are more occurrences of the string, then it’s more likely that binding will successfully occur. (It’s also less likely that a mutation will disrupt the binding process.)

For example, ACTAT is a surprisingly frequent substring of:

ACAACTATGCATACTATCGGGAACTATCCT

We use the term k-mer to refer to a string of length k and define Count(Text, Pattern) as the number of times that a k-mer Pattern appears as a substring of Text. Following the above example:

Count(ACAACTATGCATACTATCGGGAACTATCCT, ACTAT) = 3
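
The definition of Count(Text, Pattern) can be sketched in Python as a sliding window over Text. This is a minimal illustration, not a reference implementation: the function name pattern_count is our own, and we assume that overlapping occurrences are counted separately, which is what counting every starting position implies.

```python
def pattern_count(text: str, pattern: str) -> int:
    """Count the starting positions in text where pattern occurs.

    Assumption: overlapping occurrences each count once, because every
    window of length k = len(pattern) is checked independently.
    """
    k = len(pattern)
    if k == 0:
        # An empty pattern is not a meaningful k-mer; treat it as absent.
        return 0
    count = 0
    # Slide a window of length k across text, one position at a time.
    for i in range(len(text) - k + 1):
        if text[i:i + k] == pattern:
            count += 1
    return count


print(pattern_count("ACAACTATGCATACTATCGGGAACTATCCT", "ACTAT"))  # 3
```

Python's built-in str.count is not used here because it counts only non-overlapping occurrences, so it can return a smaller number than the sliding window does for patterns that overlap themselves.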
