Related Tags

dna

# How to convert DNA patterns to numbers and vice-versa Ibrahim Khan

In this shot, we’ll take a look at how we can convert patterns to numbers, and vice-versa.

### Pattern to number

The implementation we will be looking at is based on the observation that a lexicographically ordered list remains lexicographically ordered if we remove the last symbol of all the elements in that list. Applying this observation to DNA strings, we notice that every (k-1)-mer is repeated 4 times.

If we have a lexicographically ordered DNA of 3-mers, removing the last symbol will leave a lexicographic order of 2-mers, where they are repeated 4 times.

Let’s take a look at an example. We have symbols A, C, G, and T that correspond to the numbers 0, 1, 2, and 3, respectively. If we convert ACT to numbers, we would get 7. Let’s take a look at the code for this function.

### Code

## Change the symbol to the corresponding number
def symbol_to_number(symbol):
if (symbol == 'A'):
return 0
elif (symbol == 'C'):
return 1
elif (symbol == 'G'):
return 2
elif (symbol == 'T'):
return 3

def pattern_to_number(pattern):
## return if pattern is empty
if not pattern:
return 0

## Convert last symbol to number
last_symbol = symbol_to_number(pattern[len(pattern) - 1])
## recursively convert the remaining pattern to number
prefix_value = pattern_to_number(pattern[:-1])
## return the number corresponding to the pattern
return (4*prefix_value + last_symbol)

print(pattern_to_number("ACT"))

### Explanation

• In line 2, we define the symbol_to_number function, which converts the symbol to the corresponding number as defined in the requirements.
• In line 12, we define the pattern_to_number function, which takes pattern as input.
• In lines 14-15, we define the base case. If pattern is empty, the function will return 0.
• In line 18, we take out the last symbol from pattern and convert it to the corresponding number using the symbol_to_number function.
• In line 20, we add the recursive call to the function, in which we pass pattern without the final character as an argument.
• In line 22, we return the result of the function.

### Number to pattern

To compute the pattern from a number, we need to inverse the function written above. To do this, we will reverse engineer the process of the “Pattern to number” section. Here are the steps:

1. Divide the index by 4 since we multiplied by 4 above.
2. Convert the remainder number to the corresponding symbol.
3. Repeat the above steps on the quotient until the base case gets met (the value of k becomes equal to 1).

Let’s take the same example from above to get a better understanding. Here, the numbers 0, 1, 2, and 3 correspond to A, C, G, and T, respectively. If we input the index as 7 and k (the k in k-mer) as 3 we will get the pattern ACT.

### Code

## Convert numebr to it's corresponding symbols
def number_to_symbol(number):
if (number == 0):
return 'A'
elif (number == 1):
return 'C'
elif (number == 2):
return 'G'
elif (number == 3):
return 'T'

def number_to_pattern(index, k):
## return error if conversion is
## not possible for the provided length
if index >= 4**k:
return "Index value in too large for given k"

## base case when length(k) is 1
if k == 1:
return number_to_symbol(index)

## Get prefix index
prefix_index = int(index/4)
## Get the index of the last character
remainder = index % 4
## Get last symbol
last_symbol = number_to_symbol(remainder)
## recursively get the pattern without the last symbol
prefix_pattern = number_to_pattern(prefix_index, k-1)

return prefix_pattern + last_symbol

print(number_to_pattern(7, 3))

### Explanation

• In line 2, we define the number_to_symbol function, which converts the numbers to the corresponding symbol as defined in the requirements.
• In line 13, we define the number_to_pattern function, which takes index and k as inputs.
• In lines 16-17, we insert a constraint such that index cannot exceed the largest possible value of k.
• In lines 20-21, we have the base case. If the value of k is 1, it will return the number corresponding to the symbol it was provided with.
• In line 24, we calculate the prefix_index, which is the quotient when we divide index by 4.
• In line 26, we get the remainder from the division of index with 4.
• In line 28, we get the last symbol by converting the remainder to its corresponding character.
• In line 30, we add the recursive call for the prefix_index, with length k-1.
• In line 32, we concatenate the prefix_pattern and the last_symbol and return the complete pattern.

RELATED TAGS

dna

CONTRIBUTOR Ibrahim Khan 