Trusted answers to developer questions
Trusted Answers to Developer Questions

Related Tags

suffix tree

What is a suffix tree?

Rukhshan Haroon

A suffix tree is a tree data structure typically used to store a list of strings. It is also referred to as the compressed version of a trie, as, unlike a trie, each unique suffix in the list is compressed together and represented by a single node or branch in a suffix tree.

There are many ways to construct a suffix tree, but the semantics that is shared by most if not all types of suffix trees are as follows:

  • A special character is appended to each sub-string.
  • Each leaf node contains the starting position or index of the suffix it represents.
  • The alphabets of any suffix are compressed and represented by a single node.

Before we learn about suffix trees, it is important to know the difference between a trie and a suffix tree.

Trie

In a trie, each alphabet of all the strings in the prescribed string is parsed one by one and represented by a single node. If two or more words start with the same sub-string, the identical sub-string is represented by the same chain of nodes. The chain breaks where the sub-string ends and the unique suffix begins. Subsequently, each alphabet of the remaining suffix is represented by a separate node.

The list below contains the names of seven animals:

%0 node_1624536285246 MACAW node_1624536298987 PANDA node_1624536206425 LEAOPARD node_1 RAT node_2 RHINO node_3 PARROT node_1624536288079 MACKERAL

A simple illustration of the trie for our list of animal names would look like this:

%0 node_1 node_2 R node_1->node_2 node_3 P node_1->node_3 node_4 L node_1->node_4 node_1624534744952 M node_1->node_1624534744952 node_1624534837748 A node_2->node_1624534837748 node_1624534582369 H node_2->node_1624534582369 node_1624534669344 A node_3->node_1624534669344 node_1624534713755 E node_4->node_1624534713755 node_1624534749466 A node_1624534744952->node_1624534749466 node_1624534856059 T node_1624534837748->node_1624534856059 node_1624534586516 I node_1624534582369->node_1624534586516 node_1624534671900 N node_1624534669344->node_1624534671900 node_1624534880795 R node_1624534669344->node_1624534880795 node_1624534716255 O node_1624534713755->node_1624534716255 node_1624534751419 C node_1624534749466->node_1624534751419 node_1624534598794 N node_1624534586516->node_1624534598794 node_1624534676503 D node_1624534671900->node_1624534676503 node_1624534887720 R node_1624534880795->node_1624534887720 node_1624534718798 P node_1624534716255->node_1624534718798 node_1624534753722 A node_1624534751419->node_1624534753722 node_1624535005526 K node_1624534751419->node_1624535005526 node_1624534607207 O node_1624534598794->node_1624534607207 node_1624534678898 A node_1624534676503->node_1624534678898 node_1624534890128 O node_1624534887720->node_1624534890128 node_1624534722225 A node_1624534718798->node_1624534722225 node_1624534756013 W node_1624534753722->node_1624534756013 node_1624535009322 E node_1624535005526->node_1624535009322 node_1624534893729 T node_1624534890128->node_1624534893729 node_1624534724775 R node_1624534722225->node_1624534724775 node_1624535011907 R node_1624535009322->node_1624535011907 node_1624534727499 D node_1624534724775->node_1624534727499 node_1624535015639 A node_1624535011907->node_1624535015639 node_1624535018699 L node_1624535015639->node_1624535018699

Suffix Tree

As mentioned earlier, each unique suffix in the list is compressed together in a suffix tree. An illustration of the suffix tree of the same list of animal names that we used earlier would look like this:

%0 node_1 node_1624538589865 $ node_1->node_1624538589865 node_2 R node_1->node_2 node_3 PA node_1->node_3 node_4 LEOPARD$ node_1->node_4 node_1624535714132 MAC node_1->node_1624535714132 node_1624535766298 AT$ node_2->node_1624535766298 node_1624535781347 HINO$ node_2->node_1624535781347 node_1624538605458 $ node_2->node_1624538605458 node_1624538628014 $ node_3->node_1624538628014 node_1624535804761 NDA$ node_3->node_1624535804761 node_1624535795475 RROT$ node_3->node_1624535795475 node_1624538638250 $ node_1624535714132->node_1624538638250 node_1624535844028 AW$ node_1624535714132->node_1624535844028 node_1624535861550 KERAL$ node_1624535714132->node_1624535861550

Typically, we do not want an implicit suffix tree, and terminating each node with any special character helps us avoiding one. In our examples, we use the character.

Usage

The application of suffix trees is diverse and inter-disciplinary in nature.

In Computational Biology, suffix trees are widely used to identify the repeating structures in a DNA molecule. Similarly, it may be used to find the longest common sub-string or sub-sequence in a DNA sequence. These techniques are vital to the study of evolution and to trace similarities between organisms.

Moreover, in Forensic Science, it is crucial to make sure that DNA samples are not contaminated. Using suffix trees, analysts can verify if a given DNA sequence is contaminated or not!

RELATED TAGS

suffix tree

CONTRIBUTOR

Rukhshan Haroon
Copyright ©2022 Educative, Inc. All rights reserved
RELATED COURSES

View all Courses

Keep Exploring