UPGMA (unweighted pair group method with arithmetic mean) and WPGMA (weighted pair group method with arithmetic mean) both build a tree from the distances between pairs of taxa. In simple terms, taxa are the items or groups you're trying to compare and organize, such as cats, dogs, and birds sorted by how similar they are. The two methods differ in how distances are updated after a merge: UPGMA averages over every member of the merged cluster, so each taxon counts equally, while WPGMA takes the plain average of the two merged clusters' distances regardless of their sizes, so taxa in the smaller cluster count for more. This difference can affect the tree structure.
Unweighted pair group method with arithmetic mean (UPGMA)
Key takeaways:
UPGMA (unweighted pair group method with arithmetic mean) is a hierarchical clustering method used in bioinformatics for constructing evolutionary trees by progressively merging clusters based on pairwise distance metrics among sequences.
The process involves calculating a pairwise distance matrix, merging the closest clusters, updating the distance matrix using the arithmetic mean, and repeating these steps until all sequences are grouped into a single hierarchical tree.
While UPGMA is efficient and useful for analyzing datasets with moderate divergence, it assumes a constant evolutionary rate across lineages and may not be suitable for highly divergent datasets.
UPGMA (unweighted pair group method with arithmetic mean) is a hierarchical clustering method commonly used in bioinformatics, particularly in phylogenetics, for constructing evolutionary trees based on molecular sequence data. It is a bottom-up agglomerative clustering algorithm that builds a tree by progressively merging clusters (groups of sequences) based on their pairwise distances.
How does UPGMA work?
1. Pairwise distance matrix: The first step in UPGMA is to calculate the pairwise distances between all sequences in the dataset. These distances can be based on various measures, such as genetic distances or dissimilarity scores derived from sequence similarity.
2. Initialization: Initially, each sequence is considered an individual cluster (leaf) in the tree, and the pairwise distances between them form the initial distance matrix.
3. Cluster merging: At each iteration, UPGMA identifies the two closest clusters in the distance matrix and merges them into a new cluster.
4. Updating distance matrix: After merging, the distance matrix is updated with the distances between the new cluster and each remaining cluster. The new distance is the arithmetic mean of all pairwise distances between the sequences in the new cluster and the sequences in the other cluster. For clusters A and B merged into A ∪ B, and any other cluster X, it is calculated through the formula: $d(A \cup B, X) = \frac{|A| \, d(A, X) + |B| \, d(B, X)}{|A| + |B|}$, where |A| and |B| are the numbers of sequences in A and B.
5. Repeat: Steps 3 and 4 are repeated until all sequences are clustered into a single group, forming a complete hierarchical tree structure.
6. Tree construction: The hierarchical tree structure obtained from the clustering process represents the evolutionary relationships between the sequences. Each internal node is placed at a height equal to half the distance between the two clusters it joins, and the branching pattern reflects how similar or dissimilar the sequences are.
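The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `upgma`, the dictionary layout keyed by `frozenset`, the way merged clusters are named by concatenating labels, and the four-taxon distances are all assumptions made for this example.

```python
from itertools import combinations


def upgma(labels, dist):
    """Cluster `labels` with UPGMA.

    `dist` maps frozenset({a, b}) -> distance for every pair of labels.
    Returns (tree, height): `tree` is a nested tuple of labels and
    `height` is the root height (half the distance of the final merge).
    """
    # Step 2: every label starts as its own cluster.
    # name -> (subtree, number of leaves)
    clusters = {name: (name, 1) for name in labels}
    d = dict(dist)  # working copy of the distance matrix
    height = 0.0

    while len(clusters) > 1:
        # Step 3: pick the closest pair (the first one found wins a tie).
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        tree_a, n_a = clusters.pop(a)
        tree_b, n_b = clusters.pop(b)
        # Assumed naming scheme: concatenate the labels. This only works
        # if no original label equals a concatenation of other labels.
        merged = a + b

        # Step 4: size-weighted mean distance to every remaining cluster.
        for x in clusters:
            d[frozenset((merged, x))] = (
                n_a * d[frozenset((a, x))] + n_b * d[frozenset((b, x))]
            ) / (n_a + n_b)

        clusters[merged] = ((tree_a, tree_b), n_a + n_b)

    tree, _ = next(iter(clusters.values()))
    return tree, height


# Hypothetical four-taxon matrix, chosen only to exercise the function.
labels = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 4,
    frozenset(("A", "D")): 6, frozenset(("B", "C")): 4,
    frozenset(("B", "D")): 6, frozenset(("C", "D")): 6,
}
tree, height = upgma(labels, dist)
print(tree, height)
```

Storing the leaf count alongside each cluster is what keeps the update a true mean over all member sequences. Dropping the `n_a` and `n_b` weights would turn the same loop into WPGMA.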
Let’s see an example!
Example of UPGMA
Imagine a research study focused on understanding the evolutionary relationships between different species of birds based on their DNA sequences. Researchers collect DNA samples from various bird species and sequence specific genetic markers so that the sequences can be compared.
After collecting the data, they calculate the pairwise distances between the DNA sequences of all the sampled bird species, and they represent the distances as the following matrix:
Now they will perform clustering analysis with UPGMA on the matrix. The idea is to progressively cluster the bird species based on their distances: merge the closest species into clusters, update the distance matrix, and ultimately construct a hierarchical tree structure.
Let’s apply UPGMA clustering on the matrix above!
Step 1: Choose the smallest distance
The first thing we will do is choose the two species with the smallest distance between them, in our case, species A and B, with a distance of 2.
After finding the smallest distance, we will cluster the two species together like so:
Step 2: Update the distance matrix
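Now that the cluster AB has formed, we will calculate its distance to every other species (C, D, E, F). Because A and B are both single species, the distance from AB to another species X is the mean of the two original distances. The equation to calculate the distance is: $d(AB, X) = \frac{d(A, X) + d(B, X)}{2}$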
Let’s calculate the distance of our cluster with every other species:
Distance AB with C:
Distance AB with D:
Distance AB with E:
Distance AB with F:
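The same calculation can be written as a small helper. The sketch below uses made-up distances purely for illustration. They are not the values from the matrix in this example, and the names `distance_to_pair`, `d_a`, and `d_b` are hypothetical.

```python
def distance_to_pair(d_a_x, d_b_x):
    """Distance from the two-species cluster AB to another species X."""
    return (d_a_x + d_b_x) / 2


# Hypothetical distances from A and from B to the other species.
d_a = {"C": 4, "D": 6, "E": 6, "F": 8}
d_b = {"C": 6, "D": 8, "E": 6, "F": 8}

# New row of the distance matrix for the cluster AB.
d_ab = {x: distance_to_pair(d_a[x], d_b[x]) for x in d_a}
print(d_ab)
```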
The updated distance matrix will be as follows:
Now that we have an updated matrix, all we have to do is repeat the above steps. Those two steps made up the first cycle, so let’s move on to the second one!
Second cycle
In the updated matrix, two pairs are tied for the minimum distance, so we have a choice of two clusters. We can either make a cluster of DE or we can make a cluster of ABC. Let’s go with DE for simplicity:
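Ties like this one are easy to miss when scanning a matrix by eye. A possible way to list every pair that attains the minimum is sketched below. The function name `closest_pairs` and the distances are assumptions for illustration, not values taken from the matrix above.

```python
import math


def closest_pairs(dist):
    """Return the smallest distance and every pair of clusters that attains it."""
    smallest = min(dist.values())
    tied = sorted(
        pair for pair, value in dist.items() if math.isclose(value, smallest)
    )
    return smallest, tied


# Hypothetical distances between the current clusters.
dist = {
    ("AB", "C"): 4.0,
    ("AB", "D"): 6.0,
    ("AB", "E"): 6.0,
    ("C", "D"): 8.0,
    ("C", "E"): 8.0,
    ("D", "E"): 4.0,
}
smallest, tied = closest_pairs(dist)
print(smallest, tied)
```

When more than one pair is returned, the order in which the ties are resolved can change the resulting tree, so it helps to apply a consistent rule.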
Next, let’s calculate the distances of other species from DE:
Distance DE with AB:
Distance DE with C:
Distance DE with F:
Let’s update the distance matrix with the new distances:
Let’s repeat the process again!
Third cycle
Now, the smallest distance is between AB and C, so let’s create a cluster of ABC:
We will now calculate the distance from ABC to the other nodes:
Distance ABC with DE:
Distance ABC with F:
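This is the first cycle in which the merged clusters differ in size, so it is worth writing the update out symbolically. ABC was formed from AB (two species) and C (one species), which means the distances from AB count twice as much as the distances from C. The expressions below use generic symbols only; no values from the matrix are assumed.

```latex
\begin{align*}
d(ABC, F)  &= \frac{2\, d(AB, F) + 1\, d(C, F)}{3}
            = \frac{d(A, F) + d(B, F) + d(C, F)}{3} \\
d(ABC, DE) &= \frac{2\, d(AB, DE) + 1\, d(C, DE)}{3}
\end{align*}
```

The second form of the first line shows that the result is still a plain average over the individual species in ABC, which is why the method is called unweighted.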
Let’s update the distance matrix with the new distances:
Only one more cycle is left, so let’s do that!
Fourth cycle
Now, the smallest distance is between ABC and DE, so let’s create a cluster of ABCDE:
We will now calculate the distance from ABCDE to the only remaining node, F:
Distance ABCDE with F:
Let’s update the distance matrix with the new distance:
Only ABCDE and F remain, so joining them completes the clustering, and we can create our final tree!
Create a final hierarchical tree
The final hierarchical tree, or the phylogenetic tree, shows how similar the species are to one another. We have created the tree in parts, so let’s merge them all:
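One compact way to write the merged tree down is the Newick format, which nests clusters in parentheses. The sketch below encodes the merge order described in the cycles above (A with B, D with E, AB with C, ABC with DE, and F joined last) as nested tuples and prints the topology. Branch lengths are left out because they depend on the matrix values, and the helper name `to_newick` is an assumption for this example.

```python
def to_newick(tree):
    """Convert a nested tuple of species labels into a Newick topology string."""
    if isinstance(tree, str):
        return tree  # a leaf is just its label
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


# Merge order from the walkthrough: AB, DE, ABC, ABCDE, then F.
ab = ("A", "B")
de = ("D", "E")
abc = (ab, "C")
abcde = (abc, de)
final_tree = (abcde, "F")

newick = to_newick(final_tree) + ";"
print(newick)
```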
Conclusion
UPGMA’s simplicity and computational efficiency make it a valuable method for analyzing large datasets with moderate sequence divergence. However, it’s important to note that UPGMA assumes a constant evolutionary rate across lineages and may not be suitable for highly divergent datasets, where that assumption often fails. Through its application in fields such as evolutionary biology, genetics, and ecology, UPGMA continues to contribute to our understanding of evolutionary change.
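The constant-rate assumption can be checked on a distance matrix before trusting a UPGMA tree. Under a constant rate, distances are ultrametric: for any three sequences, the two largest of the three pairwise distances are equal. The sketch below tests that condition. The function name `is_ultrametric`, the tolerance, and both sets of distances are assumptions for illustration.

```python
from itertools import combinations


def is_ultrametric(labels, dist, tol=1e-9):
    """Check the three-point condition on every triple of labels.

    `dist` maps frozenset({a, b}) -> distance for every pair of labels.
    """
    for a, b, c in combinations(labels, 3):
        d = sorted(
            [dist[frozenset((a, b))], dist[frozenset((a, c))], dist[frozenset((b, c))]]
        )
        # The two largest distances must match (within the tolerance).
        if abs(d[2] - d[1]) > tol:
            return False
    return True


labels = ["A", "B", "C"]

# Hypothetical clock-like distances: C is equally far from A and B.
clock_like = {
    frozenset(("A", "B")): 2,
    frozenset(("A", "C")): 4,
    frozenset(("B", "C")): 4,
}

# Hypothetical distances where B has changed faster than A.
uneven = {
    frozenset(("A", "B")): 2,
    frozenset(("A", "C")): 4,
    frozenset(("B", "C")): 5,
}

print(is_ultrametric(labels, clock_like), is_ultrametric(labels, uneven))
```

Real data rarely pass this test exactly, so a failed check is a signal to compare the UPGMA tree against a method that does not assume a constant rate, not a verdict on its own.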
Quiz
Test your knowledge with the quiz below.
What is the primary application of UPGMA in bioinformatics?
Data compression
Constructing evolutionary trees based on molecular sequence data
Predicting gene function
Aligning DNA sequences
Frequently asked questions
What is the difference between UPGMA and WPGMA?
What is the function of UPGMA?
Is UPGMA rooted or unrooted?