What is CUME_DIST() in PostgreSQL?

A database is a collection of structured data stored in an organized manner. PostgreSQL is a relational database management system that employs SQL for fundamental operations such as creating tables, inserting data into tables, modifying the records, and deleting the existing data from the tables. Understanding SQL keywords is crucial for being skilled in SQL. In this context, an important keyword,These calculate values for each row in a query result set based on a specified window of rows. CUME_DIST(), is a window functionIt calculates values for each row in a query result set based on a specified window of rows. that returns the cumulative distribution of a value in a group of data values (dataset).

The `CUME_DIST()` function

The CUME_DIST() computes a cumulative value based on the rows in a dataset. It also works with datasets in the partitions. Its return values range from greater than 0 and less than or equal to 1. The CUME_DIST() function is particularly used in scenarios where you need to understand the relative standing of a value within a dataset, such as finding percentilesA percentile is a value on a scale of 100 that compares the current row's value to the other rows' values in the same dataset. Percentiles in decimal form range from 0 to 1.. Suppose we have student scores in different subjects: CUME_DIST() shows the percentage of students who scored less than or equal to a given score, which is useful for understanding cumulative standing.

Syntax

Following is the syntax for CUME_DIST():

The CUME_DIST() represents the window function.
We create partitions by defining the column/condition in the PARTITION BY clause. This clause becomes optional if there’s no need to create partitions. In such cases, the window function will treat the dataset as a single window.
The ORDER BY clause specifies the order within each partition.

Coding examples

Let’s understand the CUME_DIST() window function through coding examples.

Example table

We have some data on the student’s exam results in the Exam_result table that we can view using the SELECT query.

Note: We have modified the records in each example to understand the concept clearly.

`CUME_DIST()` of a single dataset

The CUME_DIST() returns a proportion value based on the preceding rows in datasets. It also works with partitions of a dataset. The return value ranges between 0 and 1.

A value of 0.2 for a row in ascending order means that the current row’s value is less than or equal to 20% of the values in the dataset or partition (if applicable). In other words, 20% of the rows have a value less than or equal to the current row’s value. Conversely, in descending order, 0.2 means that the current row's value is greater than or equal to 20% of the values in the dataset or partition (if applicable). Thus, 20% of the rows have a value greater than or equal to the current row’s value.

Note: Check out the output in descending order by replacing ASC with DESC in the ORDER BY clause in the above code. Also, the values are trimmed for this example.

`CUME_DIST()` of a partitioned dataset

As the return value of the function CUME_DIST() ranges from 0 to 1, it generates the values within the limit for each window of the data.

Let’s partition the data based on the subject names. We’ll have three partitions according to the available data. Click the “Run” button in the following playground to view the output:

You may see that the students with the Science subject have the same value, 1, when there is a tie for the lowest score. Also, this function assigns 0.4 to all the records if there is a tie on the highest score, as we can see for students with the English subject.

Note: Check out the output in descending order by replacing ASC with DESC in the ORDER BY clause in the above code.

`CUME_DIST()` for `NULL` values

The CUME_DIST() function considers NULL values as well. It assigns the lowest value to the record with NULL value if we sort the rows in descending order and 1 if we sort the rows in ascending order. Following is the example of the output sorted in ascending order.

`CUME_DIST()` vs. `PERCENT_RANK()`

In PostgreSQL, CUME_DIST() and PERCENT_RANK() are similar ranking functions: CUME_DIST() shows the percentage of students who scored less than or equal to a given score, which makes it useful for understanding cumulative standing while PERCENT_RANK() shows the rank of a student’s score relative to others, which makes it useful for comparing individual performance. To discuss the difference between CUME_DIST() and PERCENT_RANK(), we must understand the percentiles first. A percentile is a value on a scale of 100 that compares the current row’s value to the other rows’ values in the same dataset. Percentiles in decimal form range from 0 to 1. Let’s look at the distinct behaviors of CUME_DIST() and PERCENT_RANK() functions.

In PostgreSQL, CUME_DIST() calculates the cumulative distribution of a value within a dataset, including the current row, while PERCENT_RANK calculates the relative rank within a dataset, excluding the current row.
Both functions CUME_DIST() and PERCENT_RANK() have a maximum value of 1. The highest possible value for both functions is 1, which indicates the highest rank or the full distribution.
The function PERCENT_RANK() doesn’t assign 1 to any of the values if we have ties. In contrast, CUME_DIST() assigns the value 1 to all the records with maximum values.
The range for PERCENT_RANK() starts from 0, which indicates the lowest possible rank. On the other hand, CUME_DIST() does not start from 0; instead, it assigns a minimum value to the record. This minimum value represents the cumulative distribution of the record within the dataset or partition.

Let’s have a look at the following example for more clarification.

Conclusion

To understand the CUME_DIST() window function in PostgreSQL, we have seen a few examples that use this function. These examples include single, partitioned datasets with duplicate values and datasets with NULL values. In the end, we have also compared it with PERCENT_RANK(). Through detailed examples showcasing both ascending and descending order scenarios, this Answer serves as a practical guide. It also empowers users to leverage percentiles effectively in SQL for diverse data analysis needs.

Free AI Mock Interviews

Coding Interview

Coding PatternsFree Interview

Gain insights and practical experience with coding patterns through targeted MCQs and coding problems, designed to match and challenge your expertise level.

System Design

YouTubeFree Interview

Learn to design a video streaming platform like YouTube by tackling functional and non-functional requirements, core components, and high-level to detailed design challenges.

Free Resources