What is HyperLogLog in Redis?

HyperLogLog is a data structure available in Redis that is used to count the number of unique elements in a set using a small, constant amount of memory. The element count is approximated with a standard error of 0.81%.

PFADD, PFCOUNT, and PFMERGE

The HyperLogLog data structure is available in Redis via the PFADD, PFCOUNT, and PFMERGE commands.

PFADD adds all the arguments provided to the HyperLogLog data structure with the key specified as the first argument.

If the item count estimated by HyperLogLog is changed after the command is executed, PFADD returns 1; otherwise, it returns 0.

PFCOUNT returns the approximate item count stored by the HyperLogLog data structure that is specified by the key provided as the first argument.

If multiple keys are provided, PFCOUNT internally merges the HyperLogLogs stored at the provided keys into a temporary HyperLogLog. Then, it returns the approximate count of the union of HyperLogLogs passed.

> PFADD users John Ted Ray Andy
(integer) 1
> PFCOUNT users
(integer) 4

PFMERGE merges multiple source HyperLogLog values into a unique value that will approximate the count of the union of the sets of the source HyperLogLog data structures.

The example below merges the HyperLogLogs, represented by months and years, into time:

> PFADD months jan feb mar
(integer) 1
> PFADD years 2011 2012 2013 2014
(integer) 1
> PFMERGE time months years
"OK"
> PFCOUNT time
(integer) 7

More on HyperLogLog

If we want to know the exact number of unique items in a set, you need memory that is proportional to the number of items in the set. In bigger sets, the space complexity becomes very large.

For example, if we need to count unique email addresses in a set of 10 million email addresses, where half of them are unique, the amount of memory sufficient to hold 5 million unique email addresses is required.

HyperLogLog solves this problem with little memory using a randomization algorithm. The inventor of this algorithm was Philippe Flajolet, which is why we have PF as a prefix for all the REDIS Hyperloglog commands.

HyperLogLog Algorithm

The HyperLogLog algorithm works on uniformly distributed random numbers and is used to estimate the count of unique items in a set. The key concept behind this algorithm is that if we count the maximum number of leading zeros in the set items (hashed to uniform random numbers and represented in binary), we can estimate the number of unique items with a simple calculation.

If the maximum number of leading zeros is $n$ , the estimated cardinality of the set is $2^n$ .

Let’s assume that we need to count the unique visitors for a web page, and that we have a large set containing IP addresses of all requests.

We will first convert the input IP addresses to a set of uniformly distributed random numbers using a Hash function. The cardinality doesn’t change here as only the IP addresses are converted to uniformly distributed random numbers.
These random numbers are divided into different subsets using the initial few bits. The number of maximum leading zeros, within its values, are stored in memory.
We calculate the harmonic mean of estimates for all the previously computed subsets.
The harmonic mean computed above is an estimation of the unique visitor count on the web page.

Free Resources

License: Creative Commons-Attribution-ShareAlike 4.0 (CC-BY-SA 4.0)