Getting the Visits into Unique Counts
Compute a single HyperLogLog value per message per day using hll.
We'll cover the following
Context and constraints
In a production setup, we would have the following context and constraints:
-
Tweets are published, and users from the internet are visiting tweets.
-
Our application inserts a new row in
tweet.visitor
with the visitor’s IP address each time there is a new visit to one of our Tweets. It also registers the precise timestamp of the visit. -
As we anticipate quite some success on our little application idea, we also anticipate not being able to keep all the visitor logs and not being able to respect our quality-of-service terms when computing the unique visitors on the fly each time someone needs them.
-
Finally, as numbers are being used in a marketing context rather than in an invoicing context, we’re in a position to lose some precision over the number, and we would actually like to implement a system that is lossy if it allows us to relax our storage and processing requirements.
Computing a single hyperloglog
From the tweet.visitor
table, we’re now going to compute a single HyperLogLog value per message and per day:
Get hands-on with 1200+ tech skills courses.