System Design: Designing a Trending Feature
An example interview question about designing a trending feature.
Question
Design the backend system for a top N trending topics feature.
Background
Similar to previous system design questions, this question is ambiguous and open-ended. It also involves streaming data, which is a nice tie-in with the previous question.
Solution Approach
We’ll follow the same approach as we did for the previous question:
- Define system requirements: Before we begin any design, we’ll spend a few minutes agreeing on the one to two primary use cases you’ll need to cover.
- System breakdown: We’ll then diagram the key individual components of the system needed to address the one to two primary use cases.
- Dataflow discussion: We’ll discuss the necessary data points the system will need to collect and how they flow through the system.
- Scaling the design: We’ll talk about design choices and trade-offs to ensure that the system scales.
- Capacity modeling: We’ll estimate the request volume and data volume the system will need to handle for a given time period.
Sample answer
System requirements
Before we begin defining the system requirements, we need to make some clarifying assumptions:
- The system will receive as input a continuous stream of “topics”: keywords or phrases that the system will rank, emitting the top N. The system will not need to parse or delimit topics; it only needs to aggregate them. We can assume the topics will be well-ordered by timestamp.
- The top N topics will be determined by simply counting the number of times a topic appears in the stream of topics and then returning the top N by this count.
- There will be a window of time over which the system will need to aggregate the top topics (e.g., identify the top N “topics” over the last 24 hours).
From these assumptions, we can define the key requirement of the system:
- Find the top N topics from a continuous stream of topics over a window of time: From the stream of topics, we will need to count the occurrences of each topic for a window of time, rank each topic by frequency of occurrence, and provide the top N topics to downstream consumers.
System breakdown
Let’s break down the role of each of the following components in the overall system.
- Topics Stream: This is the continuous stream of topics that will be inputs to our system.
- Topics Aggregator: This is the meat of the system. Here the system will count the frequency of each topic appearing in the Topics Stream for a period of time. This will be done using a sliding window approach. Here is how it works:
  - The Topics Aggregator is divided into slots, with each slot corresponding to a unit of time (e.g., a minute).
  - Each slot consists of a dictionary of key-value pairs where the key is the topic and the value is the count of occurrences of that topic in that slot.
  - As topics flow in, they will be counted in the slot corresponding to the unit of time since system start. For example, with a slot size of one, topics entering at t <= 1 (where t is the amount of time elapsed since the start of the system) will be counted in Slot 1, those entering at 1 < t <= 2 will be counted in Slot 2, and so forth. A topic is written as a key in the slot’s dictionary the first time it appears in that slot; subsequent occurrences increment its count.
  - One obvious concern is how many slots we will retain versus delete, since the volume of data is not constrained and will continue to grow. Choosing the right number of slots is out of scope for this exercise, but a way to reuse slots is proposed in the Scaling the Design section.
  - For a given period Y and window size N, the N slots starting from the first slot in the current window will be sent to the Topics Ranker every period Y. The start of the window will then advance by Y (hence a “sliding” window).
  - For example, assuming each slot corresponds to one minute of time, if Y = 1 minute and N = 3 minutes, then every minute the current three slots in the window will be sent to the Topics Ranker. The first window sent to the Topics Ranker will correspond to the first three slots at t = 3, since this is when sufficient time has passed for all three slots to be filled with data. (A minimal code sketch of this sliding window follows this list.)
- Topics Ranker: The Ranker is where the topics are aggregated across the slots of the window received from the Topics Aggregator and sorted by frequency. The top N topics will then be available for use by Downstream Consumers.
- Downstream Consumers: These systems consume the output of the Topics Ranker for their respective use cases.
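As a rough illustration of the mechanics above, here is a minimal Python sketch of the slot-based sliding window, assuming one-minute slots and a three-slot window. The class and method names (TopicsAggregator, ingest, current_window) are illustrative assumptions, not part of any specific framework:

```python
import time
from collections import defaultdict


class TopicsAggregator:
    """Counts topic occurrences in per-unit-of-time slots and exposes the current window."""

    def __init__(self, slot_seconds: float = 60.0, window_slots: int = 3):
        self.slot_seconds = slot_seconds   # size of one slot (e.g., one minute)
        self.window_slots = window_slots   # number of slots in the sliding window
        self.start = time.time()           # system start time
        # slot index -> {topic: count}
        self.slots: dict[int, dict[str, int]] = defaultdict(lambda: defaultdict(int))

    def ingest(self, topic: str, event_time: float | None = None) -> None:
        """Count a topic in the slot that covers its arrival time."""
        t = (event_time if event_time is not None else time.time()) - self.start
        slot_index = int(t // self.slot_seconds)  # t in [0, 60) -> slot 0, [60, 120) -> slot 1, ...
        # Slots could instead be reused circularly (slot_index % max_slots), as discussed
        # under "Scaling the design"; this sketch simply keeps every slot.
        self.slots[slot_index][topic] += 1

    def current_window(self, now: float | None = None) -> list[dict[str, int]]:
        """Return the most recent slots; in the full design these are pushed to the
        Topics Ranker every period Y, after which the window slides forward by Y."""
        t = (now if now is not None else time.time()) - self.start
        last_slot = int(t // self.slot_seconds)
        first_slot = max(0, last_slot - self.window_slots + 1)
        return [dict(self.slots[i]) for i in range(first_slot, last_slot + 1)]
```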
Dataflow discussion
Now let’s put everything together and discuss how data will flow through the system. We will assume a slot size of 1, a window size of N, and a period of Y.
- Topics will flow into the system on a continuous basis via the Topics Stream.
- For each topic arriving on the Topics Stream, the Topics Aggregator will increment the frequency count of that topic in the appropriate slot based on the current time since system inception. For example, events entering between t=0 and t=1 will be counted in Slot 1, those between t=1 and t=2 in Slot 2, and so forth. A topic is written as a key in the slot’s dictionary the first time it appears in that slot; subsequent occurrences increment its count.
- A window in the Topics Aggregator corresponds to N consecutive slots: a window of size 1 corresponds to one slot, size 2 to two slots, and so forth.
- At regular intervals of period Y, the Topics Aggregator will push all the slots in the current window to the Topics Ranker and increment or “slide” the window by Y.
- The Topics Ranker will aggregate frequency counts of all the topics in the window, sort by frequency, and make available the top N topics to the Downstream Consumers.
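To make the ranking step concrete, here is a minimal sketch of how the Topics Ranker could sum the per-slot counts it receives each period and produce the top N topics; the function name and signature are assumptions for illustration:

```python
from collections import Counter


def rank_topics(window_slots: list[dict[str, int]], top_n: int) -> list[tuple[str, int]]:
    """Sum each topic's per-slot counts across the window and return the top N by frequency."""
    totals: Counter[str] = Counter()
    for slot in window_slots:
        totals.update(slot)            # add this slot's per-topic counts to the running totals
    return totals.most_common(top_n)   # (topic, count) pairs sorted by count, highest first
```

Downstream Consumers would then read the returned (topic, count) pairs each time the window slides.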
Scaling the design
- Flexible capacity and redundancy: Having the ability to quickly obtain and then release storage and processing capacity will allow the system to handle spiky increases in the Topics Stream. It will also enable redundancy, since the system can obtain the additional resources needed for seamless failover in case of system failure.
- Rate limiting: The Topics Aggregator may need to impose a rate limit on topic writes if there is a massive increase in write volume over a short amount of time.
- Circular slots in the Topics Aggregator: To reduce storage consumption, we may impose a limit on the number of slots and allow slots to be overwritten in a circular fashion. For example, if we have six slots in the Topics Aggregator with a slot size of 1, then we will overwrite the topics in Slot 1 for events entering between t=6 and t=7, overwrite the topics in Slot 2 for events entering between t=7 and t=8, and so forth. The trade-off with this approach is lower data retention for older topics past a certain age (this is acceptable given that “trending” topics are usually only relevant at a very specific point in time).
- Pull-based vs. ...