6 Tips from a System Design Expert: How Twilio Builds for Scale

Understand six expert takeaways on building for scale, based on insights from Twilio’s former VP, Mark Gilbert. Unlock key principles for architecture, microservices, and planning for system lifespans.

The importance of System Design in the current tech landscape cannot be overstated.

Companies require software systems that are resilient, reliable, performant, and scalable to effectively serve customers and achieve their business objectives. As a result, System Design has become essential to the software development process.

This lesson draws its insights from Mark Gilbert, former VP of Product Management at Twilio, who shared his experiences scaling the platform. Whether a learner is new to System Design or already working on distributed systems, these fundamentals are critical for career advancement.

Twilio and its systems

Twilio is a Communications Platform as a Service (CPaaS) company that enables businesses to easily communicate with customers and others via text messages, emails, and calls.

The company is well known for programmable communication tools that work across many channels through web service APIs. During his tenure, Mark Gilbert oversaw the Twilio Super Network and the onboarding of phone numbers.

With the Super Network, Twilio combined the communication networks of multiple global carriers to support channels like voice, text messaging, email, and video reliably.

Getting customers onto the Super Network involved onboarding large numbers of phone numbers and working on international infrastructure. As for Twilio’s tech stack, different teams worked on different parts of the system. Teams often gravitated toward their preferred programming languages, but Twilio later began pushing to migrate most systems to Java, which it valued for its type safety.

For data storage, Twilio used multiple solutions.

When it started, Twilio had to build much of its own infrastructure, including components that are now available off the shelf from cloud providers like Microsoft, Amazon, and Google. Today, Twilio’s systems operate at a very large scale: on a big day, Twilio sends single-digit billions of emails and hundreds of millions of text messages.

Traffic is also spiky: those emails and messages are not spread evenly throughout the day.
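To get a feel for volumes like these, daily totals can be converted into per-second rates. The sketch below is a rough back-of-the-envelope calculation, not Twilio’s real numbers: the daily volumes are simply picked from within the ranges quoted above, and the peak-to-average ratio of 5 is an assumption made purely for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(messages_per_day: float) -> float:
    """Average messages per second if traffic were perfectly even."""
    return messages_per_day / SECONDS_PER_DAY


def peak_rate(messages_per_day: float, peak_to_average: float) -> float:
    """Estimated peak messages per second for spiky traffic."""
    return average_rate(messages_per_day) * peak_to_average


# Assumed values for illustration only.
ASSUMED_DAILY_VOLUMES = {
    "email": 2_000_000_000,  # somewhere in "single-digit billions"
    "sms": 300_000_000,      # somewhere in "hundreds of millions"
}
ASSUMED_PEAK_TO_AVERAGE = 5.0  # hypothetical spikiness factor

for channel, volume in ASSUMED_DAILY_VOLUMES.items():
    avg = average_rate(volume)
    peak = peak_rate(volume, ASSUMED_PEAK_TO_AVERAGE)
    print(f"{channel}: average {avg:,.0f}/s, assumed peak {peak:,.0f}/s")
```

The point of the exercise is that a system sized for the average rate would be badly undersized during a spike, so capacity planning has to start from the peak.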

Because of this and other difficulties particular to Twilio’s technology, scaling was a continuous challenge and a top concern for everyone at Twilio as it built out its systems. To meet that challenge, Twilio paid close attention to System Design principles and constructed its architectures carefully.

Six takeaways about System Design and architecting for scale

The continuous challenge of scaling at Twilio, as described in the case study, led to the following six key takeaways.

1. Add capabilities to existing architectures when possible

Leaders and engineers at Twilio debated daily about how to improve systems.

All that discussion happened for good reason: Twilio’s customer base was growing quickly, and changing a system with all those customers on it was a major challenge. It became essential to determine the safest and most efficient ways to improve.

To determine system requirements, the teams went back to basics: they looked at expected traffic scenarios, did the math, and penciled out which designs would handle the load and which would not. In practice, the main option was adding role instances (nodes or servers) wherever possible. Sometimes that worked. Sometimes it worked, but inefficiently, and the teams had to investigate why.
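The “doing the math” step can be sketched as a small calculation: given an expected peak load and the throughput of one instance, how many instances are needed? This is a minimal illustration under stated assumptions. The function names, the 25% headroom, and the sample numbers are all hypothetical and do not come from Twilio.

```python
import math


def instances_needed(peak_rps: float, per_instance_rps: float,
                     headroom: float = 0.25) -> int:
    """Instances required to serve peak_rps while keeping `headroom`
    (a fraction of each instance's capacity) in reserve."""
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_rps = per_instance_rps * (1 - headroom)
    return math.ceil(peak_rps / usable_rps)


def meets_scenario(current_instances: int, peak_rps: float,
                   per_instance_rps: float, headroom: float = 0.25) -> bool:
    """True if the current fleet can absorb the expected peak."""
    return current_instances >= instances_needed(
        peak_rps, per_instance_rps, headroom)


# Hypothetical scenario: 12,000 requests/s at peak, 500 requests/s per instance.
print(instances_needed(12_000, 500))
print(meets_scenario(20, 12_000, 500))
```

A calculation like this assumes throughput grows linearly with instance count. When adding instances works “but inefficiently,” that assumption is usually what has broken down.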

2. Sometimes adding servers is not the solution

Certain system challenges at Twilio defied the solution of adding extra servers.

For example, Twilio works with thousands of mobile operators and providers worldwide. Many of these providers don’t have modern systems that scale. This led to problems with downstream capacity: a provider’s systems would struggle when they received more messages than they could handle.

Some providers would drop the messages altogether, while others would deliver them late. They also didn’t return clean errors, so all kinds of “weird behaviors” could occur. In these situations, adding servers didn’t solve the problem.

In addition, some of Twilio’s systems created a lot of communication between servers, which would ultimately cause capacity issues.

The amount of “chatter” could become so high that adding role instances no longer helped the system scale. The answer then involved having engineers review the communication, the services that created it, and the partitioning of the servers. Changing those things while the systems were running continuously was a fairly heavy lift.
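A simple way to see why chatter can defeat horizontal scaling is to count communication paths. The sketch below assumes the worst case, where every instance talks to every other instance. The source does not say which communication pattern Twilio’s systems used, so this is illustrative only, and the function names are hypothetical.

```python
def full_mesh_links(instances: int) -> int:
    """Number of instance-to-instance links if everyone talks to everyone."""
    return instances * (instances - 1) // 2


def partitioned_links(instances: int, partitions: int) -> int:
    """Links when instances are spread evenly over partitions and only
    communicate within their own partition."""
    if partitions <= 0:
        raise ValueError("partitions must be positive")
    base, extra = divmod(instances, partitions)
    sizes = [base + 1] * extra + [base] * (partitions - extra)
    return sum(full_mesh_links(size) for size in sizes)


for n in (10, 20, 40):
    print(n, full_mesh_links(n), partitioned_links(n, 4))
```

Under this assumption, doubling the number of instances roughly quadruples the number of links, while partitioning keeps the growth far smaller. That is one reason reviewing the partitioning can help where adding servers cannot.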

When adding servers wouldn’t do the trick, the next step was to rearchitect certain pieces of the system for scalability. This became a high-priority and time-intensive effort.
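For the downstream-capacity problem described earlier in this section, one widely used technique is to throttle what is sent to each provider so that it never receives more than it can handle. The source does not say which mechanism Twilio used. The following is a minimal token-bucket sketch, and the class name, rates, and timestamps are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Allows at most `rate_per_sec` sends per second on average,
    with bursts of up to `capacity` messages."""
    rate_per_sec: float
    capacity: float
    tokens: float = 0.0
    last_refill: float = 0.0

    def try_send(self, now: float) -> bool:
        """Return True if a message may be sent at time `now` (seconds)."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.rate_per_sec)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should queue or retry later


# Hypothetical provider that tolerates 2 messages/s with a burst of 2.
bucket = TokenBucket(rate_per_sec=2.0, capacity=2.0, tokens=2.0)
print([bucket.try_send(now=0.0) for _ in range(3)])
print(bucket.try_send(now=0.5))
```

Time is passed in explicitly here to keep the example deterministic. A real sender would typically read a monotonic clock and queue the messages that are refused.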

3. Optimize the number and size of microservices

Twilio used a microservices architecture.

As with many large-scale cloud-based services that emerged in the mid-2000s, Twilio employed microservices, although not to the extent its engineers would have liked. Start-ups have to experiment and revise their designs frequently in their early days. At that stage, Twilio had to weigh the potential benefits of a microservices architecture against the cost of changing it later.

The trick was finding a sweet spot: not too many microservices, but not too few.

Splitting an app’s architecture into as many microservices as possible is a sign to reconsider the design. So is putting everything into a single service. There were two main considerations when determining whether a system’s microservices were getting too small or too numerous:

Diagnostics and debuggability: A plan is needed for investigating errors and failures.
Performance: If tightly coupled functionality is split across service boundaries, performance issues will appear.

Trying to find the optimal number of microservices prematurely is challenging.

Look at the requirements for the front end (responsiveness) and the back end (reliability), and try to anticipate which services may need to be split up. Then, as the app receives more traffic, it becomes clearer which microservices are likely to have issues.
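The performance consideration above can be made concrete with a rough latency budget. The sketch assumes a request passes through services one after another and that every network hop adds a fixed overhead. Both are simplifications, and the numbers and function names are hypothetical.

```python
import math


def request_latency_ms(compute_ms: float, network_hops: int,
                       per_hop_ms: float) -> float:
    """Total latency when service calls happen sequentially."""
    return compute_ms + network_hops * per_hop_ms


def max_hops_within_budget(budget_ms: float, compute_ms: float,
                           per_hop_ms: float) -> int:
    """Most sequential hops that still fit in the responsiveness budget."""
    if per_hop_ms <= 0:
        raise ValueError("per_hop_ms must be positive")
    remaining_ms = budget_ms - compute_ms
    return max(0, math.floor(remaining_ms / per_hop_ms))


# Hypothetical: 40 ms of real work, 5 ms overhead per service-to-service call.
print(request_latency_ms(compute_ms=40, network_hops=0, per_hop_ms=5))
print(request_latency_ms(compute_ms=40, network_hops=6, per_hop_ms=5))
print(max_hops_within_budget(budget_ms=100, compute_ms=40, per_hop_ms=5))
```

The amount of work is the same in both cases. Only the number of boundaries crossed changes, which is why splitting too finely shows up as a responsiveness problem.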

4. Plan around services’ expected lifespans

To scale from prototype to production, one must pencil out how long components are expected to last. To do so:

Break down the app’s architecture into services: What are the pieces? How will they communicate? What are the resulting requirements?
Figure out which services you plan to keep and which you expect to replace: Invest time in understanding how to scale the services you plan to keep. Put appropriate service boundaries around them. Don’t spend as much design time on pieces you are less confident will last.

At Twilio, this was applied to a large communication service. The team identified that the underlying components (text messages and emails) would stay consistent, so they invested more design effort in making sure those components could scale. At the same time, they anticipated having to rewrite the user interface multiple times, so they invested less architectural thinking there.
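The lifespan exercise above can be captured in a small data structure that ranks services by how much design effort they deserve. This is only a sketch of the idea: the scoring rule (expected lifespan multiplied by confidence), the service names, and the numbers are hypothetical and are not taken from Twilio.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Service:
    name: str
    expected_lifespan_years: float  # how long the team expects to keep it
    confidence: float               # 0.0 to 1.0, trust in that estimate


def design_priority(service: Service) -> float:
    """Hypothetical score: long-lived, high-confidence services rank first."""
    return service.expected_lifespan_years * service.confidence


def rank_for_design_effort(services: List[Service]) -> List[Service]:
    """Order services from most to least deserving of up-front design."""
    return sorted(services, key=design_priority, reverse=True)


services = [
    Service("web-ui", expected_lifespan_years=1, confidence=0.5),
    Service("sms-delivery", expected_lifespan_years=5, confidence=0.9),
    Service("email-delivery", expected_lifespan_years=5, confidence=0.8),
]

for svc in rank_for_design_effort(services):
    print(f"{svc.name}: priority {design_priority(svc):.1f}")
```

In this example the messaging components rank above the user interface, mirroring the split in design effort described above.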

The challenge lies in discerning which services are which, as expectations often go awry.

5. Companies will increasingly build more flexible systems

Amid growing dependence on the cloud and software as a service (SaaS), tech companies will increasingly build systems out of small services that can be deployed dynamically.

During the 2008-2010 period, many start-ups claimed to be using microservices, but in reality they relied heavily on cloud infrastructure such as AWS and Microsoft Azure. The cloud was fast to build on, and systems built on it quickly reached a reasonable scale.

As a result, they would end up with a “start-up ball of wax”: systems from which they continually had to claw out pieces in order to scale them separately.

Now companies are paying increasing attention to scaling, deploying into different regions worldwide, and breaking up services to run on different providers. They’re creating more thoughtful microservices and being careful about taking on dependencies.

Partly, this has resulted from shifts outside the tech industry, such as increased attention to data privacy, data geolocation, and data residency.

Companies will increasingly use a model that combines cloud provider services with services they integrate into their own infrastructure and deploy elsewhere. This approach scales more effectively and provides companies with greater control.
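Data residency requirements often surface in code as a routing decision: which region should store a given customer’s data? The following is a minimal sketch of that idea. The region names, the country-to-region rules, and the default are all hypothetical and do not reflect any particular provider’s catalog or any legal requirement.

```python
from typing import Dict

# Hypothetical residency rules: ISO country code -> deployment region.
RESIDENCY_RULES: Dict[str, str] = {
    "DE": "eu-central",
    "FR": "eu-central",
    "IN": "ap-south",
}
DEFAULT_REGION = "us-east"  # hypothetical fallback region


def region_for(country_code: str,
               rules: Dict[str, str] = RESIDENCY_RULES,
               default: str = DEFAULT_REGION) -> str:
    """Pick the region where a customer's data should be stored."""
    return rules.get(country_code.strip().upper(), default)


for code in ("de", "IN", "BR"):
    print(code, "->", region_for(code))
```

Keeping the rules in data rather than scattered through the code makes it easier to add a region, or to move a service to a different provider, without touching the services themselves.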

6. In technical interviews, give real examples and explain the thinking

Interviews at Twilio included questions assessing how well candidates understood System Design fundamentals and how well they could write code.

But interviewers also drilled into candidates’ experience working on systems, their reasoning behind decisions, and what they’d learned from their choices. Interviewers would continue to question candidates to determine whether they had truly done the work, how well they understood scale limitations, and what they would do differently in the future.

“You can pretty soon tell whether someone understands the higher-level abstraction and whether the underlying ‘how this works’ is somewhat understood.” (Mark Gilbert)

For more experienced candidates, interviewers would look for experience designing and redesigning for scale. Improving the scalability of existing services is a considerably harder problem, and one that engineers tend to face as they gain experience.

“The black-belt ninja moves are taking large systems in flight and trying to scale those more.” (Mark Gilbert)

For candidates with fewer years of experience, questions relating to Twilio’s values might be more critical. For example:

Are you adding positive value?
Are you curious?
Are you learning?

Even candidates fresh out of school or self-taught would get questions about what they had built and how it had worked. Regardless of their experience, candidates who could provide concrete examples and explain their thinking stood to perform better.

The lessons from interviews at Twilio apply more generally:

Show up to interviews prepared to discuss not only the projects you worked on, but also your thought process and the lessons you learned from them.
Bring up only projects in which you played a significant role and that you understand thoroughly.
For the System Design Interview, be prepared to devise solutions to hypothetical problems.
Study the company’s values beforehand and be prepared to demonstrate alignment with them.

Conclusion

The six takeaways from Twilio’s experience emphasize one central theme: thinking strategically about System Design has become vital in the modern tech landscape, from small start-ups to FAANG companies.