Countdown and Launch
Learn what can happen at the launch of a site: unforeseen circumstances, Conway's law, production configurations, and production deployment structure.
After years of work, the day of launch finally arrived. I had joined this huge team (more than three hundred in total) nine months earlier to help build a complete replacement for a retailer’s online store, content management, customer service, and order-processing systems. Destined to be the company’s backbone for the next ten years, it was already more than a year late when I joined the team. For the previous nine months, I had been in crunch mode: taking lunches at my desk and working late into the night. A Minnesota winter will test your soul even in the best of times. Dawn rises late, and dusk falls early. None of us had seen the sun for months. It often felt like an inescapable Orwellian nightmare. We had crunched through spring, the only season worth living here for.
One night I went to sleep in winter, and the next time I looked around, I realized summer had arrived. After nine months, I was still one of the new guys. Some of the development teams had crunched for more than a year. They had eaten lunches and dinners brought in by the client every day of the week.
Countdown and launch
We’d had at least six different “official” launch dates. Three months of load testing and emergency code changes. Two whole management teams. Three targets for the required user load level (each revised downward). Today, however, was the day of triumph. All the toil and frustration, the forgotten friends, and the divorces were going to fade away after we launched.
The marketing team, many of whom hadn’t been seen since the last of the requirements-gathering meetings two years earlier, gathered in a grand conference room for the launch ceremony, with champagne to follow. The technologists who had turned their vague and ill-specified dreams into reality gathered around a wall full of laptops and monitors that we set up to watch the health of the site.
Site in production
At 9 a.m., the program manager hit the big red button. The new site appeared like magic on the big screen in the grand conference room. The new site was live and in production.
Of course, the real change had been initiated by the content delivery network (CDN). A scheduled update to their metadata was set to roll out across their network at 9 a.m. central time. The change would propagate across the CDN’s network of servers, taking about eight minutes to be effective worldwide. We expected to see traffic ramping up on the new servers starting at about 9:05 a.m. The browser in the conference room was configured to bypass the CDN and hit the site directly, going straight to what the CDN called the “origin servers.” In fact, we could immediately see the new traffic coming into the site.
By 9:05 a.m., we already had 10,000 sessions active on the servers.
At 9:10 a.m., more than 50,000 sessions were active on the site.
By 9:30 a.m., 250,000 sessions were active on the site. Then the site crashed. We really put the “bang” in the “big bang” release.
Aiming for quality assurance
To understand why the site crashed so badly, so quickly, we must take a brief look back at the three years leading up to that point. It’s rare to see such a greenfield project, for a number of good reasons. For starters, there’s no such thing as a website project. Every one is really an enterprise integration project with an HTML interface. Most are an API layer over the top of back-end services. This project, though, came from the heyday of the monolithic “web site” built on a commerce suite. It did 100 percent server-side rendering.
When the back end is being developed along with the front end, we might think the result would be a cleaner, better, tighter integration. It’s possible that could happen, but it doesn’t come automatically; it depends on Conway’s law. The more common result is that both sides of the integration end up aiming at a moving target.
In a Datamation article in 1968, Melvin Conway described a sociological phenomenon: “Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations.” It is sometimes stated colloquially as, “If you have four teams working on a compiler, you will get a four-pass compiler.”
Although this sounds like a Dilbert cartoon, it actually stems from a serious, cogent analysis of a particular dynamic that occurs during software design. For an interface to be built within or between systems, Conway argues, two people must, in some fashion, communicate about the specification for that interface. If the communication does not occur, the interface cannot be built. Note that Conway refers to the “communication structure” of the organization. This is usually not the same as the formal structure of the organization. If two developers embedded in different departments are able to communicate directly, that communication will be mirrored in one or more interfaces within the system.
I’ve since found Conway’s law useful in a prescriptive mode, creating the communication structure that I wanted the software to embody, and in a descriptive mode, mapping the structure of the software to help understand the real communication structure of the organization. Conway’s original article is available on his website.
Replacing the entire commerce stack at once also brings a significant amount of technical risk. If the system is not built with stability patterns, it probably follows a typical tightly coupled architecture. In such a system, any single component failing can take down the whole, so the overall probability of system failure is the probability that at least one component fails.
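The arithmetic here is worth making concrete. In a tightly coupled system, a request succeeds only if every component it touches succeeds, so individually small failure probabilities compound. A minimal sketch, using hypothetical per-component probabilities chosen purely for illustration:

```python
# Hypothetical failure probabilities for each component a request touches.
component_failure_probs = [0.01, 0.02, 0.005, 0.01]

# The request survives only if every component works.
survival = 1.0
for p in component_failure_probs:
    survival *= (1.0 - p)

system_failure_prob = 1.0 - survival
print(f"overall failure probability: {system_failure_prob:.4f}")
```

Even though each component above is at least 99.5 percent reliable on its own, the coupled whole fails more than 4 percent of the time. Stability patterns such as bulkheads and circuit breakers exist precisely to stop these probabilities from multiplying.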
Passing QA vs. production
Even if the system is built with the stability patterns (this one wasn’t), a completely new stack means that nobody can be sure how it’ll run in production. Capacity, stability, control, and adaptability are all giant question marks. Early in my time on the project, I realized that the development teams were building everything to pass testing, not to run in production. Across the fifteen applications and more than 500 integration points, every single configuration file was written for the integration-testing environment. Hostnames, port numbers, database passwords: all were scattered across thousands of configuration files. Worse yet, some of the components in the applications assumed the QA topology, which we knew would not match the production environment.
For example, production would have additional firewalls not present in QA. (This is a common “penny-wise, pound-foolish” decision that saves a few thousand dollars on network gear but costs more in downtime and failed deployments). Furthermore, in QA, some applications had just one instance that would have several clustered instances in production. In many ways, the testing environment also reflected outdated ideas about the system architecture that everyone “just knew” would be different in production. The barrier to change in the test environment was high enough, however, that most of the development team chose to ignore the discrepancies rather than lose one or two weeks of their daily build-deploy-test cycles.
When I started asking about production configurations, I thought it was just a problem of finding the person or people who had already figured these issues out. I asked the questions:
“What source control repository are the production configurations checked into?”
“Who can tell me what properties need to be overridden in production?”
Sometimes when we ask questions but don’t get answers, it means nobody knows the answers.