Searching

We were dealing with a retailer’s primary online brand. It had a huge catalog: half a million SKUs in 100 different categories. For that brand, search wasn’t just helpful. It was necessary. A dozen search engines sitting behind a hardware load balancer handled holiday traffic. The application servers would connect to a virtual IP address instead of specific search engines (see Migratory Virtual IP Addresses for more about load balancing and virtual IP addresses). The load balancer then distributed the application servers’ queries out to the search engines. It also performed health checks to discover which search engines were alive and responsive, so that it sent queries only to those.
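To make the health-check behavior concrete, here is a minimal Python sketch of a pool that rotates through search engines and skips any that fail a health check. The class name HealthCheckedPool, the server names, and the is_healthy callback are all hypothetical and not taken from the system in this story. A hardware load balancer would typically run its checks periodically and out of band rather than on every request, so treat this as an illustration of the idea only.

```python
from typing import Callable, Iterable, List


class HealthCheckedPool:
    """Round-robin over backends, skipping those that fail a health check."""

    def __init__(self, backends: Iterable[str], is_healthy: Callable[[str], bool]):
        self._backends: List[str] = list(backends)
        self._is_healthy = is_healthy
        self._next = 0  # index of the backend to try first on the next pick

    def pick(self) -> str:
        # Try each backend at most once, starting where the last pick left off.
        count = len(self._backends)
        for offset in range(count):
            index = (self._next + offset) % count
            candidate = self._backends[index]
            if self._is_healthy(candidate):
                self._next = (index + 1) % count
                return candidate
        # Every backend failed its check: the whole layer is down.
        raise RuntimeError("no healthy backends")


# Hypothetical demo: search-2 has gone dark, so its share goes to the others.
down = {"search-2"}
pool = HealthCheckedPool(
    ["search-1", "search-2", "search-3"],
    is_healthy=lambda name: name not in down,
)
picks = [pool.pick() for _ in range(4)]
print(picks)
```

Notice that the surviving servers now receive every query the dead one would have handled. That redistribution keeps the site up, but it is also what feeds a chain reaction.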

Those health checks turned out to be useful. The search engine had some bug that caused a memory leak. Under regular traffic (not a holiday season), the search engines would start to go dark right around noon. Because each engine had been taking the same proportion of load throughout the morning, they would all crash at about the same time. As each search engine went dark, the load balancer would send its share of the queries to the remaining servers, causing them to run out of memory even faster. When we looked at a chart of their “last response” timestamps, we could very clearly see an accelerating pattern of crashes. The gap between the first crash and the second would be five or six minutes. Between the second and third would be just three or four minutes. The last two would go down within seconds of each other. This particular system also suffered from cascading failures and blocked threads. Losing the last search server caused the entire front end to lock up completely. Until we got an effective patch from the vendor (which took months), we had to follow a daily regime of restarts that bracketed the peak hours: 11 am, 4 pm, and 9 pm.
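The accelerating pattern of crashes can be reproduced with a toy model. The sketch below assumes that every request leaks a fixed amount of memory, that traffic is split evenly across the live servers, and that each server starts with slightly different memory headroom. The function name and all of the numbers are made up for illustration. They are not measurements from the system described here.

```python
from typing import List, Sequence


def simulate_chain_reaction(
    headroom_mb: Sequence[float],
    requests_per_min: float,
    leak_mb_per_request: float,
) -> List[float]:
    """Return the crash time in minutes of each server, in crash order.

    Assumes an even split of traffic and a fixed leak per request.
    """
    total_leak_per_min = requests_per_min * leak_mb_per_request
    remaining = sorted(headroom_mb)  # least headroom crashes first
    now = 0.0
    crash_times = []
    while remaining:
        live = len(remaining)
        # Each live server leaks total_leak_per_min / live MB per minute,
        # so the weakest one lasts headroom * live / total_leak_per_min.
        now += remaining[0] * live / total_leak_per_min
        crash_times.append(now)
        # The survivors leaked the same amount over that interval.
        leaked = remaining[0]
        remaining = [r - leaked for r in remaining[1:]]
    return crash_times


# Hypothetical numbers: four servers, 40 requests/min, 0.1 MB leaked per request.
crash_times = simulate_chain_reaction([100, 104, 108, 112], 40, 0.1)
gaps = [later - earlier for earlier, later in zip(crash_times, crash_times[1:])]
print("crash times (min):", crash_times)
print("gaps between crashes (min):", gaps)
```

With these numbers, the first server lasts 100 minutes, and the gaps between later crashes shrink to roughly 3, 2, and 1 minutes. Each death raises the per-server load on the survivors, which is the same shape as the “last response” chart described above.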

Tips to remember

One server going down jeopardizes the rest

A chain reaction happens because the death of one server makes the others pick up the slack. The increased load makes them more likely to fail. A chain reaction will quickly bring an entire layer down. Other layers that depend on it must protect themselves, or they will go down in a cascading failure.

Hunt for resource leaks

Most of the time, a chain reaction happens when our application has a memory leak. As one server runs out of memory and goes down, the other servers pick up the dead one’s burden. The increased traffic means they leak memory faster.
