Production Issues

Learn about production outages that may disrupt the system and corresponding approaches to resolve them.

Overview

In this lesson, we'll look at how complex production issues can be and how to approach them, and then discuss security concerns within an application.

Identifying and resolving a production issue

At some point, you might encounter a partial or complete production outage of a system you're supporting. Start by determining the scope and the source of the problem. Don't assume that the most recent deployment caused the outage; it may be unrelated.

Don't start by restarting everything, as this can make things worse. First look for outages in related systems:

  • Does your vendor have an outage?

  • Are other customers of your vendor down too?

Your vendor may not be aware of the issue yet.

Once you have identified a fix and tested it against the live system, deploying it may require a subset of the unit tests to be temporarily ignored (don’t do this in a regulated environment). This is acceptable if you can manually test the affected behavior. After the system is back up, restore the ignored tests and consider what could be added to or changed in the system to make future failures easier to handle.
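One way to ignore tests temporarily without losing track of them is to tie the skip to an explicit flag and a written reason. The following is a minimal sketch using Python's standard `unittest` module. The `INCIDENT_HOTFIX` environment variable, the `skip_during_incident` helper, and the `apply_discount` function are all hypothetical names chosen for illustration.

```python
import os
import unittest

# Hypothetical flag: set INCIDENT_HOTFIX=1 only while the emergency fix ships.
INCIDENT_MODE = os.environ.get("INCIDENT_HOTFIX") == "1"


def skip_during_incident(reason):
    """Skip a test only while the incident flag is set, and record why."""
    return unittest.skipIf(
        INCIDENT_MODE, f"temporarily ignored during incident: {reason}"
    )


def apply_discount(price, percent):
    # Stand-in for the production code under test.
    return round(price * (1 - percent / 100), 2)


class DiscountTests(unittest.TestCase):
    def test_basic_discount(self):
        # Always runs, even during an incident.
        self.assertEqual(apply_discount(100.0, 10), 90.0)

    @skip_during_incident("rounding behavior verified manually for the hotfix")
    def test_rounding(self):
        # Skipped only when INCIDENT_HOTFIX=1; runs normally otherwise.
        self.assertEqual(apply_discount(19.99, 15), 16.99)
```

Running the suite with `python -m unittest` reports the skipped test together with its reason, so the list of ignored tests is visible in the test output. Unsetting the flag restores every test at once, which makes the "restore the ignored tests" step hard to forget.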

Note: It's important that you be able to detect a problem yourself, without relying on your customers to tell you.
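A simple way to find out about problems before customers do is to probe the system and its dependencies on a schedule. The sketch below assumes each dependency exposes some HTTP endpoint that returns a 2xx status when healthy. The function names and any URLs passed in are hypothetical, and a real setup would run this from a scheduler and send alerts rather than just return a list.

```python
import http.client
import urllib.request
from typing import Callable, Dict, List


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # Connection errors, HTTP error statuses, timeouts, and malformed
        # URLs or responses all count as "not healthy".
        return False


def check_all(probes: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every probe and map each name to True (healthy) or False."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that crashes is treated as a failing dependency.
            results[name] = False
    return results


def failing(results: Dict[str, bool]) -> List[str]:
    """Return the names of unhealthy dependencies in alphabetical order."""
    return sorted(name for name, ok in results.items() if not ok)
```

A caller might register probes such as `{"payments-vendor": lambda: http_probe(vendor_health_url)}`, where `vendor_health_url` is whatever status endpoint the vendor actually provides. Probing your own system and your vendors separately also helps answer the earlier question of whether the outage is yours or theirs.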

When investigating a failed system, you might find many errors that are unrelated to the outage. Record them carefully and fix them once the crisis is over. Even so, cleaning up noisy logging during a crisis can be a worthwhile exercise, because quieter logs make it easier to determine when the system is healthy again.
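To record the errors found during an investigation, it can help to group log lines by a normalized "signature" and count them. The sketch below assumes a hypothetical log format in which each line contains a level such as `ERROR` or `CRITICAL` followed by a message. Numbers in the message are replaced with `<n>` so that lines differing only in IDs or durations are counted together.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Assumed format: "<timestamp> <LEVEL> <message>"; adjust to your own logs.
ERROR_LINE = re.compile(r"\b(ERROR|CRITICAL)\b\s+(?P<message>.*)")
VARIABLE_PARTS = re.compile(r"\d+")


def error_signature(line: str) -> Optional[str]:
    """Return a normalized error message, or None for non-error lines."""
    match = ERROR_LINE.search(line)
    if match is None:
        return None
    # Replace runs of digits so similar errors share one signature.
    return VARIABLE_PARTS.sub("<n>", match.group("message")).strip()


def summarize_errors(lines: Iterable[str]) -> Counter:
    """Count how often each error signature appears in the given lines."""
    counts = Counter()
    for line in lines:
        signature = error_signature(line)
        if signature is not None:
            counts[signature] += 1
    return counts


sample_log = [
    "2024-01-01T10:00:00 INFO request 17 ok",
    "2024-01-01T10:00:01 ERROR timeout calling billing after 3000 ms",
    "2024-01-01T10:00:02 ERROR timeout calling billing after 3012 ms",
    "2024-01-01T10:00:03 ERROR cache miss for user 42",
]

for signature, count in summarize_errors(sample_log).most_common():
    print(f"{count:>4}  {signature}")
```

Comparing this summary before and after a fix gives a rough picture of whether the system is returning to its normal error baseline, and the less frequent signatures form a ready-made list of unrelated issues to revisit after the crisis.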
