Hunting For Clues

Follow the examination of the database cluster and RAID configurations, the application server configurations, and the Java thread dumps in the search for the root cause of the airline incident.

Checking database and RAID configurations

In the morning, fortified with quarts of coffee, I dug into the database cluster and RAID configurations. I was looking for common problems with clusters: not enough heartbeats, heartbeats going through switches that carry production traffic, servers set to use physical IP addresses instead of the virtual address, bad dependencies among managed packages, and so on. At that time, I didn’t carry a checklist. These were just problems that I’d seen more than once or heard about through the grapevine. I found nothing wrong. The engineering team had done a great job with the database cluster. In fact, some of the scripts appeared to be taken directly from Veritas’s own training materials.

Checking application server configurations

Next, it was time to move on to the application servers’ configuration. The local engineers had made copies of all the log files from the kiosk application servers during the outage. I was also able to get log files from the CF application servers. Those servers still had their logs from the time of the outage, since it had happened just the day before. Better still, thread dumps were available in both sets of log files.
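To give a sense of what reading thread dumps out of a log file can involve, here is a minimal sketch that tallies threads by state. It assumes the HotSpot-style text format, in which each thread entry starts with a quoted thread name and is followed by a `java.lang.Thread.State:` line; older JVMs may print something different. The function name `count_thread_states`, the sample dump, and the `com.example` class names are all hypothetical and made up for illustration. None of them come from the incident's actual logs.

```python
import re
from collections import Counter

# Header line of a thread entry, e.g.
# "worker-1" prio=5 tid=0x01 nid=0x11 waiting for monitor entry
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
# State line printed under each header (assumed HotSpot-style format)
THREAD_STATE = re.compile(r'^\s+java\.lang\.Thread\.State: (?P<state>[A-Z_]+)')


def count_thread_states(dump_text):
    """Return a Counter mapping thread state to number of threads."""
    counts = Counter()
    current = None  # name of the thread entry being read, if any
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            current = header.group("name")
            continue
        state = THREAD_STATE.match(line)
        if state and current is not None:
            counts[state.group("state")] += 1
            current = None  # count at most one state per thread entry
    return counts


# Hypothetical sample data, invented for this sketch.
SAMPLE_DUMP = '''\
"worker-1" prio=5 tid=0x01 nid=0x11 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
    at com.example.Pool.take(Pool.java:42)

"worker-2" prio=5 tid=0x02 nid=0x12 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
    at com.example.Pool.take(Pool.java:42)

"main" prio=5 tid=0x03 nid=0x13 runnable
   java.lang.Thread.State: RUNNABLE
    at com.example.Main.run(Main.java:10)
'''

if __name__ == "__main__":
    for state, n in count_thread_states(SAMPLE_DUMP).most_common():
        print(state, n)
```

A tally like this is only a first pass. When many threads are blocked at the same stack frame, that frame is usually the next place to look.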
