Supervision Trees
Explore how to test supervision trees in Elixir/OTP by combining black-box testing of components with exploratory manual testing that simulates failures. Understand why automated testing of supervision trees is challenging and how manually terminating processes and observing the results can increase confidence in an application’s resiliency.
Supervisors are one of the strongest selling points of Erlang/Elixir and the OTP set of abstractions. They allow us to structure the life cycle of the processes inside our application in a resilient way, which makes isolating failures a breeze.
However, supervisors are also among the toughest things to test that we’ve come across. Their main job is to let our application recover from complex and cascading failures, and these types of failures are hard to trigger on purpose during testing.
Imagine a complex, deep supervision tree. Now imagine that a child in one corner of the tree starts crashing and doesn’t recover when it’s restarted on its own. OTP handles this beautifully: once the child exceeds its parent supervisor’s restart limit, that supervisor shuts down its remaining children and terminates itself, so its own parent restarts it along with all of its children. If that doesn’t solve the problem, the failure keeps propagating upward until restarting enough of the application fixes it (or, if the problem is severe, until the whole thing crashes).
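The escalation described above can be watched in a small experiment. The following is a minimal sketch, not code from this lesson: the module names Demo.Worker and Demo.Escalation, the helper kill_worker_and_wait/1, and the limit of 2 restarts in 5 seconds are all hypothetical and chosen only for illustration. It starts a supervisor with a deliberately low restart limit, kills its only child by hand, and reports whether the child came back or the supervisor itself gave up.

```elixir
defmodule Demo.Worker do
  # Hypothetical worker: an Agent registered under its module name
  # so we can look it up again after each restart.
  use Agent

  def start_link(_opts) do
    Agent.start_link(fn -> :ok end, name: __MODULE__)
  end
end

defmodule Demo.Escalation do
  # Starts a supervisor that tolerates at most 2 restarts in 5 seconds
  # (illustrative values, not a recommendation).
  def start_supervisor do
    {:ok, sup} =
      Supervisor.start_link([Demo.Worker],
        strategy: :one_for_one,
        max_restarts: 2,
        max_seconds: 5
      )

    # Unlink so that the supervisor shutting down does not also
    # take down the process running this experiment.
    Process.unlink(sup)
    {:ok, sup}
  end

  # Kills the current worker, then polls until either a new worker
  # is registered or the supervisor itself has terminated.
  def kill_worker_and_wait(sup) do
    old = Process.whereis(Demo.Worker)
    Process.exit(old, :kill)
    wait(sup, old, 100)
  end

  defp wait(_sup, _old, 0), do: :timeout

  defp wait(sup, old, attempts) do
    new = Process.whereis(Demo.Worker)

    cond do
      not Process.alive?(sup) ->
        :supervisor_down

      is_pid(new) and new != old ->
        {:restarted, new}

      true ->
        Process.sleep(10)
        wait(sup, old, attempts - 1)
    end
  end
end
```

In an IEx session, calling Demo.Escalation.start_supervisor() and then Demo.Escalation.kill_worker_and_wait(sup) three times in quick succession should return {:restarted, pid} for the first two kills and :supervisor_down for the third, because the third restart within the time window exceeds the limit. In a real tree, that supervisor's own parent would then decide whether to restart it.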
Well, how do we test this behavior? Do we even want to extensively test it?
Testing the behavior of supervision trees
It’s hard to inject a failure during testing that isn’t solved by a simple restart, but that also doesn’t bring ...