I had the good fortune of attending a workshop about responding to production incidents, led by the folks behind Blackrock 3. Over several posts, I plan to share what I learned with the community at large and to apply it within the Cloud Foundry team. We’re going live in the very near future, and we are taking incident management and response very seriously.
One of the most important things I took away from the workshop was an understanding of the attitude and mindset necessary to even begin to handle an incident.
A running analogy in the workshop was that your job (if not your business!) actually operates in two “modes”: peacetime and wartime.
Peacetime operations are what most of us in software development are used to: business as usual, relatively low pressure and stress, ample time to make difficult decisions, etc.
When it comes to incident response, that business-as-usual attitude needs to change immediately. You will be working under pressure; you will have to make difficult decisions, choosing from non-optimal options. There isn’t time for panic, and there isn’t time to procrastinate. You’re at war now.
As the primary responder to an incident, you will carry a burden. It’s not just your immediate team that has entrusted you with the responsibility of managing the incident, but also your entire organization, your customers, and your company’s stakeholders. You will need to triage incidents quickly. If you can’t solve the problem on your own, you need to be able to immediately contact the right people to assist you. You will need to follow your incident response plan.
You should have buy-in from the whole organization, from the top down, for your incident response plan. Incidents that are routine should be ones you can handle with minimal effort. Your plan will not cover everything that will ever come up, but it should be flexible and robust enough to ensure that small problems don’t grow into catastrophes because of how they were handled.
Incidents will occur, and not all of them will be handled smoothly. Your team and organization need a culture of trust. Responders must not be afraid of the consequences of full disclosure, and they must not be paralyzed by fear of the consequences of making a poor decision during an incident response. You need to be able to reflect objectively on how you handled an incident so that you can handle future incidents even better.
Not everyone will share my point of view, but I’m excited to fill this role on the Cloud Foundry team. I’ve been coding for something over a decade now, and while it’s always been fulfilling in its own way, the biggest adrenaline rush I’ve ever had on the clock was an unannounced fire drill. That’s going to change now; it feels like moving beyond just a desk job. Work is going to involve some excitement, fear, and panic, and I’m ready for it.