By Nick Floyd
Prior to coming to New Relic, I did very little thinking on the topic of game-day testing. I’d always thought it was the type of thing you did when you were too lazy to write automated tests, or when the dev process was so broken and threw so much randomness at the MVP (minimum viable product) that it was the only way you could get confidence in what you were shipping.
I was wrong.
The fact is, you’ll never be 100% certain what your system or processes will do under various scenarios. I ignorantly assumed there was little value in the practice, right up until I needed it.
The idea of game-day testing is simple (although definitions may vary):
- Throw good and broken things at your system and see what happens.
- Take what you learned and make your team and your system better.
Or more simply put: It’s the kill -9 of testing.
I think one of the major problems with “game-day” testing is its name; it sounds like it should always be done at a fixed time or using a prescribed format that’s always run by human hands. Of course, the good folks at Netflix have dispelled that idea with their Chaos Monkey and the rest of their Simian Army, tools that purposely wreck things on a regular basis, helping the software (and the developer teams) over at Netflix get seriously strong.
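To make that idea a little more concrete, here’s a minimal, hypothetical sketch of the kind of automated fault injection a Chaos Monkey-style tool performs. The host names, service name, and SSH setup are placeholders, not anything Netflix actually runs; point something like this at staging, never production.

```python
"""A toy chaos monkey: randomly stop one service instance in staging.

Sketch only -- the hosts and service name below are placeholders, and it
assumes passwordless SSH access to the staging boxes.
"""
import random
import subprocess

STAGING_HOSTS = ["staging-web-1", "staging-web-2", "staging-worker-1"]  # placeholders
SERVICE = "my-app"  # placeholder service name


def wreck_something():
    """Pick a random staging host and stop the service on it."""
    victim = random.choice(STAGING_HOSTS)
    print(f"Chaos time: stopping {SERVICE} on {victim}")
    subprocess.run(["ssh", victim, "sudo", "systemctl", "stop", SERVICE], check=True)


if __name__ == "__main__":
    wreck_something()
```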
Before you test
If you are thinking about trying out game-day testing in your own organization, keep these important considerations in mind before you begin:
- Game-day testing can be done collectively or individually (though it tends to be more productive when done in a group).
- Get people outside of the team to help you create different scenarios. You can’t even begin to imagine the things developers come up with when they hear “break my system.”
- Many of the things included in your scenarios are, or can be, automated, but doing things manually on occasion is a healthy way to better understand the current state of your system.
- Be random. By that I mean choose a few of the scenarios and “mix them up” instead of just running through a script (see the quick sketch after this list).
- If working in a group, whiteboard what you think will happen when you execute a scenario. And if your expectations do not match your results, dig into why that might be.
- Notify, notify, notify. Let everyone who might get alerts or notifications about this system experiencing trouble know ahead of time that you are running tests.
- If you encounter surprises or bugs during your test, be sure to either fix them on the spot or document the steps to reproduce them so you can work on them later.
- All artifacts in your game-day “system” (documents, scripts, users, and systems) should be organic and updated often.
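If “be random” sounds hard to enforce, even a tiny script can keep you honest. This is just a sketch; the scenario names are placeholders for whatever is in your own game-day doc.

```python
"""Pick a random handful of game-day scenarios so runs don't become a rote script."""
import random

# Placeholder scenario names -- swap in the scenarios from your own game-day doc.
SCENARIOS = [
    "kill the primary database",
    "fill the disk on a web host",
    "revoke the API credentials",
    "drop outbound network traffic",
    "turn off web jobs",
]


def pick_scenarios(count=3):
    """Return `count` scenarios in a random order."""
    return random.sample(SCENARIOS, k=min(count, len(SCENARIOS)))


if __name__ == "__main__":
    for step, scenario in enumerate(pick_scenarios(), start=1):
        print(f"{step}. {scenario}")
```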
Writing up a basic game-day testing scenario
- Details. What the test is and why you are running it (generally just a paragraph).
- Where. Details on where the game-day testing will take place. You can list things like servers or application URIs here.
- The process. Details such as how frequently game-day testing should take place and the steps needed to execute a game-day scenario. Provide a link to your “mock” incident game-day doc. Also, call out if this is a good ad-hoc scenario (that is, a scenario that can be run outside of the context of a normal game-day test).
- The important stuff. Here I call out the exceptional. If there were, say, three things that were “musts” for the test to run, or that were critical to the business, document what they are and how you can game-day them.
- The unexpected. This is my favorite section. This is where I call out how to create unpredictable events in my system. These can be automated (like Netflix’s Chaos Monkey) or manual, such as “turn off web jobs.”
- What to do with issues found. This is a great section to focus on process. Here you can detail the actions to take when the system reports unexpected results, such as your bug-reporting workflow or hotfix process. This is an important section because you want to call out that the game-day process should create artifacts you can act on.
- Holes and future changes. Consider this your informal backlog. I keep it with the game-day doc because I want to call out “Hey, we’re not covered here!” or “I really think we can improve here.” This way, pain points are always top of mind.
Schrödinger’s cat: wanted dead or alive
Finally, the super fun stuff. What could be more fun than breaking your software? Breaking your processes, of course!
My team was recently involved in a big project with many interactions both inside and outside New Relic. During the course of this project there were a lot of changes: people working on the project came and went, priorities shifted, and we had a lot of ups and downs with external communications. These challenges were hardly unique to our team, but we wanted to face them head on. After a series of process failures, we asked ourselves: “How are we going to know if our processes are healthy if we never consistently test them?”
So we began building and doing things like:
- We built Raspberry Pi beacons that would report to New Relic Insights on the health of our sub-systems (each uniquely named after one of the Flatliners: Kevin Beacon, Julia Robots, Oliver Platt-a-pi, and others); a minimal sketch of what such a beacon does appears after this list.
- We started iterating on the ill-tempered sea bass, which would create mock PagerDuty alerts to let us know if our PagerDuty scheduling was working (also sketched after this list).
- We generated mock “game-day outages.”* When new people went on call, we’d have the system cause a catastrophic event in our staging environment. The fun part was that they would never see it coming, so it stressed our processes, challenged the new person to get up to speed, and trained them to more calmly handle real outages.
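For the curious, a beacon like that boils down to very little code. Here’s a rough sketch that posts a health heartbeat as a custom event, assuming the New Relic Insights insert API; the account ID, insert key, event type, and field names are placeholders, not our actual Raspberry Pi code.

```python
"""A minimal 'beacon': post a health heartbeat as a custom event to New Relic Insights.

Sketch only -- account ID, insert key, and event fields below are placeholders.
"""
import json
import urllib.request

ACCOUNT_ID = "1234567"          # placeholder New Relic account ID
INSERT_KEY = "YOUR_INSERT_KEY"  # placeholder Insights insert key


def send_heartbeat(beacon_name, status):
    """Send one SubsystemHealth event describing how a sub-system is doing."""
    event = {"eventType": "SubsystemHealth", "beacon": beacon_name, "status": status}
    request = urllib.request.Request(
        f"https://insights-collector.newrelic.com/v1/accounts/{ACCOUNT_ID}/events",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Insert-Key": INSERT_KEY},
    )
    with urllib.request.urlopen(request) as response:
        return response.status  # 200 means Insights accepted the event


if __name__ == "__main__":
    send_heartbeat("Kevin Beacon", "ok")
```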
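The ill-tempered sea bass is similar in spirit: something that fires a mock alert so you can confirm the right person actually gets paged. Here’s a hypothetical sketch assuming PagerDuty’s Events API v2; the routing key and source name are placeholders.

```python
"""Fire a mock PagerDuty alert to verify the on-call schedule actually pages someone.

Sketch only -- assumes PagerDuty's Events API v2; the routing key is a placeholder.
"""
import json
import urllib.request

ROUTING_KEY = "YOUR_PAGERDUTY_ROUTING_KEY"  # placeholder integration/routing key


def trigger_mock_alert(summary):
    """Trigger a low-severity test alert and return PagerDuty's response."""
    payload = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": "game-day-sea-bass",  # hypothetical source name
            "severity": "warning",
        },
    }
    request = urllib.request.Request(
        "https://events.pagerduty.com/v2/enqueue",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)  # includes a dedup_key if the alert was accepted


if __name__ == "__main__":
    trigger_mock_alert("[GAME DAY] Mock alert: is the on-call schedule working?")
```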
I think the most important aspect of game-day testing is that you actually do it. When you run these tests, your app might survive or it might not; like Schrödinger’s cat, you won’t know until you “open the box” after the run. Game-day testing is meant to prevent future misery and, more important, to make sure that you, the software developer, take the hits before your users do.
So what do you and your teams do to test the inherently unpredictable nature of software mixed with people and processes?
*A note on the mock game-day outages: While this seems like an unkind thing to do, we definitely would let our people know that some aspect of the ecosystem was going to fail in a given week. The main idea behind this was to not only prepare our team for potential outages but also to fine-tune our processes so they were streamlined and easy to follow.