Resilience testing
- Intentionally cause problems during the work day and see how the tools and the team react.
- Randomly kill processes and compute servers in production to see how the monitoring system and the whole team react.
- Do this often during work hours to reduce the risk of such failures happening at night.
- Fix any issues. Learn.
- Example tool: Netflix Chaos Monkey, which randomly terminates production instances.
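
The idea in the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not how Chaos Monkey itself is implemented. The 9:00 to 17:00 Monday to Friday window, the names `in_work_hours`, `pick_victim`, and `run_once`, and the `kill` callback are all assumptions made for the example. In a real setup `kill` might wrap `os.kill(pid, signal.SIGTERM)` or a cloud provider's terminate call.

```python
import random
from datetime import datetime, time

# Assumed working hours; adjust to the team's actual schedule.
WORK_START = time(9, 0)
WORK_END = time(17, 0)


def in_work_hours(now):
    """True on Monday to Friday between WORK_START and WORK_END."""
    return now.weekday() < 5 and WORK_START <= now.time() < WORK_END


def pick_victim(targets, now, rng, probability=1.0):
    """Return one randomly chosen target, or None if nothing should be killed."""
    if not targets or not in_work_hours(now):
        return None  # never cause failures outside work hours
    if rng.random() >= probability:
        return None  # skip this round
    return rng.choice(list(targets))


def run_once(targets, kill, now=None, rng=None, probability=1.0):
    """Pick at most one victim and pass it to the caller-supplied kill callback."""
    now = now or datetime.now()
    rng = rng or random.Random()
    victim = pick_victim(targets, now, rng, probability)
    if victim is not None:
        kill(victim)
    return victim


if __name__ == "__main__":
    # Hypothetical target names; the "kill" here only prints.
    services = ["web-1", "web-2", "worker-1"]
    run_once(services, kill=lambda name: print("would kill", name))
```

Passing `kill` as a callback keeps the selection logic testable without touching real processes, and the work-hours check ensures the experiment only runs while people are available to respond.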