As engineers, we expect our systems and applications to be reliable, and we often test for that at a small scale or in development. But as you scale up and your infrastructure footprint grows, the assumption that conditions will remain stable no longer holds. Reliability at scale does not mean eliminating failure; failure is inevitable. How can we get ahead of these failures, and do so continuously?
Ana Margarita Medina, a Staff Developer Advocate at Lightstep, and Darko talk about all things SRE (Site Reliability Engineering) and DevOps. They discuss the finer points of both and the differences between them.
How does your team prepare for failure and learn from incidents? GameDays are a time to come together as a team and organization to explore failure and learn from it. The practice has been adopted across industries, from fire departments to technology companies.
Chaos Engineering lets you compare what you think will happen to what actually happens in your systems. You literally break things on purpose to learn how to build more reliable systems. Lenny Sharpe walks you through Chaos Engineering at Target, covering the tools and practices you need to implement Chaos Engineering with Kubernetes in your organization. Even if you’re already using Chaos Engineering, you’ll learn to identify new ways to use the practice to improve the reliability of your network and services. Ana Medina will share a demonstration of how you can practice Chaos Engineering on Kubernetes and use it to improve the reliability of your systems.