What is Chaos Engineering? | IBM

April 28, 2024

68 2 minutes read

What is Chaos Engineering? | IBM — 1714315415 cq5dam.medium.1584.1584.png

Every day creates a new opportunity for an organization’s critical application or infrastructure to fail, potentially threatening its ability to deliver services to customers. Causes of failure can vary between several issues, including security breaches, misconfigurations or service disruptions. The likelihood of errors or disruptions can rise as more applications and data are hosted in the cloud, which can create an increase in security issues.

One way to address disruptions is chaos engineering. It is not a random process where engineers terminate instances or services or otherwise cause systems to fail without any purpose. This process identifies potential future issues, allowing engineering teams to solve problems proactively and avoid them in the live environment further down the road.

Chaos engineering is important because an error or disruption can slow down an organization’s momentum, expending precious time figuring out a solution on the fly as downtime increases. Netflix learned this concept firsthand when it switched from on-premises to the cloud¹(link resides outside ibm.com)-they experienced an outage that led to a three-day interruption to service delivery in 2008.

This outage predates its transformation as a video streaming operation, which would have made that outage exponentially more costly. As a result, Netflix decided that it would do everything possible to minimize disruptions and it began to introduce chaos engineering into its workflows. This process allows them to identify issues before they happen and to minimize the damage if and when an unavoidable failure occurs.

Netflix created chaos monkey² (link resides outside ibm.com), an open source tool that creates random incidents in IT services and infrastructure meant to identify weaknesses that can be fixed or addressed through automatic recovery procedures. They implemented chaos monkey when it moved from a private data center to Amazon Web Services (AWS) in response to unreliability from the cloud. Many organizations now use chaos monkey to run their chaos engineering experiments.

Chaos engineering is an important defense against infrastructure failures, outages or missing components in an organization’s production environment. It helps site reliability engineers (SREs) and other members of the DevOps team to provide continuous delivery of services by avoiding significant disruptions to their service. Chaos engineering helps them understand their vulnerabilities better and informs how to minimize the impact if a disruption occurs.

Even a small issue in code can have a catastrophic effect on the overall production environment given different program dependencies. For instance, an error in the transaction software system for a financial services firm can result in the loss of millions of dollars³(link resides outside ibm.com).

Organizations might be unable to avoid all IT incidents, but they can minimize the damage by using chaos management to understand likely scenarios and their best-possible solutions.

Source

April 28, 2024

68 2 minutes read