The new era of resiliency in the cloud
Organizations are eager to capture their fair share of the estimated $3 trillion opportunity in EBITDA lift that cloud platforms can enable. Capturing that value depends in large part on the resilience of applications running in the cloud, especially since much of the cloud value at stake comes from running mission- and business-critical workloads.
Moving to the cloud can significantly improve stability compared to on-premises environments. The cloud can offer faster recovery time, more flexibility to support resiliency, and more tools that provide sophisticated resiliency capabilities. But to capture these benefits, companies need to design, architect, and implement the right resiliency patterns to meet business and customer needs. Many of these patterns revolve around simplifying the technology to reduce technical debt, automating functions, and enabling applications to take advantage of the cloud’s capabilities.
Resiliency is critical for capturing cloud value
For a variety of reasons, including rising customer expectations, many heads of business have expressed concerns about ensuring that their mission-critical applications are always running and available. Adding to those concerns is the evolving regulatory landscape.
These concerns partly explain why fewer companies than expected have moved critical business applications to the cloud. In fact, after surveying executives and technology leaders at many of the world’s largest organizations, we found that fewer than 10 percent have successfully moved mission- and business-critical processes and workloads (tier 1) to the cloud. By confining their cloud migrations to less-critical apps, they forgo significant value, because the return on those apps is much lower than it would be on tier 1 workloads.
Addressing these issues starts with two sets of foundational analysis. The first is understanding the financial and reputational consequences of an outage for the business. Companies often significantly overvalue or undervalue these costs, which leads to uninformed decision making. The second is understanding the all-in costs of on-premises operations versus those in the cloud. The cloud can be much more cost effective because its elasticity accommodates usage surges and its pay-as-you-go model reduces costs. A cost benefit generally follows when the application and its infrastructure are well architected and properly configured and when the life cycle of the existing hardware is factored into the calculation. By shifting mission-critical workloads into the cloud at the right time (ideally before an upcoming hardware-refresh cycle), companies can avoid expensive and unnecessary capital investments in hardware replacement. Developing these insights requires strong FinOps capabilities.
Misunderstandings about resiliency
McKinsey’s cloud resiliency work over the past two years has identified common pitfalls that business leaders can face when considering application resiliency issues related to migrating and operating workloads in the cloud.
One issue that often bedevils companies is a failure to understand resiliency roles and responsibilities. Businesses and cloud service providers (CSPs) have distinct roles to play in fully realizing cloud benefits. CSPs, for example, are responsible for the reliability and security of the fundamental services of their cloud platforms (compute, network connectivity, and physical data center), while their clients are responsible for architecting the reliability, security, and overall resiliency of their workloads hosted in the cloud.
Another issue is that companies are often imprecise in their use of terms that define key concepts that underpin cloud resilience—specifically, high availability, fault tolerance, and disaster recovery. Using these terms interchangeably can increase the risk of providing faulty or inaccurate resilience requirements. Here are definitions of these key elements of resiliency:
- High availability refers to systems being available in the event of an underlying failure.
- Fault tolerance defines the system’s ability to recover from an underlying system failure—commonly achieved through system mirroring, application logic, and configuration.
- Disaster recovery refers to the ability to restore service after a failure happens; it is measured by key metrics including mean time to recovery (MTTR), recovery time objective (RTO), and recovery point objective (RPO).
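To make the three disaster-recovery metrics concrete, here is a minimal Python sketch. It assumes the usual readings of the terms: RTO as the maximum acceptable downtime, RPO as the maximum acceptable window of lost data, and MTTR as the average downtime across incidents. The `Incident` record and both function names are hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical outage record used only for this illustration."""
    failed_at: datetime          # when the service went down
    restored_at: datetime        # when the service was available again
    last_good_backup: datetime   # most recent recoverable data point before the failure


def mean_time_to_recovery(incidents):
    """MTTR: average downtime across a non-empty list of incidents."""
    total = sum((i.restored_at - i.failed_at for i in incidents), timedelta())
    return total / len(incidents)


def meets_objectives(incident, rto, rpo):
    """True if downtime stayed within the RTO and the data-loss window within the RPO."""
    downtime = incident.restored_at - incident.failed_at
    data_loss_window = incident.failed_at - incident.last_good_backup
    return downtime <= rto and data_loss_window <= rpo
```

An incident restored within the RTO can still miss the RPO if the last recoverable data point is too old, which is one reason the terms should not be used interchangeably.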
The five main actions to architect resilient cloud applications
Ensuring the appropriate level of resiliency in the cloud requires IT to take five key actions.
1. Identify and update application tiers and revisit key performance metrics
Most organizations have already defined application priority tiers and mapped applications to them. Migrating workloads to the cloud often provides an opportunity to reevaluate applications and the customer journeys they support. This is also the time to revisit the service-level objectives for each tier (RPO, RTO, and MTTR), as recovery times in the cloud may be significantly better than those achievable on-premises.
2. Map resiliency patterns applicable to the application tier
Implementing a combination of prescribed infrastructure and application patterns creates a strong foundation for resiliency. But it requires a clear assessment of whether each application’s architecture allows those patterns to meet resiliency targets. In essence, this is about grouping your workloads by how critical they are to the business (tier 1 is mission critical; tier 2 is business critical) and then determining which resiliency patterns are best suited for each. Some of those patterns will be specific to the infrastructure and others to the applications themselves.
There are seven distinct infrastructure resiliency patterns and six application architecture patterns that can be used to fortify workload resiliency. The following exhibit provides an overview of what the mapping looks like.
Each pattern has its own set of methodologies, approaches, and capability requirements. Mapping the infrastructure and application patterns needed by application tier provides a standardized approach to ensure workloads operate with consistent resiliency. Organizations need to routinely reference this resiliency map to test and validate workloads against organizational and regulatory requirements.
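A tier-to-pattern mapping of this kind can be kept as data and checked automatically. The following Python sketch shows one way to do that. The pattern names are illustrative placeholders, not the contents of the exhibit, and `REQUIRED_PATTERNS` and `missing_patterns` are hypothetical names.

```python
# Illustrative mapping of application tier to required resiliency patterns.
# The pattern names are placeholders chosen for this sketch, not the exhibit's list.
REQUIRED_PATTERNS = {
    1: {"multi-region active/active", "automated failover", "circuit breaker"},
    2: {"multi-zone deployment", "automated failover"},
}


def missing_patterns(tier, implemented):
    """Return, in sorted order, the required patterns a workload has not implemented."""
    return sorted(REQUIRED_PATTERNS[tier] - set(implemented))


# Example: a tier 1 workload that has implemented only automated failover.
gaps = missing_patterns(1, ["automated failover"])
```

Keeping the map in one place means the same check can be run routinely, for instance as part of a deployment pipeline, instead of relying on a manual review.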
3. Tailor reference architectures to application tiers
While predefined infrastructure and application resiliency patterns provide high-level guidance, it is best practice to go a step further and clearly define reference architectures. Reference architectures for each application tier serve as a guide when migrating existing workloads to the cloud. They also provide a useful way to organize greenfield workloads, which can both accelerate time to value and help map technical and business resiliency requirements.
4. Define and prioritize the right level of resiliency according to business needs
Not all resiliency efforts are equal. It helps to divide cloud workloads into two types when weighing the level of effort required to meet their respective resiliency requirements:
- Cloud-first workloads are workloads explicitly designed to function in cloud environments. They often lack the level of technical debt that burdens legacy systems. Thus, the resiliency of these applications is often easier to strengthen than that of many legacy applications.
- Cloud-eventual workloads were built on mainframes or on-premises virtual machines (VMs) without factoring in the capabilities of the cloud. These workloads will benefit from some level of modernization if migrated to the cloud, to decrease costs and increase resiliency.
This framing can help organizations understand the effort and costs associated with addressing resiliency needs and their prioritization. An organization needs to decide which applications should go into the cloud based on business requirements and configuration needs in order to achieve a base level of required resiliency. The output should be a heat map, showing cost or complexity in addressing resiliency for sets of applications based on their importance to and impact on the business. This provides a clear path forward to prioritize and sequence resiliency efforts.
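As a rough illustration of the heat map described above, the Python sketch below groups applications into cells by business impact and remediation complexity, then derives a work sequence. The 1-to-3 scoring scale, the application names, and the ordering rule (highest impact first, lowest complexity first within the same impact) are all assumptions made for this example.

```python
from collections import defaultdict


def build_heat_map(apps):
    """Group application names into (impact, complexity) cells.

    `apps` is a list of (name, impact, complexity) tuples, scored 1 (low) to 3 (high).
    """
    grid = defaultdict(list)
    for name, impact, complexity in apps:
        grid[(impact, complexity)].append(name)
    return dict(grid)


def prioritize(apps):
    """Order names by highest impact, then lowest complexity, then name (assumed rule)."""
    ranked = sorted(apps, key=lambda a: (-a[1], a[2], a[0]))
    return [name for name, _, _ in ranked]


# Hypothetical portfolio for illustration only.
portfolio = [("payments", 3, 2), ("reporting", 1, 1), ("ledger", 3, 1)]
heat_map = build_heat_map(portfolio)
sequence = prioritize(portfolio)
```

A different ordering rule, such as weighting cost more heavily than complexity, would change the sequence but not the structure of the heat map.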
5. Define a road map that starts with top-tier applications
Organizing and allocating resources to do the resiliency work requires translating the first four steps into a clear road map. The best road maps define clear milestones and KPIs to better track progress and explicitly prioritize work based on value. In practice, this often means putting resources into top-tier applications, where the business value is greatest. Operating these critical workloads with a well-architected resilient foundation provides companies with greater agility, economic viability, and confidence where it matters most. In addition to ensuring that these workloads are resilient, running them in the cloud requires specific foundational and operational capabilities.
Bank upgrades its resiliency for workloads in the cloud
The experience of a bank and its payment-processing service illustrates the benefits of working with the resiliency framework outlined in the previous exhibit. The bank had an infrastructure that did not allow it to scale and consistently meet its service-level targets. Its payment system was burdened by years of accrued technical debt, which caused both poor resiliency and cycle times too slow for improving existing features and adding new ones. The bank also used multiple cloud providers, which increased the complexity and challenges in managing resiliency.
Things had to change. To achieve the necessary level of resiliency and agility, the bank opted to shift most of its payment-system applications to a single CSP using a multiregional active/active configuration. This allowed it to simplify its operating model and better support its applications. The bank then systematically organized and prioritized its workloads to best determine which resiliency patterns to apply to its infrastructure, which allowed it to identify the workloads that would provide the greatest value.
By applying a combination of the infrastructure resiliency patterns depicted in the exhibit, the bank was able to reduce complexity and increase resiliency. Automation drove a fivefold increase in deployment speed. The elasticity afforded by the cloud platform also enabled the bank to dynamically scale almost 100 percent of its applications up and down based on customer demand. Moreover, the bank achieved 99.999 percent uptime, which allows a maximum of only about five minutes of downtime per year.
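The five-minute figure follows from simple arithmetic on the uptime percentage. A minimal Python sketch, assuming a 365-day year (the function name is hypothetical):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def max_downtime_minutes(availability_percent):
    """Maximum yearly downtime, in minutes, allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


five_nines = max_downtime_minutes(99.999)  # roughly 5.26 minutes per year
three_nines = max_downtime_minutes(99.9)   # roughly 525.6 minutes, under 9 hours
```

Each additional "nine" cuts the downtime budget by a factor of ten, which is why the target should be set per application tier and not applied uniformly.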
Launching a cloud resiliency program
Given the importance of application stability and the financial and reputational risks associated with failing to achieve it, companies should approach cloud resiliency with dedicated resources and sufficient focus. Five key steps are critical in preparing for launch:
- Prioritize critical business journeys and associated applications by mapping the company’s strategic objectives to outputs of existing business impact analysis.
- Pinpoint key vulnerabilities and technical constraints in the systems that support the key migration journeys, and define the resiliency patterns that need to be implemented to meet business resiliency requirements.
- Create a road map that focuses on rapidly implementing a trial “lighthouse” to allow the organization to learn while proving value.
- Identify gaps to address in conjunction with the architecture, such as processes (management of incidents, problems, and change) and talent (unfilled engineering roles in the organization).
- Establish targets for the resiliency program that are aspirational, yet achievable, in terms of alignment to business continuity, duration, and cost.