In this post, we explore how chaos engineering helped us ensure the resiliency of the Cisco Webex Contact Center (Webex CC) platform, share some of the tools we used and the issues we uncovered along the way, and explain why we believe our product has industry-leading resiliency.
Webex CC is a Contact Center as a Service (CCaaS) offering that enables smart, proactive, and personalized interactions across the customer journey.
Webex CC is architected, designed, and developed, from the ground up, as a cloud-native solution, with the following core architectural principles.
At Webex CC, we understand that the reliability and availability of our platform are critical to our customers’ success. That’s why we’ve invested in designing and implementing robust systems and processes that enable us to detect and mitigate potential failures before they impact our users.
The Webex CC platform employs infrastructure-as-code and modern GitOps practices, with Git repositories serving as the authoritative source for infrastructure definitions. It achieves high availability by combining technologies such as Kubernetes, Kafka messaging, the Istio service mesh, and secrets management with Amazon Web Services (AWS) managed services, including Relational Database Service (RDS) and OpenSearch.
High availability is achieved by running clustered services spread across multiple Availability Zones (AZs), as illustrated in the accompanying figure.
One of the key practices we’ve adopted to achieve this level of resiliency is chaos engineering. Chaos engineering is a methodology that helps us identify and address potential weaknesses in our systems by intentionally introducing failures in a controlled and safe manner.
Chaos testing (also known as chaos engineering) is a method of testing software systems and infrastructure by intentionally introducing failures or disruptions to see how the system responds.
The service disruption scenarios are designed from system analysis and lessons learned from previous incidents, focusing on the points where Webex CC applications interact with the external or platform services they depend on.
The service disruption scenarios include:
These test scenarios are based on the high-level system view which is depicted in the diagram below.
During periodic meetings with the AWS Technical Account team, the Webex CC engineering teams received suggestions on how to conduct chaos testing.
As part of these discussions, AWS suggested using AWS Fault Injection Simulator (FIS) for certain tests and recommended performing others manually, namely those that required removing rules from Security Groups or denying S3 access. For injecting failures at the Kubernetes Pod level, AWS also recommended Chaos Mesh, an open-source tool.
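To make the Pod-level option concrete, here is a minimal sketch of what a Chaos Mesh pod-kill experiment definition might look like, built as a Python dictionary and printed as JSON (which `kubectl apply -f` accepts alongside YAML). The experiment name, namespace, and `app` label are hypothetical placeholders, not values from the Webex CC environment, and this is not necessarily how our own tests were defined.

```python
import json


def pod_kill_manifest(name: str, namespace: str, app_label: str) -> dict:
    """Build a Chaos Mesh PodChaos resource that kills one matching Pod.

    All argument values used below are illustrative assumptions.
    """
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Kill a Pod and let Kubernetes reschedule it.
            "action": "pod-kill",
            # Pick a single Pod among those matched by the selector.
            "mode": "one",
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
        },
    }


# Hypothetical target: one Pod of a service labelled app=agent-login.
manifest = pod_kill_manifest("agent-login-pod-kill", "chaos-demo", "agent-login")
print(json.dumps(manifest, indent=2))
```

Limiting the experiment to `mode: one` and a narrow label selector keeps the blast radius small, which suits the controlled, safe style of failure injection described above.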
To achieve better control over manual service disruptions during the Webex CC chaos tests, various methods were utilized, including:
These methods were employed to create disruptions intentionally and manually for the scenarios mentioned earlier, thus avoiding the need for multiple tools to perform the tests comprehensively.
Chaos tests identified call failures, agent login failures, and reporting failures that have since been addressed to improve the resiliency of Webex CC.
Some highlights include:
We are currently expanding our Webex CC chaos test coverage. We are also exploring how to automate these tests, including post-recovery actions, as part of our regression suite. This would let us run chaos tests on demand without manual execution after the initial setup, as we already do with our automated and load tests.
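As a rough illustration of what such automation could look like, the sketch below runs a single experiment as inject, observe, recover, and re-check, and guarantees that the recovery step runs even if the injection or the health check raises. Every name here (`ChaosExperiment`, `run_experiment`, the in-memory zone set) is hypothetical and stands in for real disruption and health-check logic; it is not our actual test harness.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ChaosExperiment:
    name: str
    inject: Callable[[], None]         # introduce the disruption
    steady_state: Callable[[], bool]   # True when the system looks healthy
    recover: Callable[[], None]        # post-recovery action


@dataclass
class ExperimentResult:
    name: str
    healthy_before: bool
    healthy_during: Optional[bool]     # None if the fault was never injected
    healthy_after: Optional[bool]


def run_experiment(exp: ChaosExperiment) -> ExperimentResult:
    # Never inject a fault into a system that is already unhealthy.
    if not exp.steady_state():
        return ExperimentResult(exp.name, False, None, None)

    healthy_during = None
    try:
        exp.inject()
        healthy_during = exp.steady_state()
    finally:
        # Recovery always runs, even if injection or the check failed.
        exp.recover()

    return ExperimentResult(exp.name, True, healthy_during, exp.steady_state())


# In-memory stand-in for a service spread across three zones.
zones = {"az-a", "az-b", "az-c"}

experiment = ChaosExperiment(
    name="lose-one-zone",
    inject=lambda: zones.discard("az-a"),
    steady_state=lambda: len(zones) >= 2,  # assumed capacity threshold
    recover=lambda: zones.add("az-a"),
)

print(run_experiment(experiment))
```

Checking the steady state before, during, and after the disruption gives a regression suite three separate signals: whether the test was valid to run, whether the system tolerated the fault, and whether recovery restored it.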
Webex CC Engineering teams have seen how chaos engineering can be a powerful tool for improving system resilience by proactively identifying and addressing potential failure scenarios. As with any new testing methodology, it can be challenging to know where to start. Therefore, our team recommends starting small and gradually increasing complexity to avoid being overwhelmed by the process. This approach will help to identify any weaknesses in the testing methodology and allow for iterative improvements.
Furthermore, it’s crucial to keep in mind that the primary objective of chaos engineering is not to deliberately induce chaos without purpose. Rather, it is to utilize the insights gained from testing to enhance resilience, and concurrently improve monitoring and alerting systems for more efficient issue detection and response.
Implementing chaos engineering can be a valuable way to improve your system’s resilience. Start small, iterate, and use the insights gained. By doing so, you will be better equipped to handle any unexpected events that may arise in the future.
We would like to extend our heartfelt thanks and appreciation to Anuj Butail, Nikola Bravo, and Neelesh Adam from AWS who worked closely with us to understand our testing objectives and provided us with a clear roadmap for executing chaos testing.