Panos Kozanian
Webex is part of the business continuity plan of over 95% of all Fortune 500 companies. COVID-19 changed our world forever, and working from home (WFH) became a necessity as sheltering in place became the new norm overnight. As every IT department around the globe executed their business continuity plans, Webex’s criticality to our customers and its utilization soared.
I’ll be sharing the story and some of the lessons we learned as we scaled Webex to meet the new demand, on both the technical and the process side.
Here’s a TL;DR of the lessons learned:
Before I dive into the story and the lessons learned, I want to recognize two things:
First, at Webex, we are most grateful for the people who are helping on the front lines without the luxury of being protected by the OSI layers of networking.
Second, we pride ourselves on being an extension of our customers’ IT departments.
Webex runs the largest global network and data center backbone dedicated to real-time collaboration. In our 24/7 Network Operations Center (NOC), we observe global events regularly: typhoons, landslides, earthquakes, and internet route disruptions and congestion are all events we observe and react to.
On February 3rd, our network monitoring alerts were triggered by a drastic increase in traffic from our multinational customers in China, who connect via our global network, as well as from our China-based local customers, who connect through a dedicated point of presence in China that is physically separate from the rest of Webex. We eventually handled a 22x network traffic increase over our January baseline for that region, as shelter-in-place orders drove our customers’ employees in China to connect to our Webex services.
At this point, our NOC’s assessment was that we were facing at the very least an epidemic, and possibly a pandemic, that could affect the broader APAC region. We started to mobilize compute and network capacity increases in the region.
In the penultimate week of February, our Site Reliability Engineers (SREs) observed an unexpected increase across all regions in our year-over-year comparison graphs. This is when we knew a global event was coming our way, and we put together a team dedicated to scenario planning for different possible outcomes, based on data from recent epidemics and pandemics.
Given Webex’s role in the business continuity plans of so many Fortune 500 companies, our teams planned out three scenarios: one that estimated a 130% increase in peak utilization if the pandemic was fairly well contained, a second with a 150% increase in case of a massive spread, and a third with a 200% increase as the worst case we could imagine at that time. In retrospect, we were underestimating tremendously, misled by how recent epidemics and pandemics (such as the 2009 swine flu) had played out. Despite it being an underestimate, we’re glad we started executing early on scenario 3, the worst-case scenario.
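A rough sketch of this kind of scenario math follows. The baseline figure is invented, not a real Webex number, and the growth factors simply mirror the article’s percentages, read as growth to that multiple of baseline:

```python
# Illustrative scenario planning sketch. BASELINE_PEAK_ATTENDEES is an
# invented number; the growth factors mirror the three scenarios above.

BASELINE_PEAK_ATTENDEES = 1_000_000  # assumed February peak (illustrative)

SCENARIOS = {
    "scenario_1_contained": 1.30,   # pandemic fairly well contained
    "scenario_2_spread": 1.50,      # massive spread
    "scenario_3_worst_case": 2.00,  # worst case imagined at the time
}

def projected_peak(baseline: int, growth_factor: float) -> int:
    """Projected peak attendees under a growth scenario."""
    return int(baseline * growth_factor)

for name, factor in SCENARIOS.items():
    gap = projected_peak(BASELINE_PEAK_ATTENDEES, factor) - BASELINE_PEAK_ATTENDEES
    print(f"{name}: plan for {gap:,} additional peak attendees")
```

Even simple models like this are useful mainly for the capacity gap they expose per region, which is what drives hardware orders and lead times.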
Capacity increases are a common task at Webex, but the scale at which this one came our way required coordination across a very wide range of teams, each making multiple optimizations a day and each looking at capacity through a different lens: CPU utilization for compute, Gbps for network, TB for storage, QPS for databases, and so on. Each of these was converted to a common metric relatable to both our engineers and our customers: “Peak Attendees”. These conversions let us quickly identify where bottlenecks might show up in our aggregated models across our global data center footprint.
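The normalization idea can be sketched roughly as below. The conversion factors and the regional inventory are invented for illustration; real ratios would come from workload profiling:

```python
# Sketch of normalizing heterogeneous capacity metrics into a single
# "Peak Attendees" figure. All conversion factors are invented.

ATTENDEES_PER_UNIT = {
    "cpu_cores": 500,          # assumed attendees served per CPU core
    "network_gbps": 10_000,    # assumed attendees per Gbps of backbone
    "storage_tb": 2_000_000,   # assumed attendees per TB (recordings etc.)
    "db_qps": 2,               # assumed attendees per query/sec of headroom
}

def peak_attendees_supported(inventory: dict) -> tuple:
    """Return the bottleneck resource and the attendee ceiling it implies."""
    ceilings = {
        resource: amount * ATTENDEES_PER_UNIT[resource]
        for resource, amount in inventory.items()
    }
    bottleneck = min(ceilings, key=ceilings.get)
    return bottleneck, ceilings[bottleneck]

# Example: a hypothetical regional data center's inventory.
region = {"cpu_cores": 50_000, "network_gbps": 1_500,
          "storage_tb": 9_000, "db_qps": 8_000_000}
print(peak_attendees_supported(region))  # → ('network_gbps', 15000000)
```

The value of a shared unit like this is that a compute team and a network team can compare their headroom directly, and the smallest ceiling immediately identifies the next bottleneck to attack.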
Scenario 3’s mitigations required an aggressive timeline. In this scenario, Webex could temporarily run out of capacity in a given region. The plan included leveraging our global footprint, accelerated through our backbone, to deliver services to busy regions from regions that were sleeping.
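A minimal sketch of this follow-the-sun idea, with invented region names, UTC offsets, and off-peak window:

```python
# Hypothetical sketch: pick regions whose local time is off-peak so their
# spare capacity can serve overflow from busy regions. All region names,
# offsets, and the off-peak window are illustrative.

from datetime import datetime, timedelta, timezone

REGION_UTC_OFFSET = {"us-east": -5, "eu-central": 1, "apac-sg": 8}

OFF_PEAK_HOURS = range(0, 6)  # assume 00:00-05:59 local is off-peak

def sleeping_regions(now_utc: datetime) -> list:
    """Regions currently in their local off-peak window, eligible for overflow."""
    eligible = []
    for region, offset in REGION_UTC_OFFSET.items():
        local_hour = (now_utc + timedelta(hours=offset)).hour
        if local_hour in OFF_PEAK_HOURS:
            eligible.append(region)
    return eligible

# Example: at 02:00 UTC it is 21:00 in us-east (busy), 03:00 in eu-central
# (sleeping), and 10:00 in apac-sg (busy).
print(sleeping_regions(datetime(2020, 3, 15, 2, 0, tzinfo=timezone.utc)))
# → ['eu-central']
```

In practice the routing decision would also weigh added latency and backbone headroom, not just local time, but time of day is the first-order signal.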
Armed with plans for these three potential scenarios, we started executing on the mitigations for scenario 3, a 200% global increase, believing we were preparing for the worst case.
By March 2nd, we were deploying all the capacity we had on hand across the globe and augmenting our backbone for a 200% increase. We also started provisioning burst capacity in public clouds to rapidly extend our global backbone.
To assist enterprises, schools, hospitals, and governments in transitioning to shelter in place, we also loosened the restrictions on our free offering (which is isolated from our paying customers’ services): we removed all time limits, allowed up to 100 participants per meeting, and even provided phone dial-ins globally.
In early March, we started proactively reaching out to customers, letting them know what we were seeing and that we had our entire company’s backing to support them. Our account teams, backed by our executives, offered assistance to our customers in transitioning to an all-work-from-home employee base with documentation, tutorials, and training. These early proactive communications were the first of many, with our account teams working closely with our customers throughout the massive work-from-home transitions.
Webex runs our own cloud and global backbone: 16 data centers connected by a backbone that handles 1.5 Tbps at peak. Our backbone also connects five different geographies directly to public cloud providers. In early March, we quickly extended our services into different public clouds in order to reserve capacity. This served us incredibly well, as the cloud providers were themselves scaling to meet the demand coming their way.
By March 9th, most companies with flexible WFH policies had all of their employees working from home (including all of us Webex employees). We could also see in our capacity utilization that our scenario 3 was not going to be sufficient. We established a state of emergency, which included getting resources from across all of Cisco to assist so we could support our customers. Our own CEO, Chuck Robbins, joined our state-of-emergency bridge and gave us a single focus to tackle: “Do everything you need to do for our customers’ business continuity. You have the entire company behind you.”
Cisco is the number one manufacturer of networking equipment. We produce our own servers and have some of the best relationships in the industry with service providers. In the first week of March, we knew it was time to get all the help we could from across the company and our partners to support our customers’ business continuity plans. We went on a war footing to provide further global capacity to Webex, with our own CAP managers coordinating 24/7 bridges with our vendors.
Our COVID-19 scale-up was a unique type of incident management: its duration was nearly 100 days (from low to peak utilization), its scale was unprecedented (growing a mature business to 400% of baseline), and its impact was broad (all parts of our Webex business were involved and needed coordination). Below is the process diagram that shows how we ran the efforts 24/7 for 100 days, leveraging Webex Meetings itself for all the coordination.
The Change Commanders in the diagram above were given full autonomy to do what they needed to sustain their area. They were also intentionally kept out of incidents and escalations by the Unified Incident Commander, who acted as a control tower for all scaling and incident activities. This allowed the scaling team to stay focused on scaling the service while Incident Commanders handled specific hot spots.
To give you a sense of what a 24-hour period on Webex looks like, see the graphs below. The only reprieve from regional all-time-high load increases was the four-hour window between 22:00 and 02:00 UTC, when the sun is setting on the Americas and about to rise on APAC.
By Monday, March 23rd, nearly all companies, governments, and schools were sheltering in place. At this point, we had all of our processes and engagements in place, constrained only by the time-to-delivery of hardware. We could see countries around the globe lighting up as they used Webex for their business continuity plans. We also noticed a new wave of education customers around the globe with their own discernible access patterns.
The access pattern of our education customers is distinctive: more video joins, heavier use of recordings, meetings concentrated in time, and high geographic density of participants. We quickly optimized network paths and expanded further into hyperlocal public clouds for these new education customers.
The first week of April was the first time in 60 days that we saw less than double-digit week-over-week growth. In a twist of fate, security-aware governments and enterprises around the world noticed the security flaws of some of our competitors, and some began a migration to Webex. This meant we needed to prepare for a second wave of growth.
We had our new scaling machinery well-oiled by this point and leveraged the lessons learned from early March to accelerate our readiness for customers migrating to our secure platform. This included further scaling of our backbone, compute, storage, and public cloud extensions.
The second half of April saw another increase of over 25% in our user base, driven by migrations from competitors to Webex. This increase was absorbed seamlessly, with our compute, storage, network, database, application, and media scale-up teams able to meet the new demand. The improved stability during this second wave is reflected in a decrease in customer-impacting incidents, as shown by the graph below.
The Webex process for ensuring service stability coped well, but with enormous growth and a high rate of change there was some level of service disruption, predominantly in the critical month of March. In comparison, users of other comparable services experienced substantially more outages, up to 5x more, throughout the March-May period and beyond.
The month of May landed us at 400% of our February baseline. This became our new plateau before a seasonal slowdown driven by summer holidays and our education customers holding fewer meetings.
Under the Cisco-wide “Day for Me” program, we were given a day off on May 22nd, for many of us on the Webex team the first day off after 100 straight days of work. It gave us the breather we needed, and it was a great way to celebrate having handled the second wave of scale-up gracefully.
Our scale-up efforts were not as smooth in the early days of the shelter-in-place orders as we would have liked. We hope that the lessons shared above can help our peer site reliability engineers, cloud engineers, SaaS developers, and IT organizations learn from the processes and tools that helped us achieve what was arguably the biggest scale-up effort most of us could have ever anticipated.
We also recognize that the story of scaling Webex is only a portion of the journey our customers went through and our IT counterparts in each of our customers’ organizations deserve an incredible amount of recognition for effectively executing on some of the most challenging business continuity plans we will face in our lifetime.
Finally, we at Webex are grateful to have been in a position to assist our customers, schools, hospitals, and governments around the globe in this unprecedented world event.
About the author
Panos Kozanian is a Director of Engineering responsible for the Webex Platform Organization. The Webex Platform Organization is responsible for all infrastructure assets: data centers, compute, storage, and network, the PaaS layer that powers all Webex services, and collaboration services such as Common Identity, Control Hub, and the Analytics Platform. Additionally, the Webex Platform Organization is responsible for reliability engineering, ensuring that Webex continues to be delivered with high availability and world-class performance. Prior to this role, Panos led the Webex Teams Platform, establishing a modern DevOps and SRE culture supporting thousands of microservice instances and 1000+ developers. Throughout his career, he has held a number of leadership roles, including forming and leading Cisco’s Business Incubation lab, managing Cisco’s Digital Signage team, and leading Cisco’s Video Portal efforts. Panos joined Cisco in 2003, starting his career working on business incubations and executive demos. Panos earned a Bachelor of Science in computer engineering from Santa Clara University.