A Promising New Approach to Reliability Testing for Cluster Management Controllers

Cluster managers such as Kubernetes, Borg, and Omega are essential components of many modern, critical computing systems. However, the individual controllers they rely on to operate face reliability issues that can lead to data loss, security vulnerabilities, and resource leaks. This underscores the importance of controller reliability testing, which today is usually controller-specific and requires expert guidance in the form of formal specifications or carefully crafted test inputs.

In a paper presented this week at the 2022 USENIX Symposium on Operating Systems Design and Implementation (OSDI), we describe Sieve, the first automatic and generalizable reliability testing technique for cluster management controllers. The paper was co-authored with lead author Xudong Sun, a VMware Research intern and PhD student at the University of Illinois at Urbana-Champaign (UIUC), along with Aishwarya Ganesan, Ramnatthan Alagappan, Michael Gasch, and me, all from VMware, and Wenqing Luo, Jiawei Tyler Gu, and Tianyin Xu from UIUC.

Sieve is controller-independent, does not require a formal specification of the cluster manager or the controller, and does not need to be pointed at any specific area of vulnerability. With just a manifest to build the controller image and a set of basic test workloads, Sieve can automatically and efficiently test controllers for otherwise hidden reliability issues. During an initial evaluation, we ran Sieve on 10 popular open-source Kubernetes controllers and found a total of 46 bugs. Since we reported them, 35 of these bugs have been confirmed. Notably, many were deep semantic bugs with potentially serious consequences for system reliability, data loss, and security. Sieve detected them all without expert help.

Exploiting the state-reconciliation principle

Cluster management controllers are usually responsible for a specific function in their corresponding cluster manager. They rely on the state-reconciliation principle, where each controller independently observes the current state of the cluster and issues corrective actions to converge the cluster to a desired state. Unfortunately, since each controller is a unique component in an extremely complex distributed system, it is nearly impossible to predict every situation that a specific controller might respond to. This makes it extremely difficult to create controllers that we can have complete confidence in, even when they are responsible for critical functions. This, in turn, amplifies the importance of regular reliability testing, which has typically been difficult to direct, generalize, or automate.
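To make the state-reconciliation principle concrete, here is a minimal sketch in Python. It is a toy, not code from Kubernetes or from any real controller: the `Cluster` class, the `reconcile` function, and the pod names are all hypothetical and exist only for illustration. One reconciliation pass observes the current state, compares it with the desired state, and issues the corrective actions needed to close the gap.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy in-memory cluster state (hypothetical, for illustration only)."""
    desired_replicas: int
    pods: set = field(default_factory=set)


def reconcile(cluster):
    """Run one reconciliation pass and return the corrective actions taken."""
    actions = []
    # Observe: how far is the current state from the desired state?
    missing = cluster.desired_replicas - len(cluster.pods)
    if missing > 0:
        # Too few pods: create new ones under the first unused names.
        index = 0
        while missing > 0:
            name = f"pod-{index}"
            if name not in cluster.pods:
                cluster.pods.add(name)
                actions.append(f"create {name}")
                missing -= 1
            index += 1
    elif missing < 0:
        # Too many pods: delete the surplus, highest names first in sort order.
        for name in sorted(cluster.pods)[missing:]:
            cluster.pods.remove(name)
            actions.append(f"delete {name}")
    return actions


cluster = Cluster(desired_replicas=3)
first_pass = reconcile(cluster)    # creates three pods
second_pass = reconcile(cluster)   # already converged, nothing to do
cluster.desired_replicas = 1
third_pass = reconcile(cluster)    # deletes the two surplus pods
```

Because each pass starts from the observed state rather than from a record of past events, running it again after convergence produces no actions. Real controllers follow the same observe, compare, act shape, but against a distributed and constantly changing cluster, which is where the hard-to-predict situations come from.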

Sieve’s design is driven by a fundamental observation about state-reconciliation systems: they rely on relatively simple and highly transparent state-centric interfaces between controllers and the main cluster manager. These interfaces perform semantically simple operations on the cluster state (for example, reads and writes) and deliver notifications about cluster state changes. Their simplicity and transparency allow us to build a single tool that can autonomously test many controllers, and automatically detect a wide range of bugs, without needing to know what the controllers are doing.

Here’s how it works: we run a set of test workloads and track the resulting activity at those interface boundaries, then identify promising points at which to deliberately inject a single fault. Sieve then runs the same test workload again, this time with the fault injected at the chosen point. When the injection produces a different resulting trace, we have a strong indicator of the existence and likely location of a potential bug, without needing any semantic information about the workload we are tracing.
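The reference-run and faulty-run comparison described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Sieve's implementation: the `Store`, `CachingController`, and `StatelessController` classes, the `run` and `divergent_injection_points` helpers, and the workload are all hypothetical. The only fault modeled here is a controller crash and restart, and the sketch tries every injection point by brute force, whereas Sieve identifies promising locations first.

```python
class Store:
    """Toy cluster state that survives controller restarts.

    Every write is appended to `trace`, standing in for activity recorded
    at the interface between a controller and the cluster manager.
    """

    def __init__(self):
        self.volumes = set()
        self.trace = []

    def create_volume(self, name):
        self.volumes.add(name)
        self.trace.append(f"create-volume {name}")

    def delete_volume(self, name):
        self.volumes.discard(name)
        self.trace.append(f"delete-volume {name}")


class CachingController:
    """Buggy on purpose: trusts an in-memory set that a restart wipes."""

    def __init__(self, store):
        self.store = store
        self.owned = set()

    def handle(self, kind, name):
        if kind == "ADDED" and name not in self.owned:
            self.store.create_volume(name)
            self.owned.add(name)
        elif kind == "DELETED" and name in self.owned:
            self.store.delete_volume(name)
            self.owned.discard(name)


class StatelessController:
    """Re-reads the cluster state on every event, so a restart loses nothing."""

    def __init__(self, store):
        self.store = store

    def handle(self, kind, name):
        if kind == "ADDED" and name not in self.store.volumes:
            self.store.create_volume(name)
        elif kind == "DELETED" and name in self.store.volumes:
            self.store.delete_volume(name)


def run(controller_cls, workload, restart_before=None):
    """Replay the workload and return the trace of writes.

    If `restart_before` is an index, the controller is replaced by a fresh
    instance just before that event (the injected crash-and-restart fault).
    """
    store = Store()
    controller = controller_cls(store)
    for i, (kind, name) in enumerate(workload):
        if i == restart_before:
            controller = controller_cls(store)
        controller.handle(kind, name)
    return store.trace


def divergent_injection_points(controller_cls, workload):
    """Return the injection points whose trace differs from the reference run."""
    reference = run(controller_cls, workload)
    return [
        i
        for i in range(len(workload))
        if run(controller_cls, workload, restart_before=i) != reference
    ]


WORKLOAD = [
    ("ADDED", "db-0"),
    ("ADDED", "db-1"),
    ("DELETED", "db-0"),
    ("DELETED", "db-1"),
]

buggy_points = divergent_injection_points(CachingController, WORKLOAD)
clean_points = divergent_injection_points(StatelessController, WORKLOAD)
```

In this toy, restarting `CachingController` after it has created a volume makes it forget that volume, so a later delete is skipped and the trace diverges from the reference run. `StatelessController` produces the same trace at every injection point. Note that the harness never inspects what the events mean; it only compares traces, which is the property that lets one tool cover many controllers.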

As such, Sieve works without the need to formally specify the controller or cluster manager, make assumptions about where bugs reside in the code, or use highly specialized test inputs. Nor does it rely on assertions written by experts. All it needs is a manifest to build the controller image and basic test workloads. After that, Sieve’s tests are fully automated and reproducible. This degree of usability is key to making reliability testing widely accessible to the rapidly growing number of custom controllers.

We evaluated Sieve against 10 popular Kubernetes ecosystem controllers for managing widely used cloud systems, including Cassandra, Elasticsearch, and MongoDB, using between two and five baseline test workloads for each controller. It took us an average of three hours to apply Sieve to each controller, although much of that time was spent figuring out how to build the controller. In this evaluation, Sieve found new bugs in every controller, for a total of 46 safety and liveness bugs (as mentioned earlier, 35 of them are already confirmed, and 22 are fixed), with a low false-positive rate of 3.5%.

It’s worth re-emphasizing that these bugs had serious potential consequences, ranging from application crashes to security vulnerabilities, resource leaks, and data loss.

A new tool for controller development

While it was important to us to be able to catch previously unidentified bugs in existing and widely deployed controllers (and while we envision Sieve being used to regularly test the reliability of controllers already in operation), our larger goal was to help the controller-development process, increasing reliability as controllers are written and before they start running critical workloads.

To that end, we’ve released Sieve’s test code and workloads (see https://github.com/sieve-project/sieve), along with instructions on how to reproduce the bugs we discovered.

This work began as a research project designed with Xudong during his first internship at VMware Research in 2020. It’s a great example of the kind of groundwork that can come from VMware academic/industry collaborations. I’m also thrilled that Xudong is interning with us again this summer, where he’s looking to extend that work by making Sieve easier and faster to use and more easily integrated into development pipelines.

VMware Research is always interested in projects that explore new classes of systems in promising areas that are both difficult to address and have not yet received as much attention as they deserve. While we’re focusing here on controllers that work with Kubernetes, we’re also thinking about how we can use a similar methodology in the context of other modern state-centric interfaces, and how what we’ve learned can be used to improve the reliability of other types of applications.

If you want to learn more about the research project that led to Sieve, check out the presentation Xudong and I gave at KubeCon 2021 North America, “Automated and Distributed Systems Testing for Kubernetes Controllers”. I also had a great conversation on the subject (and how it fits into our broader research interests) with Sudesh Girdhari from VMware’s CloudStream YouTube channel.