New Solutions Bridge the Gap Between Software and Real-World Impacts

GUEST REVIEW: Kelsey Hightower, a Google software engineer widely known as a Kubernetes advocate, once explained Kubernetes' role in managing and running software by comparing it to the postal service.

If you want to ship something from A to B, all you have to do is put it in a box, write an address on it, take it to the post office and pay a fee. The post office takes care of all the different forms of transport, and the connections between them, needed to deliver your package to its destination, without you, the sender, ever having to see any of it.

If something interrupts the sequence of events required to complete this process, the post office again deals with it, largely invisibly to you, although your package may arrive later than scheduled.

The role of Kubernetes

The job of Kubernetes is to manage the execution of an application across multiple machines in a way that is transparent to the user. In effect, it puts the application in a box (a container) and hides the complexity of the infrastructure needed to run it.

But that's not the full picture. In the post office analogy, you can pay extra for priority delivery and you can track the progress of your package, but you can't see in detail what the problem is, and you can't intervene to speed up delivery when there is a delay. Nor can the post office understand how urgent your delivery really is and react accordingly.

The analogy extends to Kubernetes. With a standard Kubernetes platform, it is not always possible to identify where a problem or a delay has occurred. Even when the information is available, it rarely reflects the real impact of the problem: to an IT professional, a blocked application may show up as nothing more than a red dot on a screen. In the real world, it could mean long queues of customers stuck at a cashless checkout, unable to pay for their groceries electronically.

To avoid such issues, software needs service level objectives (SLOs) that are reflected in performance goals for individual applications, backed by observability solutions capable of reading log data, monitoring application performance, translating any degradation into its impact on the associated SLOs, and informing IT teams of the real-world consequences of these problems.
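To make this concrete, here is a minimal Python sketch, assuming an availability SLO measured as the share of successful requests over a time window. The `SLO` class, the `evaluate_slo` function and the checkout example are hypothetical illustrations, not part of any particular observability product. The sketch computes the SLI, checks it against the target, reports how much of the error budget (the failures the SLO tolerates) has been consumed, and attaches a plain-language business impact when the objective is missed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SLO:
    """Hypothetical availability SLO for one user-facing service."""
    name: str
    target: float          # e.g. 0.999: 99.9% of requests must succeed
    business_impact: str   # what missing the objective means in the real world


def evaluate_slo(slo: SLO, good: int, total: int) -> dict:
    """Compare the measured SLI (success ratio) with the SLO target."""
    if total <= 0:
        raise ValueError("total must be positive")
    sli = good / total
    failures = total - good
    # Error budget: how many failures the SLO tolerates over this window.
    allowed = (1.0 - slo.target) * total
    if allowed > 0:
        budget_used = failures / allowed
    else:
        # A 100% target tolerates no failures at all.
        budget_used = 0.0 if failures == 0 else float("inf")
    met = sli >= slo.target
    impact: Optional[str] = None if met else slo.business_impact
    return {"sli": sli, "met": met, "budget_used": budget_used, "impact": impact}


checkout = SLO("checkout-availability", 0.999,
               "Customers cannot pay for their groceries at checkout")
result = evaluate_slo(checkout, good=99_750, total=100_000)
print(result["met"], round(result["budget_used"], 2), result["impact"])
```

With a 99.9% target, 250 failed checkout requests out of 100,000 use up two and a half times the error budget, and the result carries the business impact rather than just a red status.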

Translate software issues into real impacts

Modern, more advanced observability platforms take things one step, and in some cases two steps, further. They provide the information needed to analyze the problem in the context of the SLOs and recommend the most appropriate action IT teams can take to bring the service back within its SLO. These platforms can also let developers define automated responses that are triggered under specified conditions.
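The automated-response idea can be sketched as an ordered list of condition/action rules evaluated against current SLO metrics. Everything here is hypothetical: the rule names, metric keys and action labels are illustrative assumptions, and where a real platform would call its own orchestration API, this sketch only returns a label.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Metrics = Dict[str, float]


@dataclass
class Rule:
    """Hypothetical remediation rule: when `condition` holds, recommend `action`."""
    name: str
    condition: Callable[[Metrics], bool]
    action: str  # label only; a real platform would trigger an API call here


# Ordered from most to least specific; the first matching rule wins.
RULES: List[Rule] = [
    Rule("rollback-after-bad-deploy",
         lambda m: m["budget_used"] > 1.0 and m["minutes_since_deploy"] < 30,
         "rollback"),
    Rule("scale-out-on-latency",
         lambda m: m["p95_latency_ms"] > m["latency_slo_ms"],
         "scale_out"),
    Rule("page-on-budget-exhaustion",
         lambda m: m["budget_used"] > 1.0,
         "page_on_call"),
]


def choose_action(metrics: Metrics, rules: List[Rule] = RULES) -> Optional[str]:
    """Return the action of the first rule whose condition matches, if any."""
    for rule in rules:
        if rule.condition(metrics):
            return rule.action
    return None


print(choose_action({"budget_used": 2.5, "minutes_since_deploy": 10,
                     "p95_latency_ms": 300, "latency_slo_ms": 500}))
```

Ordering the rules from most to least specific keeps the response predictable: an exhausted error budget right after a deployment triggers a rollback, while the same budget burn with no recent deployment pages a human instead.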

Today, given the heavy reliance on cloud computing, a single end-user experience very likely depends on applications running across multiple cloud services. Observability solutions must therefore be able to see across all of these systems and translate any type of problem into its impact on the overall service, even when the failing component is only one of many elements that the service depends on.
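A short sketch shows why this matters, assuming a hypothetical dependency map in which each user-facing service lists the components on its request path, hosted by different cloud providers. `impacted_services` maps a failed component back to the services it affects, and `serial_availability` estimates end-to-end availability when every component must succeed, under the simplifying (and often optimistic) assumption that failures are independent. All names and figures are illustrative.

```python
from math import prod
from typing import Dict, List, Mapping

# Hypothetical dependency map: user-facing service -> components on its request
# path, each possibly hosted by a different cloud provider.
DEPENDENCIES: Dict[str, List[str]] = {
    "checkout": ["frontend@cloud-a", "payments-api@cloud-b", "orders-db@cloud-c"],
    "product-search": ["frontend@cloud-a", "search-index@cloud-b"],
}


def impacted_services(failed: str,
                      deps: Mapping[str, List[str]] = DEPENDENCIES) -> List[str]:
    """Return every user-facing service whose request path includes `failed`."""
    return sorted(svc for svc, comps in deps.items() if failed in comps)


def serial_availability(availability: Mapping[str, float],
                        components: List[str]) -> float:
    """End-to-end availability when every component must succeed.

    Assumes independent failures, a simplification real systems often violate.
    """
    return prod(availability[c] for c in components)


measured = {"frontend@cloud-a": 0.999,
            "payments-api@cloud-b": 0.995,
            "orders-db@cloud-c": 0.9995}
print(impacted_services("frontend@cloud-a"))
print(f"{serial_availability(measured, DEPENDENCIES['checkout']):.4%}")
```

With these example figures, each component looks healthy on its own at 99.5% or better, yet the checkout path as a whole is only about 99.35% available.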

An example of such a solution is the open source Keptn control plane, which was contributed to the Cloud Native Computing Foundation by Dynatrace. Keptn provides automated configuration of observability solutions, dashboard creation and SLO-based alerts.
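Keptn expresses SLOs as objectives over SLIs, each with pass and warning criteria and a weight, and combines them into an overall score. The Python sketch below is only loosely modeled on that idea; it is not Keptn's code, configuration format or API. Criteria are reduced to simple upper bounds, a warning is assumed to earn half an objective's weight, a failing "key" objective is assumed to fail the whole evaluation, and the 90% and 75% thresholds are illustrative defaults.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Objective:
    """One SLI objective; lower values are better (latency, error rate)."""
    sli: str
    pass_below: float      # full weight if value < pass_below
    warning_below: float   # half weight if pass_below <= value < warning_below
    weight: float = 1.0
    key: bool = False      # assumption: a failing key objective fails everything


def quality_gate(objectives: List[Objective], values: Dict[str, float],
                 pass_pct: float = 90.0,
                 warning_pct: float = 75.0) -> Tuple[str, float]:
    """Score measured SLI values and return (status, score in percent)."""
    earned = total = 0.0
    key_failed = False
    for obj in objectives:
        total += obj.weight
        value = values[obj.sli]
        if value < obj.pass_below:
            earned += obj.weight
        elif value < obj.warning_below:
            earned += obj.weight / 2
        elif obj.key:
            key_failed = True
    score = 100.0 * earned / total if total else 0.0
    if key_failed or score < warning_pct:
        return "fail", score
    return ("pass" if score >= pass_pct else "warning"), score


objectives = [
    Objective("response_time_p95_ms", pass_below=600, warning_below=800),
    Objective("error_rate", pass_below=0.01, warning_below=0.02, weight=2, key=True),
]
print(quality_gate(objectives, {"response_time_p95_ms": 700, "error_rate": 0.005}))
```

A weighted score like this turns several SLIs into a single pass/warning/fail signal, which is what an SLO-based alert needs to fire on.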

This makes life easier for developers: they can write code, define the resources they want, and specify the actions needed to bring a system back to a healthy state when things go wrong. And because Keptn is open source, it now sits at the center of a growing ecosystem of contributors who are improving its functionality.