Chaos Mesh + SkyWalking: Better Observability for Chaos Engineering

This tutorial demonstrates how to combine SkyWalking and Chaos Mesh to better observe the effects of chaos experiments on applications’ service performance.

Chaos Mesh is an open-source cloud-native chaos engineering platform. You can use Chaos Mesh to conveniently inject failures and simulate abnormalities that might occur in reality, so you can identify potential problems in your system. Chaos Mesh also offers a Chaos Dashboard that lets you monitor the status of a chaos experiment. However, this dashboard does not show how the failures in an experiment impact the service performance of applications, which makes it harder to test systems further and find potential problems.

Apache SkyWalking is an open-source application performance monitor (APM), specially designed to monitor, track, and diagnose cloud-native, container-based distributed systems. It collects events that occur and then displays them on its dashboard, allowing you to observe directly the type and number of events that have occurred in your system and how different events impact the service performance.

When you use SkyWalking and Chaos Mesh together during chaos experiments, you can observe how different failures impact the service performance.

This tutorial will show you how to configure SkyWalking and Chaos Mesh. You’ll also learn how to leverage the two systems to monitor events and observe in real time how chaos experiments impact applications’ service performance.

Prerequisites

Before you start to use SkyWalking and Chaos Mesh, you have to:

Set up a SkyWalking cluster according to the SkyWalking configuration guide.
Deploy Chaos Mesh using Helm.
Install JMeter or other Java testing tools (to increase service loads).
Configure SkyWalking and Chaos Mesh according to this guide if you just want to run a demo.

Step 1: Access the SkyWalking Cluster

After you install the SkyWalking cluster, you can access its user interface. However, no service is running at this point, so before you start monitoring, you have to add one and set up the agents.

In this tutorial, we take Spring Boot, a lightweight microservice framework, as an example to build a simplified demo environment.

1. Create a SkyWalking demo in Spring Boot by referring to this document.

2. Execute the command kubectl apply -f demo-deployment.yaml -n skywalking to deploy the demo.

After you finish the deployment, you can observe the real-time monitoring results in the SkyWalking UI.

Note: Spring Boot and SkyWalking have the same default port number: 8080. Be careful when you configure port forwarding; otherwise, you may have port conflicts. For example, to avoid a conflict you can forward local port 8079 to the demo service’s port 8080 with a command like kubectl port-forward svc/spring-boot-skywalking-demo 8079:8080 -n skywalking.

Step 2: Deploy SkyWalking Kubernetes Event Exporter

SkyWalking Kubernetes Event Exporter is able to watch, filter, and send Kubernetes events into the SkyWalking backend. SkyWalking then associates the events with the system metrics and displays an overview of when and how the metrics are affected by the events.

If you want to deploy SkyWalking Kubernetes Event Exporter with a single command, refer to this document to create the configuration files in YAML format, and then customize the parameters in the filters and exporters. You can then use the command kubectl apply to deploy SkyWalking Kubernetes Event Exporter.

Step 3: Use JMeter To Increase Service Loads

To better observe the change in service performance, you need to increase the service loads on Spring Boot. In this tutorial, we use JMeter, a widely adopted Java testing tool, to increase the service loads.

Perform a stress test on localhost:8079 using JMeter, and add five threads to continuously increase the service loads.

The user interface of Apache JMeter
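If JMeter is not at hand, the same idea (five threads sending requests in a loop) can be approximated with a short script. This is only a rough sketch and not a replacement for JMeter's reporting: the function names, the root path /, and the timeout are assumptions made for illustration, and it presumes the demo is reachable on localhost:8079 through the port forward from Step 1.

```python
import threading
import time
import urllib.request
from collections import Counter


def fetch_status(url, timeout=5.0):
    """Send one GET request; return the HTTP status code, or 0 on any error
    (connection failures and HTTP error statuses both count as 0 here)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except Exception:
        return 0


def run_load(url, threads=5, duration_s=10.0, fetch=fetch_status):
    """Call fetch(url) in a loop from several threads until the deadline.

    Returns a Counter that maps each observed status code to its count.
    """
    counts = Counter()
    lock = threading.Lock()
    deadline = time.monotonic() + duration_s

    def worker():
        while time.monotonic() < deadline:
            status = fetch(url)
            with lock:  # the Counter is shared between threads
                counts[status] += 1

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return counts
```

Calling run_load("http://localhost:8079/", threads=5, duration_s=60.0) would keep five threads busy for one minute, and the sum of the returned counts gives a rough calls-per-minute figure to compare with the SkyWalking Dashboard.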

Open the SkyWalking Dashboard. You can see that the access rate is 100%, and that the service loads reach about 5,300 calls per minute (CPM).

SkyWalking Dashboard

Step 4: Inject Failures via Chaos Mesh and Observe Results

After you finish the three steps above, you can use the Chaos Dashboard to simulate stress scenarios and observe the change in service performance during chaos experiments.

CPU Load: 10%; Memory Load: 128 MB

The first chaos experiment simulates low CPU usage. To display when a chaos experiment starts and ends, click the switching button on the right side of the dashboard. To identify whether the experiment is Applied to the system or Recovered from the system, move your cursor onto the short, green line.
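The experiment in this tutorial is created through the Chaos Dashboard. For readers who prefer a declarative form, the same stress settings can presumably be written as a Chaos Mesh StressChaos resource. The sketch below builds such a resource as JSON, which kubectl also accepts. The resource name, the app label, the mode, the worker counts, and the duration are all assumptions; only the 10% CPU load and the 128 MB memory load come from the experiment described here.

```python
import json


def stress_chaos_manifest(name, namespace, app_label, cpu_load, memory_size, duration):
    """Build a StressChaos resource as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "StressChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "mode": "one",  # assumption: stress a single matching pod
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "stressors": {
                # assumption: one worker per stressor
                "cpu": {"workers": 1, "load": cpu_load},
                "memory": {"workers": 1, "size": memory_size},
            },
            "duration": duration,
        },
    }


manifest = stress_chaos_manifest(
    name="cpu-10-mem-128",                    # assumed name
    namespace="skywalking",
    app_label="spring-boot-skywalking-demo",  # assumed pod label
    cpu_load=10,                              # CPU load in percent
    memory_size="128MB",
    duration="5m",                            # assumed duration
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and passing it to kubectl apply -f would create the experiment, provided the selector matches the labels that the demo pods actually carry.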

During the time period between the two short, green lines, the service load decreases to 4,929 CPM, but returns to normal after the chaos experiment ends.

The service load variation under the first chaos condition

The service load variation under the second chaos condition

CPU Load: 100%; Memory Load: 128 MB

When the CPU usage is at 100%, the service load decreases to only 40% of what it would be if no chaos experiments were taking place.

The service load variation under the third chaos condition
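To put the reported figures side by side, the short calculation below expresses each result relative to the baseline. It assumes that the "about 5,300 CPM" measured before the experiments is the baseline for both cases; since that figure is approximate, the results are rough as well.

```python
BASELINE_CPM = 5300  # approximate load measured before any chaos experiment


def relative_drop(baseline, observed):
    """Fraction of the baseline load that was lost during an experiment."""
    return (baseline - observed) / baseline


# 10% CPU load: the service load fell to 4,929 CPM.
drop_low_cpu = relative_drop(BASELINE_CPM, 4929)

# 100% CPU load: the service load fell to 40% of the baseline.
cpm_full_cpu = 0.40 * BASELINE_CPM

print(f"10% CPU load: {drop_low_cpu:.1%} fewer calls per minute")
print(f"100% CPU load: about {cpm_full_cpu:.0f} CPM remaining")
```

Under that assumption, the 10% CPU experiment costs roughly 7% of the throughput, while the full CPU load leaves about 2,120 CPM.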

Summary

Because the Linux process scheduler does not allow a single process to occupy the CPU all the time, the deployed Spring Boot demo can still handle 40% of the access requests even in the extreme case of a full CPU load.

By combining SkyWalking and Chaos Mesh, you can clearly observe when and to what extent chaos experiments affect application service performance. This combination of tools lets you observe the service performance in various extreme conditions, thus boosting your confidence in your services.
