Comparing CNCF Chaos Engineering Tools

jasbirs · ‎12-16-2022

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

There are numerous ways in which applications can fail: servers overheat, disks fail, and network connections become flaky. We assume that because we have replicated our servers, backed up our database instances, and spread our applications across multiple regions/zones, we are prepared to handle disaster scenarios.

But are we really sure that our distributed systems/applications are resilient? The only way to prove that your systems/applications are resilient to failure is to experience failure and to make swift responsiveness to failure an integral part of your software/system/applications.

Chaos engineering is the practice of routinely testing your system’s resilience by inducing controlled failures.

CHAOS IN PRACTICE

To specifically address the uncertainty of distributed systems/applications at scale, Chaos Engineering can be thought of as the facilitation of controlled experiments to uncover systemic weaknesses. These experiments follow four steps:

1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
2. Hypothesise that this steady state will continue in both the control group and the experimental group.
3. Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.

The harder it is to disrupt the steady state, the more confidence we have in the behavior of the system. If a weakness is uncovered, we now have a target for improvement before that behavior manifests in the system at large.
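The four steps can be sketched in a few lines of Python. This is only a toy simulation under stated assumptions: the success-rate metric, the failure probabilities, the tolerance and every function name are hypothetical, and a real experiment would measure a live system instead of drawing random numbers.

```python
import random


def measure_success_rate(failure_probability, requests=1000, seed=0):
    """Simulate a group of requests and return the fraction that succeed."""
    rng = random.Random(seed)
    successes = sum(
        1 for _ in range(requests) if rng.random() >= failure_probability
    )
    return successes / requests


def run_experiment(baseline_failure=0.01, injected_failure=0.10, tolerance=0.02):
    # Step 1: the steady state is a measurable output, here a success rate.
    # Step 2: the hypothesis is that both groups stay within `tolerance`.
    control = measure_success_rate(baseline_failure, seed=1)
    # Step 3: introduce a real-world event into the experimental group only
    # (modelled here as a higher chance that a request fails).
    experimental = measure_success_rate(injected_failure, seed=2)
    # Step 4: try to disprove the hypothesis by comparing the two groups.
    deviation = abs(control - experimental)
    return {
        "control": control,
        "experimental": experimental,
        "hypothesis_holds": deviation <= tolerance,
    }


if __name__ == "__main__":
    print(run_experiment())
```

When the injected fault moves the experimental group outside the tolerance, the hypothesis is disproved and the weakness becomes a target for improvement.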

Several tools, at different maturity levels, are available for running chaos experiments against distributed systems/applications. In this blog post I compare three of them: Chaos Toolkit, Litmus Chaos, and Chaos Mesh. All these tools are CNCF projects.

I will compare these three tools across the following categories: Installation & Management, Experiment definitions & Variety, Security, and Observability.

Tools Comparison

Chaos Toolkit

Chaos Toolkit focusses heavily on extensibility and aims to become the framework for creating custom chaos tools and experiments. It embraces the full lifecycle of an experiment: it runs checks (called probes) at the beginning of an experiment to verify the state of the target application, then runs actions against the system to cause instability, and finally verifies whether the expected final state is achieved. It lets you declare and store your Chaos Engineering experiments as JSON/YAML files, so you can collaborate on them and orchestrate them like any other piece of code (Chaos as Code). Driver extensions, like the AWS Driver or the Kubernetes Driver, can be easily installed to make additional actions available against an extended list of target platforms. New custom drivers can be created, or existing ones enhanced, to make more types of probes and actions available for experiments.

Installation and management

Python Requirements

The chaostoolkit CLI is implemented in Python 3 and therefore requires a working Python installation to run. It officially supports Python 3.7+. It has only been tested against CPython.

Create a virtual environment

Dependencies can be installed for your system via its package management but, more likely, you will want to install them yourself in a local virtual environment.

python3 -m venv ~/.venvs/chaostk

Make sure to always activate your virtual environment before using it:

source ~/.venvs/chaostk/bin/activate

Install the CLI

Install chaostoolkit in the virtual environment as follows:

pip install -U chaostoolkit

You can verify the command was installed by running:

chaos --version

Deploy Chaos Toolkit as a Kubernetes Operator

Kubernetes operators are a popular approach to creating bespoke controllers for any application on top of the Kubernetes API.

The Chaos Toolkit operator listens for experiment declarations and triggers a new Kubernetes pod, running the Chaos Toolkit with the specified experiment.

Deploy the operator

The operator can be found on the Chaos Toolkit incubator.

It is deployed via typical Kubernetes manifests, which need to be applied via Kustomize, the Kubernetes-native configuration manager.

First, download the Kustomize binary:

curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash

For macOS, you can also install it via the Homebrew package manager:

brew install kustomize

Next, simply run the following:

kustomize build manifests/overlays/generic-rbac | kubectl apply -f -

Experiment definition and variety

The Chaos Toolkit aims to give you the simplest experience for writing and running your own Chaos Engineering experiments. The main concepts are all expressed in an experiment definition, of which the following is an example.

{
  "version": "1.0.0",
  "title": "System is resilient to provider's failures",
  "description": "Can our consumer survive gracefully a provider's failure?",
  "tags": [
    "service",
    "kubernetes",
    "spring"
  ],
  "configuration": {
    "app_name": {
      "type": "env",
      "key": "LABEL_NAME"
    },
    "name_space": {
      "type": "env",
      "key": "NAME_SPACE"
    }
  },
  "steady-state-hypothesis": {
    "title": "Killing the pod where application is running",
    "probes": [
      {
        "type": "probe",
        "name": "there-should-be-at-least-2-running-app-replicas",
        "tolerance": 3,
        "provider": {
          "type": "python",
          "module": "chaosk8s.pod.probes",
          "func": "count_pods",
          "arguments": {
            "label_selector": "app=${app_name}",
            "ns": "${name_space}"
          }
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "Terminate_pod",
      "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": {
          "label_selector": "app=${app_name}",
          "name_pattern": "${app_name}",
          "ns": "${name_space}",
          "rand": true,
          "mode": "fixed",
          "qty": 1
        }
      },
      "pauses": {
        "after": 20
      }
    }
  ],
  "rollbacks": []
}

The key concepts of the Chaos Toolkit are Experiments, Steady State Hypothesis and the experiment’s Method. The Method contains a combination of Probes and Actions.
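To make the flow of such an experiment concrete, here is a minimal sketch of how a hypothesis and a method interact. It is not the Chaos Toolkit's implementation: the `providers` mapping, the fake cluster and the activity names are hypothetical, and the tolerance check is simplified to two cases (a scalar means "equal to", a list means "one of").

```python
def within_tolerance(value, tolerance):
    """Simplified check: a list means 'one of', anything else means 'equal to'."""
    if isinstance(tolerance, list):
        return value in tolerance
    return value == tolerance


def run_experiment(experiment, providers):
    """Run the probes, then the method, then the probes again.

    `providers` is a hypothetical mapping from an activity name to a
    zero-argument callable standing in for the real driver function.
    """
    def is_steady():
        probes = experiment["steady-state-hypothesis"]["probes"]
        return all(
            within_tolerance(providers[probe["name"]](), probe["tolerance"])
            for probe in probes
        )

    if not is_steady():
        return "not steady before the method"
    for activity in experiment["method"]:
        providers[activity["name"]]()
    return "hypothesis held" if is_steady() else "deviation found"


# A fake cluster with three replicas and no controller to replace lost pods.
cluster = {"pods": 3}


def count_pods():
    return cluster["pods"]


def terminate_pod():
    cluster["pods"] -= 1


experiment = {
    "steady-state-hypothesis": {
        "probes": [{"name": "count-app-pods", "tolerance": 3}]
    },
    "method": [{"name": "terminate-pod"}],
}
providers = {"count-app-pods": count_pods, "terminate-pod": terminate_pod}
outcome = run_experiment(experiment, providers)
print(outcome)
```

Under this simplified model, a scalar tolerance of 3 in the experiment above would require exactly three pods both before and after the method, even though the probe's name speaks of at least two replicas. The 20-second pause after the action gives the cluster time to start a replacement pod before the hypothesis is checked again.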

Security

To watch and manage its own CRDs, the Chaos Toolkit operator needs a service account with enough privileges to do its job. For instance, to run a simple experiment to delete an application pod in a given namespace, the operator will create a chaos toolkit pod using a service account with enough permissions to delete pods.

Specific network access or more elevated privileges may be required, depending on which additional drivers are used. This modular approach makes it easier to keep things secure, as you can pick or develop drivers that match your own requirements.
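As an illustration of "enough permissions to delete pods", the sketch below builds a namespaced Kubernetes Role as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The role name and namespace are hypothetical, and the verbs are an assumption about what a pod-deletion experiment typically needs; the operator's own manifests remain the authoritative source.

```python
import json


def pod_deleter_role(namespace, name="chaos-pod-deleter"):
    """Build a Role that only allows reading and deleting pods in one namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],  # "" is the core API group, where pods live
                "resources": ["pods"],
                "verbs": ["get", "list", "delete"],
            }
        ],
    }


role = pod_deleter_role("my-app-namespace")
print(json.dumps(role, indent=2))
```

Scoping the permissions to a Role in a single namespace, rather than a ClusterRole, keeps the experiment's reach limited to the application under test.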

Observability

Chaos Toolkit has a Prometheus driver to export metrics and events from the experiments. It also has an Open Tracing driver as well as a Humio one. However, the tool does not yet provide a standardised report of the experiment results, which means that the way to observe the flow of the experiment is by checking the logs of Chaos Toolkit itself.

Litmus Chaos

LitmusChaos is a Cloud-Native Chaos Engineering Framework with cross-cloud support. It is a CNCF Sandbox project with adoption across several organizations. Its mission is to help Kubernetes SREs and Developers find weaknesses both in non-Kubernetes platforms and in platforms and applications running on Kubernetes, by providing a complete Chaos Engineering framework and associated Chaos Experiments.

Installation and management

Prerequisites

Before deploying LitmusChaos, make sure the following are in place:

Kubernetes 1.17 or later
A Persistent volume of 20GB
Helm 3 or kubectl

Install Litmus using Helm

The helm chart will install all the required service account configuration and ChaosCenter.

The following steps will help you install Litmus ChaosCenter via helm.

Step-1: Add the litmus helm repository

helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo list

Step-2: Create the namespace on which you want to install Litmus ChaosCenter

The ChaosCenter can be placed in any namespace, but for this scenario we choose litmus as the namespace.

kubectl create ns litmus

Step-3: Install Litmus ChaosCenter

helm install chaos litmuschaos/litmus --namespace=litmus --set portal.frontend.service.type=NodePort

Experiment definition and variety

The interesting part about Litmus is that it provides a well-defined way to choose your own experiment runner. It uses the concept of chaos libraries that define the packages to be used for the execution of the experiment.

This makes Litmus a very extensible and tool-agnostic framework, instead of just another chaos injection tool.

The experiment execution is triggered upon creation of the ChaosEngine resource. Typically, these chaosengines are embedded within the ‘steps’ of a Litmus Chaos Workflow. However, one may also create the chaos engines directly by hand, and the chaos-operator reconciles this resource and triggers the experiment execution.
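A ChaosEngine created by hand could look roughly like the manifest built below. This is a sketch from my understanding of the litmuschaos.io/v1alpha1 schema; the field names should be checked against the CRD installed in your cluster, and the engine name, service account, label and duration are hypothetical values.

```python
import json


def chaos_engine(name, namespace, app_label, experiment, service_account):
    """Build a ChaosEngine that targets a deployment selected by label."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "engineState": "active",
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": service_account,
            "experiments": [
                {
                    "name": experiment,
                    "spec": {
                        "components": {
                            "env": [
                                # Assumed tunable: total chaos time in seconds.
                                {"name": "TOTAL_CHAOS_DURATION", "value": "30"}
                            ]
                        }
                    },
                }
            ],
        },
    }


engine = chaos_engine(
    name="nginx-chaos",
    namespace="default",
    app_label="app=nginx",
    experiment="pod-delete",
    service_account="pod-delete-sa",
)
print(json.dumps(engine, indent=2))
```

Saving the printed JSON to a file and applying it with kubectl would create the resource that the chaos-operator then reconciles.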

Security

Litmus requires a well-defined set of cluster role permissions. Additionally, a prerequisite for every experiment is for the experiment-specific service account, role, and role binding objects to exist in the target namespace. Litmus provides a thorough way of identifying the target workloads, starting from the higher-level object and finishing on the pod level. This serves well in limiting the blast radius and ensuring that chaos is injected only on the intended workloads.

Litmus is a multi-faceted framework with different layers that all need the appropriate attention from a security standpoint.

Observability

The reporting side of Litmus is driven mainly by the chaosresult Custom Resource. This is a customisable object that can be enhanced with more details about the experiment. However, at the moment it provides very simple information, mainly around the status of the experiment by displaying important events and eventually its result.
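Reading a chaosresult programmatically might look like the sketch below. The sample object is hand-written rather than captured from a cluster, and the `status.experimentStatus` layout with its `phase` and `verdict` fields is an assumption that should be verified against a real ChaosResult (for example, one exported with kubectl in JSON form).

```python
def summarize_chaos_result(result):
    """Return a one-line summary of a ChaosResult given as a dictionary."""
    status = result.get("status", {}).get("experimentStatus", {})
    name = result.get("metadata", {}).get("name", "<unnamed>")
    phase = status.get("phase", "Unknown")
    verdict = status.get("verdict", "Awaited")
    return f"{name}: phase={phase}, verdict={verdict}"


# Hand-written sample; field names are assumed, not copied from a cluster.
sample_result = {
    "kind": "ChaosResult",
    "metadata": {"name": "nginx-chaos-pod-delete"},
    "status": {"experimentStatus": {"phase": "Completed", "verdict": "Pass"}},
}

print(summarize_chaos_result(sample_result))
```

The defensive `.get` calls mean a result that has not been populated yet still produces a readable summary instead of raising a KeyError.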

Chaos Mesh

Chaos Mesh is an open source cloud-native Chaos Engineering platform. It offers various types of fault simulation and has an enormous capability to orchestrate fault scenarios. Using Chaos Mesh, you can conveniently simulate various abnormalities that might occur in reality in development, testing, and production environments and find potential problems in the system. To lower the threshold for a Chaos Engineering project, Chaos Mesh provides a visual interface: you can easily design your Chaos scenarios on the Web UI and monitor the status of Chaos experiments.

Installation and management

Install Chaos Mesh using Helm (Recommended for Production Deployments)

Step 1: Add Chaos Mesh repository

Add the Chaos Mesh repository to the Helm repository:

helm repo add chaos-mesh https://charts.chaos-mesh.org

Step 2: View the installable versions of Chaos Mesh

To see charts that can be installed, execute the following command:

helm search repo chaos-mesh

Step 3: Create the namespace to install Chaos Mesh

It is recommended to install Chaos Mesh under the chaos-mesh namespace, or you can specify any namespace to install Chaos Mesh:

kubectl create ns chaos-mesh

Step 4: Install Chaos Mesh in different environments

You can execute the following installation commands according to different environments.

Docker

# Default to /var/run/docker.sock
helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-mesh --version 2.5.0

Containerd

helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-mesh --set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/containerd/containerd.sock --version 2.5.0

K3s

helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-mesh --set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/k3s/containerd/containerd.sock --version 2.5.0

CRI-O

helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-mesh --set chaosDaemon.runtime=crio --set chaosDaemon.socketPath=/var/run/crio/crio.sock --version 2.5.0

Verify the installation

To check the running status of Chaos Mesh, execute the following command:

kubectl get po -n chaos-mesh

The expected output is as follows:

NAME                                        READY   STATUS    RESTARTS   AGE
chaos-controller-manager-69fd5c46c8-xlqpc   3/3     Running   0          2d5h
chaos-daemon-jb8xh                          1/1     Running   0          2d5h
chaos-dashboard-98c4c5f97-tx5ds             1/1     Running   0          2d5h

Experiment definition and variety

The chaos types are grouped into the following categories: network, pod, I/O, time, kernel and stress, each one with its own CRD type. They all share a common selector entry as a way to find target pods, besides the optional duration or recurrent scheduling of the desired chaos. You can create experiments using the chaos dashboard or in YAML format.

Create Experiments Using Chaos Dashboard

Create experiments using YAML configuration files

pod-failure example
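A pod-failure experiment could look roughly like the manifest built below. This is a sketch from my understanding of the chaos-mesh.org/v1alpha1 PodChaos schema, not a configuration taken from this post: the experiment name, namespace, label and duration are hypothetical, and the printed JSON can be applied with kubectl in the same way as a YAML file.

```python
import json


def pod_failure_chaos(name, namespace, labels, duration="30s"):
    """Build a PodChaos object that makes one matching pod fail for `duration`."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-failure",
            "mode": "one",  # pick a single pod among those selected
            "duration": duration,
            "selector": {"labelSelectors": labels},
        },
    }


chaos = pod_failure_chaos(
    name="pod-failure-example",
    namespace="chaos-mesh",
    labels={"app": "my-app"},
)
print(json.dumps(chaos, indent=2))
```

The selector block is the common entry shared by all chaos types, so the same pattern should carry over to network or stress experiments with a different kind and action.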

Security

Chaos Mesh uses some Linux utilities to implement the low-level chaos types. It also needs to use the Docker API on the host machine. Therefore, the daemon Pods (deployed as a DaemonSet) run as privileged containers and mount the /var/run/docker.sock socket file. If sidecar injection is enabled, the controller manager Pod requires permissions to manage MutatingWebhookConfiguration, besides some other expected role-based access control (RBAC) permissions.

Observability

The main project repository mentions a chaos dashboard side project, but it seems it works exclusively for tests with their database product. Building a more generic dashboard project is on the roadmap. So far, the state of chaos experiments can be monitored by inspecting the Custom Resources objects in the cluster.

Key Takeaways

We can categorize chaos engineering tools either as chaos orchestrators, with Litmus and Chaos Toolkit being the prominent ones, or as chaos injectors, like Chaos Mesh. The chaos orchestrators aim to provide well-defined experiments using proper chaos engineering principles. Litmus is a more complete framework that still provides extensibility, while Chaos Toolkit aims to become the standard API to define experiments.

The chaos injectors focus on the execution of experiments. Chaos Mesh streamlines the execution of experiments in Kubernetes out-of-the-box.

Depending on whether you need an executor or an orchestrator, there are a lot of open-source options available, all with their own advantages and disadvantages.
