Introducing Maira

Maira is a low-code workflow platform built for SRE and Ops teams to help them automate their high-toil day-2 activities. Day-2 operations include everything needed to operate a cloud application optimally after the deployment phase is over -- from troubleshooting, security, and compliance to cost control. Using Maira, the operations team (CloudOps, SRE, SecOps, NetOps, or anyOps) or developers can rapidly develop sophisticated human-in-the-loop stateful automation as code.

No team can succeed in the automation journey if automation remains a parallel activity and the responsibility of a select few team members. Automation scripts will always lag or be buggy if users interact with the system using different tools for their daily troubleshooting and operational needs. An automation platform must enable the entire organization to be on the same page and work collaboratively with the same tools, even for interactive manual activities. In general, ops automation is built over time as the application evolves and the team gains operational experience. The platform must make it easy to automate the steps first tried manually, since the experience gained from these sessions ultimately turns into automation. This is why automation, manual troubleshooting, access control, and observability must converge on a common platform.

Today, the line between dev and the various ops teams is increasingly blurred. We built Maira from the ground up to be a centralized platform for modern ops, to help frontline engineers who wear multiple hats. With the right privilege grants and audit capabilities, the core ops team retains complete control and visibility into the production system while empowering distributed citizen developers and part-time operators across teams to contribute. Maira improves the security and compliance of the production environment while reducing knowledge silos. An effective shift-left strategy removes bottlenecks in ops and helps service owners operate their apps optimally, reducing busy work for everyone.

Bridging the gap between Dev-friendly and Low Code.

Low-code platforms are typically built for non-developers to enable them to automate a task without involving developers or relying on IT departments. These platforms support a visual drag-and-drop interface and pre-built feature blocks. Such products are effective in environments with low development volume, complexity, and security concerns. In the Ops environment, automation scripts churn frequently as application complexity grows. Hence, a typical low-code platform (backed by YAML, etc.) is unsuitable for such use cases, as it boxes developers into a visual interface. Developing anything beyond a few steps is painful or impossible in a UI. Imagine making sense of a large script with many paths in a graphical interface. Version control and other aspects of production code maintainability are problematic in the drag-and-drop world.

Maira is architected to provide the full power of a high-level language. It allows developers to create 'stateful workflows' using a Python-like domain-specific language (DSL) called MPL. MPL provides a sophisticated development environment without compromising on the benefits of workflows. Today, ops teams are used to the power of a general-purpose language like Python that allows them to create complex scripts. With MPL, the same development flexibility is available, and with out-of-the-box integrations, one can develop fault-tolerant, scalable, long-running applications with much less code. Non-programmers can also use the familiar drag-and-drop visual interface for development. Maira's hosted workflow orchestrator handles scheduling, scaling, fault tolerance, etc., to execute the workflows reliably without operational overhead.

Use cases

Operators use Maira for various use cases. Some of them are listed below. Maira is evolving continuously with new integrations and language capabilities. Feel free to contact us for any automation use cases you may have.

Manual and automated troubleshooting.
Trigger workflows on receiving events. Collect system state in real-time.
Automatically respond to security events. Collect forensic data and take action based on the impacted user's input.
Periodic compliance tasks like certificate rotation, password reset, etc.
Manage cloud costs by continuously monitoring, notifying, and deleting resources upon approval.
Various human-in-the-loop automation tasks using notification and survey features on Slack.
Segmenting access to the production cluster within the team using Maira's deep RBAC capabilities.
Upgrade or deploy security patches on a large number of target devices.
Generate periodic reports as needed for compliance and user tracking.
Build internal tools.

Maira tour

Let's take a quick tour of the platform by reviewing a case that operators may carry out. Here, the operator is experimenting with chaos testing, so they want to periodically delete a pod whose container has agent in its image name. For this, the following activities are typically carried out:

Interactively root-cause the issue and try actions.
Automate the resolution steps.
Deploy the automation to run periodically.

Let's go through this journey with Maira. We will assume that Maira is already deployed as described here.

Interactive troubleshooting

The best way for an operator to troubleshoot the application is via the MPL REPL. This REPL shell supports all the available Maira integrations with other apps and makes them available as CLIs. Here we will use the k8s CLI as shown below.

mpl> !k8s list-pods
NAME                            STATUS  RESTARTS        CONTAINERS      AGE
hello-python-75b7fb8899-n2fzc   Running                 hello-python    2d20h
hello-python-75b7fb8899-qz5d7   Running                 hello-python    2d20h
r1-maira-agent-69c87875b9-pq2dd Running                 maira-gateway   2d20h

In MPL, all integrations are available as CLIs and are entered with a ! prefix. Most get-style commands, like the one above, print their output in a human-friendly tabular format. To access CLI output fields programmatically, traversing JSON is the best way. For that, in MPL, any CLI output can be assigned to a variable, as shown below.

mpl> x = !k8s list-pods
mpl> print x
[
    {
        "cluster": "in-cluster",
        "namespace": "default",
        "name": "hello-python-75b7fb8899-n2fzc",
        "containers": [
            {
                "name": "hello-python",
                "image": "gcr.io/macro-context-293714/hello-python:latest",
                "state": 3,
                "ready": true,
                "StateDetails": {
                    "StateRunning": {}
                }
            }
        ],
        "node_ip": "10.128.15.203",
        "pod_ip": "10.44.0.14",
        "state": 3,
        "labels": {
            "app": "hello-python",
            "pod-template-hash": "75b7fb8899"
        },
        "pod_conditions": [
            {
                "type": 2,
                "status": 1
            },
            {
                "type": 3,
                "status": 1
            },
            {
                "type": 1,
                "status": 1
            },
            {
                "type": 4,
                "status": 1
            }
        ],
        "status": "Running",
        "start_time_millis": 1677207124000
    },
    {
        "cluster": "in-cluster",
        "namespace": "default",
        "name": "hello-python-75b7fb8899-qz5d7",
        "containers": [
            {
                "name": "hello-python",
                "image": "gcr.io/macro-context-293714/hello-python:latest",
                "state": 3,
                "ready": true,
                "StateDetails": {
                    "StateRunning": {}
                }
            }
        ],
        "node_ip": "10.128.15.201",
        "pod_ip": "10.44.3.4",
        "state": 3,
        "labels": {
            "app": "hello-python",
            "pod-template-hash": "75b7fb8899"
        },
        "pod_conditions": [
            {
                "type": 2,
                "status": 1
            },
            {
                "type": 3,
                "status": 1
            },
            {
                "type": 1,
                "status": 1
            },
            {
                "type": 4,
                "status": 1
            }
        ],
        "status": "Running",
        "start_time_millis": 1677206678000
    },
    {
        "cluster": "in-cluster",
        "namespace": "default",
        "name": "r1-maira-agent-69c87875b9-pq2dd",
        "containers": [
            {
                "name": "maira-gateway",
                "image": "gcr.io/maira-public/maira-agent:latest",
                "state": 3,
                "ready": true,
                "StateDetails": {
                    "StateRunning": {}
                }
            }
        ],
        "node_ip": "10.128.15.202",
        "pod_ip": "10.44.2.4",
        "state": 3,
        "labels": {
            "app.kubernetes.io/instance": "r1",
            "app.kubernetes.io/name": "maira-agent",
            "pod-template-hash": "69c87875b9"
        },
        "annotations": {
            "kubectl.kubernetes.io/restartedAt": "2022-11-09T12:50:19-08:00"
        },
        "pod_conditions": [
            {
                "type": 2,
                "status": 1
            },
            {
                "type": 3,
                "status": 1
            },
            {
                "type": 1,
                "status": 1
            },
            {
                "type": 4,
                "status": 1
            }
        ],
        "status": "Running",
        "start_time_millis": 1677206896000
    }
]

Note that when CLI output is assigned to a variable (x in the above example), the variable is of type JSON and holds the CLI output received from the backend. JSON is a first-class object type in MPL. If the CLI is entered without assigning it to a variable, the MPL REPL assumes that tabular output is desired, so it renders the JSON as a table. This feature makes troubleshooting any app much easier: there is no need to parse through complex JSON during an outage.
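To make the two output modes concrete, here is a minimal Python sketch of how a list of JSON records could be flattened into a table. It is not Maira's renderer; the function name render_table, the column selection, and the trimmed-down sample records are assumptions made for illustration.

```python
import json


def render_table(records, columns):
    """Render a list of dicts as a fixed-width text table (illustrative only)."""
    rows = [[str(rec.get(col, "")) for col in columns] for rec in records]
    # Each column is as wide as its longest cell, header included.
    widths = [
        max([len(col)] + [len(row[i]) for row in rows])
        for i, col in enumerate(columns)
    ]

    def fmt(cells):
        return "  ".join(cell.ljust(w) for cell, w in zip(cells, widths)).rstrip()

    header = fmt([col.upper() for col in columns])
    return "\n".join([header] + [fmt(row) for row in rows])


# Hypothetical, trimmed-down version of the JSON shown above.
pods = json.loads(
    '[{"name": "hello-python-75b7fb8899-n2fzc", "status": "Running"},'
    ' {"name": "r1-maira-agent-69c87875b9-pq2dd", "status": "Running"}]'
)
table = render_table(pods, ["name", "status"])
print(table)
```

The same records stay available as structured data for scripting, while the table is only a view over them.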

Maira supports an intuitive syntax for accessing JSON fields. For example, below we iterate over all array elements in x by omitting the index (x[]), and for each element we access the name field of the entry at index 0 in the containers array. This effectively prints the CONTAINERS column of the table output.

mpl> print x[].containers[0].name
"hello-python",
"hello-python",
"maira-gateway"

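For readers more familiar with Python, the traversal above corresponds roughly to a list comprehension. This is only a sketch of the semantics as described in the text, not of how MPL evaluates the expression; the pods list is a hypothetical, trimmed copy of the JSON held in x.

```python
# Hypothetical, trimmed copy of the list-pods JSON from the session above.
pods = [
    {"name": "hello-python-75b7fb8899-n2fzc",
     "containers": [{"name": "hello-python"}]},
    {"name": "hello-python-75b7fb8899-qz5d7",
     "containers": [{"name": "hello-python"}]},
    {"name": "r1-maira-agent-69c87875b9-pq2dd",
     "containers": [{"name": "maira-gateway"}]},
]

# Rough equivalent of x[].containers[0].name:
# visit every pod, take its first container, read the name field.
names = [pod["containers"][0]["name"] for pod in pods]
print(names)
```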
Next, we have to get the name of the pod that contains "agent" in the image name. For this, we filter for the containers whose image field value contains the string agent. This is accomplished with the filter operator, [? ..], where we specify the comparator == matched against a glob (specified within backticks), as below. More on this in later chapters.

mpl> x[].containers[? .image == `*agent*`].name
"maira-gateway"

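The glob filter can be approximated in Python with the standard fnmatch module. This is a sketch of the matching behaviour only, assuming the glob follows the usual shell-style wildcard rules; the pods list is a hypothetical, trimmed copy of the JSON held in x.

```python
from fnmatch import fnmatchcase

# Hypothetical, trimmed copy of the list-pods JSON from the session above.
pods = [
    {"name": "hello-python-75b7fb8899-n2fzc",
     "containers": [{"name": "hello-python",
                     "image": "gcr.io/macro-context-293714/hello-python:latest"}]},
    {"name": "r1-maira-agent-69c87875b9-pq2dd",
     "containers": [{"name": "maira-gateway",
                     "image": "gcr.io/maira-public/maira-agent:latest"}]},
]

# Rough equivalent of x[].containers[? .image == `*agent*`].name:
# keep only containers whose image matches the glob, then read their name.
matches = [
    container["name"]
    for pod in pods
    for container in pod["containers"]
    if fnmatchcase(container["image"], "*agent*")
]
print(matches)
```

Note that the expression yields the name field of the matching container entries, which is what the MPL output above shows.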
As simple as that. Now, we simply delete the pod.

mpl> p = x[].containers[? .image == `*agent*`].name
mpl> !k8s delete-pod p

But what if we want an approval before deleting the pod? For this, we simply add the confirm command ahead of the delete step. The confirm command sends an email to a preconfigured email list and waits for the recipients to click on the approval link. The workflow orchestrator simply waits for the approval to arrive.

msg = "can i delete pod: " + p
z = ! confirm msg
if z == "confirmed":
    ! k8s delete-pod p
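The approval gate above follows a simple pattern: ask, wait for the answer, and act only on an explicit confirmation. A minimal Python sketch of that pattern is shown below. The function delete_with_approval and its two callbacks are hypothetical stand-ins, and the "rejected" answer is an assumption; the source only shows the "confirmed" value.

```python
def delete_with_approval(pod, request_approval, delete_pod):
    """Call delete_pod(pod) only if request_approval returns "confirmed"."""
    answer = request_approval("can i delete pod: " + pod)
    if answer == "confirmed":
        delete_pod(pod)
        return True
    return False


# Stand-ins for the real integrations: approvals are hard-coded and
# "deleting" a pod just records its name.
deleted = []
approved = delete_with_approval("maira-gateway", lambda msg: "confirmed", deleted.append)
rejected = delete_with_approval("hello-python", lambda msg: "rejected", deleted.append)
print(deleted)
```

Anything other than an explicit confirmation leaves the pod untouched, which is the safe default for a destructive action.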

And maybe send a Slack notification if needed.

msg = "delete pod: " + p
! slack send-message --name "#general" --message msg

Creating a workflow

We saw the troubleshooting steps executed in the MPL REPL shell. Let's say that after a few days we decide to automate those steps and deploy them as a workflow. MPL supports resuming any past session, with all previous program state in the REPL intact. For example, below you can see that the previous variables p and x are intact. (Note: -r resumes the previous session if no session-id is provided.)

> mpl -r

mpl version: 0.9 
Type "/help" for more information.
Please use "ctrl-d" to exit this program.
Resumed last session: 63efe18f38d203d7911576f2
mpl> print p
"maira-gateway"
mpl> print x[].containers[].name
"hello-python",
"hello-python",
"maira-gateway"

Next, we can add a few commands to the session if desired, and then use the /save-session command to save the session to a local file and edit it in a text editor or IDE. The following is the edited session file, to be deployed as a workflow.

> cat session.mpl
x = !k8s list-pods
p = x[].containers[? .image == `*agent*`].name
msg = "can i delete pod " + p
# @label: Got approval?
z = ! confirm msg
msg = "Deleted pod: "+p
# @label: If approved?
if z == "confirmed":
    ! k8s delete-pod p
else:
    msg = "Did not delete pod: "+p
! slack send-message --name "#general" --message msg

Note that the exact REPL session dump is our workflow. There is no need to replicate the logic and CLIs in Python or JS. What you try is what you get. The CLIs are directly usable in the MPL script, and with the familiar Python syntax and integration features available in MPL, creating a workflow has never been easier. The same script is executed on the cloud orchestrator as a stateful workflow. There is no need to import any package or worry about persisting data to recover from failure. Suppose the program dies because of an infrastructure failure while waiting on the confirm command: the data in x is automatically stored, and the program is recovered and resumed as if no failure had happened.
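To illustrate what "stateful" buys you, the following Python sketch shows one common way to get this behaviour by hand: checkpoint each step's result to disk and skip completed steps on restart. It is not a description of Maira's orchestrator; the helpers load_state and run_step and the JSON checkpoint file are assumptions made for illustration.

```python
import json
import os
import tempfile


def load_state(path):
    """Load previously checkpointed variables, or start empty."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}


def run_step(state, path, name, fn):
    """Run fn once; on later runs reuse the value saved in the checkpoint."""
    if name in state:
        return state[name]
    state[name] = fn()
    with open(path, "w") as f:
        json.dump(state, f)
    return state[name]


calls = []


def list_pods():
    # Stand-in for the real CLI call; records that it was executed.
    calls.append("list-pods")
    return ["r1-maira-agent-69c87875b9-pq2dd"]


path = os.path.join(tempfile.mkdtemp(), "session.json")

# First run: the step executes and its result is persisted.
state = load_state(path)
x_first = run_step(state, path, "x", list_pods)

# Simulated crash and restart: state is reloaded, the step is not re-executed.
state = load_state(path)
x_resumed = run_step(state, path, "x", list_pods)
print(x_resumed, calls)
```

This is the bookkeeping a plain script would have to carry itself, and which the text says the orchestrator handles automatically for MPL workflows.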

Everything you need for automation is available in MPL. No code bloat. Even a high schooler can make sense of this code. An equivalent Python script will be at least 5x more complex and harder to develop and maintain.

Deploying workflow

Deploy the above workflow script as follows:

>  mairactl playbook create --name "my-first-workflow" -f ./session.mpl

This deployed workflow can be triggered on receiving an event on our webhook endpoint, run periodically, or run manually. The deployed workflow looks as below.

If desired, the exact same workflow can also be created in the visual interface using drag-and-drop. Each integration CLI is available as a visual block.

Command completion

The REPL shell has command completion built in. This is an immense time saver for operators in war-room situations. CLI args can be completed dynamically by fetching candidate values in the background. Press ctrl-s to fetch args.
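Dynamic completion of this kind usually comes down to fetching candidate values and filtering them by what the user has typed so far. The Python sketch below shows that idea only; the complete function and the fetch_pod_names stand-in are hypothetical and do not reflect how the MPL shell is implemented.

```python
def complete(prefix, fetch_candidates):
    """Return the fetched candidates that start with prefix, sorted."""
    return sorted(c for c in fetch_candidates() if c.startswith(prefix))


def fetch_pod_names():
    # Stand-in for a background lookup against the cluster.
    return [
        "r1-maira-agent-69c87875b9-pq2dd",
        "hello-python-75b7fb8899-qz5d7",
        "hello-python-75b7fb8899-n2fzc",
    ]


suggestions = complete("hello", fetch_pod_names)
print(suggestions)
```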

Slack

Operators can execute CLIs from Slack as well, and the concept of a session applies here too. When the user enters a CLI, it is executed in the same manner as in the REPL shell, with all variable state intact. Internally, a separate session called slack is used to statefully execute CLIs after restoring the previous state. This gives operators a true ChatOps capability. For more details, refer to the SlackOps section.