What is a workflow?

A workflow, at its core, is a set of steps executed reliably, irrespective of how long they take to complete. The platform guarantees durable execution, state recovery, and resumption if a step fails for any reason. This fault tolerance is achieved by implementing the concept of durable functions, popularized by Azure. Fault-tolerant, stateful execution simplifies the life of automation developers: implementing human-in-the-loop and other long-running steps is a matter of a single line of code, and the orchestrator does the heavy lifting of suspending and resuming the workflow program while waiting for human input. All in-memory data is automatically persisted in durable storage, so the program can continue after a fault without the developer writing anything special in code; there is no need to explicitly persist data to make the program fault tolerant.
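
For instance, a long-running human-in-the-loop approval over Slack is ordinary imperative code. The snippet below is a minimal sketch in Maira's workflow language (the same syntax as the examples later in this document); the "@oncall" recipient, the options, and the timeout values are illustrative assumptions, and the orchestrator suspends the program at the slack ask call until the user responds or the timeout expires.

msg = "Deploy the new build to production?"
options = json ["approve", "reject"]
# the program suspends here, for seconds or days, until a human responds
x = !slack ask --message msg --name "@oncall" --options options --timeout "86400 s" --retry_interval "60 s"
# @label: deploy approved?
if x == "approve":
    !slack send-message --name "#general" --message "Deploy approved"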

There are a few popular general-purpose workflow platforms available in the market, such as Apache Airflow, Netflix Conductor, and Temporal. An exhaustive comparison of these tools is out of the scope of this document; however, they differ in scalability and in the abstractions available to the developer. Of these, we think Temporal is the leading workflow platform for today's developers, so we built the Maira cloud orchestrator around it.

With a traditional workflow platform, developers are required to implement their logic in terms of predefined DAGs: collections of tasks stitched together in a dependency graph, expressed either in Python or constructed in a UI. This unnatural rethinking in terms of DAGs and steps is a significant roadblock to automation velocity and maintainability, and such low-code products are not useful in a fast-paced ops environment.

Maira workflow benefits

Compared to existing low-code and workflow platforms, Maira has the following benefits:

  • Create workflows using simple Python-like code. No need to think in terms of steps or a DAG: write simple imperative code and get workflows.

  • No compromise on the low-code promise for non-developers. Use the drag-and-drop interface if desired.

  • A secure self-service platform with a rich set of integrations at your fingertips. Interact with the apps you already use, from Kubernetes, AWS, Datadog, and Elastic to Slack and email.

  • Managed orchestration, on-prem, cloud, or SaaS.

  • Developer-friendly REPL shell with built-in CLI support. Fast troubleshooting using CLIs in the terminal you love, backed by fine-grained access control. Resume and audit any old REPL session.

  • Low learning curve -- expert Python developers are not required. Easy syntax and much less code, as complexity is absorbed by the integration CLIs.

Durable execution internals

Internally, the Maira workflow orchestrator (powered by Temporal) executes every external system call as a durable function. The diagram below shows how MPL code is executed in a stateful manner. Program execution is backed by a database in which non-re-entrant state (marked in red) is automatically saved. When the program recovers from a failure, instead of re-executing the statements marked in red, the orchestrator uses the results saved in the database, continues until it reaches the failure point, and then resumes normal execution.

Durable stateful execution with MPL
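
As a rough sketch of the kind of program the diagram depicts (reusing commands from the examples later in this document; exactly which statements are persisted is our reading of the diagram), the external command call and the now built-in are the statements whose results are saved and replayed on recovery:

# the command result and the `now` timestamp are non-re-entrant: their results
# are saved in the db and replayed on recovery instead of being re-executed
vols = !ec2 list-volumes --state "available"
ts = now
msg = "Volume scan completed at " + ts.string()
!slack send-message --name "#general" --message msg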

However, the developer does not have to think about steps or durable function calls with Maira. The language understands which commands and built-ins (like the now time statement above) are to be executed in a durable manner. This stateful execution of MPL code has the following benefits:

  • Error handling is not needed in most cases as retries are part of stateful execution.

  • The orchestrator can suspend and resume any long-running call, whether it takes one second or more than a year. The developer doesn't need to do anything special.

  • Human-in-the-loop implementation is a single-line command, confirm x (see the sketch after this list).

  • In-memory code -- no need to save state in a database in most cases. Developers familiar with writing applications in the REST paradigm are used to holding state in a database, but here the entire business logic can be written as monolithic code without worrying about failures in the middle of a multi-step app.
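
A minimal sketch of such a human-in-the-loop gate follows; the volume id is illustrative, and we assume confirm simply blocks until a human acknowledges:

vol = "vol-0abc1234"
# the program suspends here until a human acknowledges, however long that takes
confirm vol
!ec2 delete-volume vol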

Examples

Handling security incidents

SecOps teams receive many incidents every day about password compromises and similar events. Responding to each of these manually is impossible. This workflow automates the process: the impacted user is automatically notified on Slack. The workflow is triggered on receiving a notification from the SIEM system, and all incidents are handled at scale, taking inputs from the affected party and the security admin.

msg = "Your user account had a malicious login attempt. What would you like to do?"
options = json ["disable", "reset password", "ignore"]
slack_user = "@"+input.user
x = ""
bm = "Malicious login attempt for user "+input.user
x = !slack ask --message msg --name slack_user --options options --timeout "10 s" --retry_interval "5 s"
# default message if the user does not respond with one of the options
msg = bm +"\nNo response from user"
# @label: user wants to disable?
if x == "disable":
    !user disable input.user
    msg = bm +"\nUser account disabled"
    y = !create ticket msg

# @label: user wants passwd reset?
if x == "reset password":
    !user reset-password input.user
    msg = bm +"\nUser password reset"
    y = !create ticket msg
# @label: user wants to ignore?
if x == "ignore":
    msg = "Malicious login event ignored"
    y = !slack send-message --name slack_user --message msg
Block diagram

Handling security incidents

Note that multiple instances of this workflow can run concurrently, each with its own state and with a different user interacting with it. This is where the scalability and reliability of workflows make life easier for automation and tools developers.

Delete unattached AWS volumes

Below is an example of cost saving through a continuous monitoring app. This is a janitor workflow, scheduled periodically by the orchestrator, that detects and cleans up detached EC2 volumes. We iterate over the list returned by the ec2 list-volumes command. Even if the program dies while processing the list, it can resume processing. The workflow terminates only if several retries of the CLI have failed; the developer does not need to handle any failure scenario, as the workflow resumes when the system and network recover. Note that with Maira's low-code platform, it is only two lines of code even when handling thousands of unused resources.

# find all volumes 'available' to be attached.
vols = !ec2 list-volumes --state "available"
# @label: for all unattached volumes
for vol in vols.volumes[].volume_id:
    !ec2 delete-volume vol
Block diagram

Delete unattached volumes

BGP flap

Collect real-time data -- this playbook monitors BGP flaps; if flaps happen repeatedly, it sends a report to the admin with data collected in real time for each flap. Here the data (ping and route information captured when a flap is detected) is accumulated over time, in memory, without worrying about failures or losing data. Essentially, the developer writes simple monolithic code and gets a fail-proof long-running app.

# @label: Find flap count for peer
def find_flap_count(info, peer):
    # @label: loop through the peers to find the specified one
    # assumes the Junos JSON output layout: a bgp-information element holding a
    # bgp-peer list, where each peer carries peer-address and flap-count fields
    for bgp_peer in info["bgp-information"][0]["bgp-peer"]:
        # @label: peer address matches?
        if bgp_peer["peer-address"][0].data == peer:
            return bgp_peer["flap-count"][0].data.num()

def get_data(peer, flap_count):
    # collect system state in real time.
    routes = ! junos show-route --site input.site --cluster junos-default peer --output-format "display"
    ping = ! junos ping --site input.site --cluster junos-default peer --count 3 --output-format "display"
    ts = now
    msg = "Timestamp: " + ts.string() + " Flap count: " + flap_count
    msg = msg + "\nRoute to this peer:\n" + routes
    msg = msg + "\nPing to this peer:\n" + ping
    return msg

flap_threshold = 2
wait_period = 30
peer = input.peer
x = ! junos show-bgp-summary
history = json []
start_time = now
initial_flap_count = find_flap_count(x, peer)
data = get_data(peer, initial_flap_count)
history.append(data)
prev_flap_count = initial_flap_count
# @label: forever loop
for:
    wait wait_period
    current = ! junos show-bgp-summary
    current_flap_count = find_flap_count(current, peer)
    if current_flap_count > prev_flap_count:
        data = get_data(peer, current_flap_count)
        history.append(data)
        prev_flap_count = current_flap_count
    flap_count = current_flap_count - initial_flap_count
    if flap_count > flap_threshold:
        end_time = now
        d = end_time - start_time
        msg = "BGP peer "+ peer + " had " + flap_count + " flaps in " + d.shortstring() + "\n"
        msg = msg + "\n".join(history)
        ! slack send-message --name "#general" --message msg
        break
Block diagram

Collect debug data for bgp flap