Event-Driven Ansible for Alert Notifications

This article shows you how to receive ThousandEyes alert notifications in Event-Driven Ansible (EDA), part of Ansible Automation Platform, using custom webhooks.

Introduction

Event-Driven Ansible (EDA) automatically executes actions, such as running an Ansible playbook, in response to events in a system or network. Using EDA, automation activities can be triggered based on specific events, such as a system failure, network change, or a new device being added. This approach allows for real-time response and proactive management of IT infrastructure. Instead of running Ansible scripts manually or on a schedule, the system can react to events and make necessary adjustments automatically. This can greatly improve the efficiency and responsiveness of IT operations.

The diagram below shows the high-level architecture of integrating ThousandEyes with Event-Driven Ansible in your environment.

The key component of EDA is the rulebook. The rulebook is where event sources, rules, and actions are configured. To integrate ThousandEyes alert notifications with an Event-Driven Ansible rulebook, you can use the webhook event source plugin from the ansible.eda collection in your Ansible rulebook.

Basic Ansible Rulebook for ThousandEyes Webhooks

Let’s begin with the simplest Ansible rulebook to receive a webhook from the ThousandEyes platform. The code block below shows an Ansible rulebook that defines a webhook event source that listens on port 8080. The rulebook contains one rule, and the rule’s condition always passes, so this rulebook will perform an action for every event received. In this case, the action is simply printing out the event details on the console.

---
- name: Receive and print ThousandEyes webhook events
  hosts: all
  sources:
    - ansible.eda.webhook:
        host: 0.0.0.0
        port: 8080
  rules:
    - name: Print event
      condition: 1 == 1
      action:
        debug:
          msg:
            - "{{ event }}"

Save this to a file named simple-rulebook.yaml and run the rulebook with the following command:

ansible-rulebook -r simple-rulebook.yaml
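
The webhook event source plugin used above comes from the ansible.eda collection. If that collection is not already installed on your EDA host, install it first:

ansible-galaxy collection install ansible.eda

Before involving ThousandEyes, you can optionally confirm that the source is listening by posting a sample JSON payload from another terminal (the path and body here are arbitrary test values):

curl -X POST -H "Content-Type: application/json" -d '{"message": "hello from curl"}' http://localhost:8080/test

The rulebook should print the received event, including its payload, to the console.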

Now that the rulebook is running on your EDA host and it’s ready to receive webhook events on port 8080, let’s create a ThousandEyes custom webhook to send alert notifications from the ThousandEyes platform to your EDA instance.

Creating a Custom Webhook

  1. In the ThousandEyes platform, navigate to Integrations, click + New Integration, and click Custom Webhook.

  2. In the Add Custom Webhook Integration panel, configure the fields for your new custom webhook.

    Name: EDA Webhook, or whatever you choose

    URL: http://<your-eda-host>:8080

    Preset Configurations: Generic

  3. Click Test at the bottom of the Add Custom Webhook Integration pane to send a test webhook from the ThousandEyes platform to your EDA server.

    You should see a success message in ThousandEyes indicating the webhook was successfully sent to EDA. On your EDA host, you should see the webhook body printed on the console.

  4. When your test has completed successfully, click Save to save the custom webhook integration for use with alert rules later in this guide.

In the minimal example above, we showed how to send ThousandEyes alert notifications to EDA using a custom webhook. Next, let’s take a look at how we can use that webhook payload in Ansible rule conditions so that EDA can automatically perform actions in response to the event.

The example rulebook and custom webhook shown above are for demonstration purposes only and do not include any authentication. In production or other sensitive environments, you should follow information security best practices, such as placing the EDA server behind a web application firewall, network firewall, and/or secure reverse proxy with TLS and authentication.

The custom webhook configuration shown above uses the built-in Generic preset body, and the rest of this article is based on that body format. However, custom webhooks are built to be flexible: they allow you to customize the body of the webhook by adding, removing, or reorganizing fields. For more on custom webhooks in general, see Custom Webhooks.

Using ThousandEyes Alert Notifications in Event-Driven Ansible

When an event is received, EDA uses rules to determine if one or more actions should be executed. Each rule must contain a condition that is evaluated when an event is received. If the condition is met, the actions are executed.

Events from the webhook event source plugin include a payload field containing the body of the webhook. This allows you to use the details of the ThousandEyes alert notification in your EDA rule conditions. This section describes how to reference the webhook event’s payload to implement an “if-this-then-that” logic in Ansible rulebooks.
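
For illustration, an event from the webhook source looks roughly like the following by the time it reaches the rule engine. This example is abridged to the fields referenced later in this article; the exact structure depends on your custom webhook body, and the test name shown is hypothetical.

{
  "payload": {
    "type": "2",
    "alert": {
      "rule": {
        "expression": "(responseTime > 500 ms)"
      },
      "test": {
        "name": "Example HTTP Server Test"
      }
    }
  }
}

A condition such as event.payload.type == "2" would match this event.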

Active Alert Notifications vs. Cleared Alert Notifications

One of the most important event properties when writing Ansible rulebook conditions for ThousandEyes alert notifications is the notification’s type: whether the notification indicates that an alert is “active” or has been “cleared”. An active alert notification is sent when the alert rule conditions are first met, i.e., when the alert begins. A cleared alert notification is sent when the alert rule conditions are no longer met, i.e., when the alert is resolved.

For example, if we had a rulebook that performs a remediation action in response to an alert, we would only want to perform the remediation action when the alert is active, not when it’s cleared. However, we might still want downstream notifications in both cases: e.g., posting in a Webex or Slack channel as part of a ChatOps model.

For more information on alerts and how they change from active to cleared, see the Clearing Alerts section of the Alerts article.

The example rulebook below shows two rules. The first rule matches alert notifications that are active, and executes two actions: a remediation playbook, and a notification playbook. The second rule handles alert notifications that are cleared, and executes one action: the notification playbook.

---
- name: Handle ThousandEyes webhook events
  hosts: all
  sources:
    - ansible.eda.webhook:
        host: 0.0.0.0
        port: 8080

  rules:
    - name: Active Alert 
      condition: event.payload.type == "2"   # Active alert type
      actions:
        - run_playbook:
            name: ./remediation_playbook.yml
        - run_playbook:
            name: ./notification_playbook.yml
            
    - name: Cleared Alert
      condition: event.payload.type == "1"   # Cleared alert type
      actions:
        - run_playbook:
            name: ./notification_playbook.yml

This rulebook contains two rules, one named “Active Alert” and the other named “Cleared Alert”. The condition for both rules checks against the event.payload.type. Type “2” indicates the alert is active, and type “1” indicates the alert is cleared.

Alert Rule Expressions

Another important event property is the alert rule expression. Each ThousandEyes alert is created from an alert rule. Alert rules are configured with conditions that determine when an alert is triggered. The alert rule expression is a machine-readable representation of those alert rule conditions.

Alert rule expressions let you determine what kind of alert you are receiving in EDA. For example, an alert rule for HTTP server tests may have a condition that the HTTP response time is > 500ms. Such an alert rule would have an expression of (responseTime > 500 ms). As another example, an alert rule for an agent-to-server test may have multiple conditions, triggering when packet loss is at least 10% and network latency is at least 50 ms. This rule would have an expression of ((loss >= 10%) && (latency >= 50 ms)).

The alert rule expression can be used in the Ansible rule condition by referencing event.payload.alert.rule.expression. The code block below shows two EDA rules that match the example ThousandEyes alert rules described above.

  rules:
    - name: High HTTP Response Time
      condition: event.payload.alert.rule.expression is match("(responseTime > 500 ms)", ignorecase=true) and event.payload.type == "2"
      action:
        # ...
    - name: Network Degraded
      condition: event.payload.alert.rule.expression is match("(loss >= 10%) && (latency >= 50 ms)", ignorecase=true) and event.payload.type == "2"
      action:
        # ...

By using the alert rule expressions in your Ansible rulebook conditions, you can create automations for general use cases based on the type of alert received, without coupling your rulebook to specific test IDs or alert rule IDs. In the example shown above, different automated actions could be executed depending on the details of the event that was received.

Available Alert Rule Expressions

For a full list of alert rule expressions, see the alert rule metadata documentation in the ThousandEyes API developer reference. You can also use the Alert Rules API to query the details, including the expression, for a given alert rule.
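
For example, a request along these lines lists your alert rules, including each rule's expression (the path reflects the v7 API; adjust it to match the current developer reference, and substitute your own API token):

curl -H "Authorization: Bearer <your-api-token>" https://api.thousandeyes.com/v7/alerts/rules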

Test Metadata

In some cases, you may want to create automations that are tightly coupled to one or more specific tests. For example, the automation action (such as an Ansible playbook) might apply only to one host (the test target) rather than to your entire fleet of hosts and their respective ThousandEyes tests. In these cases, you can include ThousandEyes test metadata in the EDA rule condition, as shown in the example after this list. Available properties include:

  • event.payload.alert.test.id

  • event.payload.alert.test.name

  • event.payload.alert.test.description

  • event.payload.alert.test.testType
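
For example, the following rule scopes its action to a single test by name. The test name and playbook path are hypothetical, and the condition also checks that the alert is active:

  rules:
    - name: Primary Portal Down
      condition: event.payload.alert.test.name == "Primary Portal HTTPS" and event.payload.type == "2"
      action:
        run_playbook:
          name: ./primary_portal_playbook.yml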

Other Webhook Variables

There are many other alert details that can be used in your Ansible rulebooks. Additional fields include timestamps, severity, and distinct measurements, if applicable. For a complete list of webhook variables, see Webhook Variables.

When you use these variables within an Ansible rulebook, be sure to prefix the names with event.payload. For example, to use the alert.targets.size webhook variable, reference event.payload.alert.targets.size in your EDA rule condition.
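
For instance, a rule that only reacts when more than one target is affected by an active alert might use a condition like the following (the threshold is illustrative):

  rules:
    - name: Multiple Targets Affected
      condition: event.payload.alert.targets.size > 1 and event.payload.type == "2"
      action:
        # ...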

Example Use Cases and Rulebooks

The following sections demonstrate example use cases of integrating ThousandEyes monitoring with Event-Driven Ansible automation.

Renew TLS Certificate

Failure to renew a certificate on time can result in website downtime and other service disruptions. Certificate renewal can be time-consuming and resource-intensive, particularly if it's done manually, and is prone to human error, which could lead to a certificate not being renewed correctly or on time.

In addition to capturing performance metrics like response time and the application response code, the ThousandEyes HTTP server test inspects TLS certificates when monitoring HTTPS targets and checks for validity, including expiration. An alert rule can be configured to trigger when TLS certificates have expired or when they will expire within some number of days. By combining ThousandEyes TLS monitoring with Event-Driven Ansible, you can automate certificate renewal and prevent service disruptions.

This example is based on a ThousandEyes Enterprise Agent monitoring an internal web application with a TLS certificate from an internal certificate authority, but is also applicable to Cloud Agents, public web applications, and public certificate authorities.

ThousandEyes Configuration

  1. Navigate to Alerts > Alert Rules and click Add New Alert Rule.

  2. In the Add New Alert Rule dialog, choose the alert type of Web > HTTP Server.

  3. In the Alert Conditions section, select Certificate / expires within / 14 days, as shown in the screenshot below.

  4. Use the Tests selector to assign the rule to one or more HTTP server tests.

  5. Click the Notifications tab, and in the Integrations section, select the custom webhook integration you had previously created.

  6. Finally, click Create New Alert Rule to apply your changes.

Event-Driven Ansible Rulebook

Use the following rulebook to receive ThousandEyes alert notifications in EDA. This example rulebook uses the webhook event source plugin and listens on port 8080, but you can change this to whatever port fits your requirements. The rulebook contains three rules:

  • A rule when the ThousandEyes alert notification is triggered and received, which executes an Ansible playbook to renew the TLS certificate

  • A rule when that renewal playbook execution succeeds

  • A rule when that renewal playbook execution fails

---
- name: ThousandEyes Webhook
  hosts: all
  sources:
    - ansible.eda.webhook:
        host: 0.0.0.0
        port: 8080
  rules:
    - name: Renew TLS Certificate
      condition: event.payload.alert.rule.expression is match("Certificate expires within", ignorecase=true) and event.payload.type == "2"
      action:
        run_playbook:
          name: ./renew_cert_playbook.yml
          post_events: true

    - name: Renew playbook success
      condition: event.renew_playbook_result is defined and event.renew_playbook_result.rc == 0
      actions:
        - debug:
            msg: "TLS Cert renewal succeeded!"

    - name: Renew playbook failure
      condition: event.renew_playbook_result is defined and event.renew_playbook_result.rc != 0
      actions:
        - debug:
            msg: "TLS Cert renewal failed!"

The first rule, which should match when the ThousandEyes alert notification is triggered, is based on the alert rule’s expression and the alert notification’s type. Specifically, this rule condition matches when the alert notification is triggered, not cleared, and when the alert rule expression matches certificate expiration, as in the alert rule we created above. When this first rule matches, its action executes a playbook; in this example, the playbook renews the TLS certificate on the target host. Additionally, because post_events is enabled, the results of the playbook execution are fed back into the rulebook.

The second and third rules match based on the results of the playbook executed in the first rule. This allows EDA to handle both possible outcomes of the attempt to automatically renew the certificate. If the playbook succeeded, one follow-up action may be to post an informational message to a Slack or Webex channel. If the playbook failed, there may be additional playbooks to run to try to renew the certificate, or an incident could be created in an ITSM tool to escalate the issue and fall back to a manual process.
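
The renewal playbook itself is not shown because it depends on your hosts and certificate authority. As a rough sketch only: one way to make the second and third rules work is to have the playbook record an artifact named renew_playbook_result using ansible.builtin.set_stats, which run_playbook can feed back to the rule engine when post_events is enabled. The renewal command, host group, and artifact layout below are illustrative assumptions.

---
- name: Renew TLS certificate on the alerting host
  hosts: webservers
  become: true
  gather_facts: false
  tasks:
    - name: Attempt renewal and report the outcome back to the rulebook
      block:
        - name: Run the certificate renewal (placeholder command)
          ansible.builtin.command: /usr/local/bin/renew-internal-cert.sh

        - name: Record success so the success rule can match
          ansible.builtin.set_stats:
            data:
              renew_playbook_result:
                rc: 0
            per_host: false
      rescue:
        - name: Record failure so the failure rule can match
          ansible.builtin.set_stats:
            data:
              renew_playbook_result:
                rc: 1
            per_host: false

Because the failure is caught in the rescue section, the play still completes and the artifact is recorded in both cases.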

Restart Web Application

Today, many web applications are deployed behind load balancers or reverse proxies to distribute network or application traffic. This setup is designed to improve availability and reliability by balancing the load between servers and, in turn, preventing any single server from becoming a bottleneck.

However, if a backend application goes down, clients can still reach the frontend load balancer, but the load balancer can no longer reach the backend. This can result in the load balancer timing out or responding with a 5XX HTTP response code. 5XX codes indicate a server error: the server is aware it has encountered an issue but is unable to fulfill the request.

This example is based on ThousandEyes Cloud and Enterprise Agents monitoring a public web application behind a reverse proxy and automating remediation when that reverse proxy responds with 5XX errors indicating a loss of connection to its backend.

ThousandEyes Configuration

  1. Navigate to Alerts > Alert Rules and click Add New Alert Rule.

  2. In the Add New Alert Rule dialog, choose the alert type of Web > HTTP Server.

  3. In the Alert Conditions section, select Response Code / is / server error (5xx), as shown in the screenshot below.

  4. Use the Tests selector to assign the rule to one or more HTTP server tests. Then, click the Notifications tab, and in the Integrations section, select the custom webhook integration you had previously created.

  5. Finally, click Create New Alert Rule to apply your changes.

Event-Driven Ansible Rulebook

Use the following rulebook to receive ThousandEyes alert notifications in EDA. This example rulebook uses the webhook event source plugin and listens on port 8080, but you can change this to whatever port fits your requirements. The rulebook contains three rules:

  • A rule when the ThousandEyes alert notification is triggered and received, which executes an Ansible playbook to restart the web application

  • A rule when that restart app playbook execution succeeds

  • A rule when that restart app playbook execution fails

---
- name: ThousandEyes Webhook
  hosts: all
  sources:
    - ansible.eda.webhook:
        host: 0.0.0.0
        port: 8080

  rules:
    - name: Restart Web Service
      condition: event.payload.alert.rule.expression is match("Response Code is server error (5xx)", ignorecase=true) and event.payload.type == "2"
      action:
        run_playbook:
          name: ./restart_web_service.yml
          post_events: true

    - name: Restart service playbook success
      condition: event.restart_web_service is defined and event.restart_web_service.rc == 0
      actions:
        - debug:
            msg: "Restarting web service succeeded!"

    - name: Restart service playbook failure
      condition: event.restart_web_service is defined and event.restart_web_service.rc != 0
      actions:
        - debug:
            msg: "Restarting web service failed!"

The first rule, which should match when the ThousandEyes alert notification is triggered, is based on the alert rule’s expression and the alert notification’s type. Specifically, this rule condition matches when the alert notification is triggered, not cleared, and when the alert rule expression matches the one we just configured above, i.e., Response Code is server error (5xx). When this first rule matches, its action executes a playbook; in this example, the playbook restarts the web application on the target host. Additionally, because post_events is enabled, the results of the playbook execution are fed back into the rulebook.

The second and third rules match based on the results of the playbook executed in the first rule. This allows EDA to handle both possible outcomes of attempting to restart the web application. If the playbook succeeded, one follow-up action may be to post an informational message to a Slack or Webex channel. If the playbook failed, there may be additional playbooks to run to try to remediate the 5xx server error, or an incident could be created in an ITSM tool to escalate the issue and fall back to a manual process.
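
As with the certificate example, the restart playbook is environment-specific. A minimal sketch, assuming the application runs as a systemd service (the service name is illustrative) and using the same set_stats pattern to feed the restart_web_service artifact back to the rulebook:

---
- name: Restart the web application behind the reverse proxy
  hosts: webservers
  become: true
  gather_facts: false
  tasks:
    - name: Restart the service and report the outcome back to the rulebook
      block:
        - name: Restart the application service
          ansible.builtin.systemd:
            name: example-webapp
            state: restarted

        - name: Record success so the success rule can match
          ansible.builtin.set_stats:
            data:
              restart_web_service:
                rc: 0
            per_host: false
      rescue:
        - name: Record failure so the failure rule can match
          ansible.builtin.set_stats:
            data:
              restart_web_service:
                rc: 1
            per_host: false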
