Getting Started with Alerts

Alerting is a critical component of the ThousandEyes platform, informing operations teams of performance deviations or problems. From DNS availability to BGP reachability to layer-3 network metrics, ThousandEyes offers a wide array of alert triggers. Learn how to use the alerting framework to your advantage by selecting the best alert rule, customizing rule conditions, and receiving notifications.

This getting-started guide covers:

  • Creating alerts for your most important monitoring use cases.

  • Baselining and customizing modular alert rules.

  • Configuring notifications and alert integrations.

Prerequisites

You should already have at least one periodically running test set up, testing against a specific endpoint or service. If you don’t have this set up, see Getting Started with Tests. Additionally, to get the most out of this guide, we recommend reviewing the key topics below before you continue.

Key Topics

Alerts Overview

Alerts' Relationship with Tests

Tests and alert rules have a many-to-many relationship. You can assign one or more alert rules to a single test, and you can also create a common set of alert rules and then assign them to a single test or to groups of tests.

Alert rules can be managed through the test settings interface or the alert rules interface. To view existing alert rules, go to Alerts > Alert Rules.
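If you manage many rules, you can also review them programmatically through the ThousandEyes API. The following is a minimal sketch only: the endpoint path, authentication scheme, and response field names are assumptions that you should verify against the current ThousandEyes API documentation.

```python
# Minimal sketch: list existing alert rules via the ThousandEyes API.
# The endpoint path, auth scheme, and response keys below are assumptions --
# verify them against the current ThousandEyes API documentation.
import os
import requests

API_TOKEN = os.environ["TE_API_TOKEN"]          # hypothetical env var holding an API bearer token
BASE_URL = "https://api.thousandeyes.com/v7"    # assumed API base URL

resp = requests.get(
    f"{BASE_URL}/alerts/rules",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()

for rule in resp.json().get("alertRules", []):  # response key is an assumption
    print(rule.get("ruleName"), "- default:", rule.get("isDefault"))
```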

Parts of an Alert Rule

Each alert rule has two components: the conditions that must be met to trigger an alert, and a notification policy that specifies how you want to be notified about alerting events.

Trigger Conditions

There are a number of ways to manage the conditions under which an alert rule will trigger. All alert rule types include these basic conditions (a simplified evaluation sketch follows the list):

  • One or more thresholds

  • Number of agents required to meet the threshold

  • Number of rounds or intervals across which those agents have to meet that threshold
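To make these conditions concrete, here is a simplified model of how a rule with a static threshold, a minimum agent count, and an N-of-M rounds requirement could be evaluated. This is a conceptual sketch only, not how the ThousandEyes platform is actually implemented.

```python
# Simplified model of alert-rule evaluation: a rule triggers when at least
# `min_agents` agents meet the threshold in `rounds_required` of the last
# `rounds_window` rounds. Conceptual sketch only, not ThousandEyes code.
from typing import Dict, List

def rounds_violating(rounds: List[Dict[str, float]], threshold: float, min_agents: int) -> int:
    """Count rounds in which at least `min_agents` agents met or exceeded the threshold."""
    return sum(
        1
        for round_values in rounds  # each round maps agent name -> metric value
        if sum(value >= threshold for value in round_values.values()) >= min_agents
    )

def should_trigger(rounds, threshold, min_agents, rounds_required, rounds_window) -> bool:
    """Trigger when the threshold is met in `rounds_required` of the last `rounds_window` rounds."""
    return rounds_violating(rounds[-rounds_window:], threshold, min_agents) >= rounds_required

# Example: any 1 agent at or above 200 ms in 3 of the last 3 rounds.
history = [
    {"Minneapolis": 210, "Virginia": 95},
    {"Minneapolis": 250, "Virginia": 90},
    {"Minneapolis": 205, "Virginia": 88},
]
print(should_trigger(history, threshold=200, min_agents=1, rounds_required=3, rounds_window=3))  # True
```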

Notification Policy

Alert notifications can be sent from the ThousandEyes platform via:

  • Email

  • Webhooks: An API-based integration with a platform of your choice. Preset configurations are available for tools such as Webex and Microsoft Teams. A minimal receiver sketch follows this list.

  • Built-in alert integrations with third-party platforms like PagerDuty and ServiceNow
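If you plan to receive alerts via webhooks, you need an endpoint that accepts the notification POST. Below is a minimal receiver sketch using only the Python standard library; the payload field it reads (ruleName) is a placeholder rather than the documented ThousandEyes webhook schema, so check the webhook documentation for the actual field names.

```python
# Minimal webhook receiver sketch using only the Python standard library.
# The payload field referenced below (ruleName) is a placeholder; check the
# ThousandEyes webhook documentation for the real payload schema.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Route or log the notification; here we simply print it.
        print("Received alert notification:", payload.get("ruleName", "<unknown rule>"))
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AlertWebhookHandler).serve_forever()
```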

Types of Alert Rules

Alert rules are grouped into categories based on the application or network layer and source of the test. The categories of alert rules are:

  • Cloud and Enterprise Agents

  • Endpoint Agents

  • BGP Routing

  • Device

  • Internet Insights

This guide focuses on alerting on Cloud and Enterprise Agent-based tests.

Default Alert Rules

Go to Alerts > Alert Rules to see your current list of rules. In the list of alert rules, the ones starting with "Default" are default rules that ThousandEyes automatically adds to tests. The Apply To column indicates how many tests the rule is applied to.

Because our example HTTP server test runs on Cloud Agents, for this getting-started guide you'll configure alert rules for this test using the Cloud and Enterprise Agents tab.

Check out the default alert rules associated with this test type. For HTTP server tests, ThousandEyes automatically enables a "Default HTTP Server 2.0" alert rule and a "Default Network 2.0" alert rule. The blue checkbox in the right-hand column indicates that the alert rule is assigned automatically to HTTP server tests.

For a full list of default rules, see Default Alert Rules.

Start with a Test

Tailoring your alert rule to exactly the right conditions ensures you receive the right volume of alerts without too many false positives. To get started, use a test to establish a baseline metric.

This guide uses an HTTP server test to monitor total round-trip times for a customer-facing website. You can use the test you created in the Getting Started with Cloud and Enterprise Agent Tests guide or use an existing one that you have configured in your environment. The best practice is to set up tests using Cloud and Enterprise Agent locations based on where customers or users access the site.

For more information about HTTP server tests, see Internet and WAN Monitoring Tests.

To learn more about the HTTP Total Time metric as well as other available metrics, see ThousandEyes Metrics: What Do Your Results Mean?

Create a Baseline Dashboard

False alerts can result in alert fatigue or people ignoring events. Setting alert rule thresholds limits the conditions that can trigger an alert notification. To determine appropriate thresholds, establish a baseline for important metrics. Evaluating test results is the most accurate way to establish a metrics baseline.

Creating a dashboard is a quick and easy way to review test metrics in order to establish a baseline. Using the HTTP server test from the previous section with a dashboard can help to establish the maximum acceptable round-trip time for website availability. Setting this value too low could produce too many alerts, while setting it too high could result in missed detection of service degradation.

The example dashboard described below uses an HTTP server test to determine the total round-trip time from a set of Cloud Agents to google.com. The following dashboard views use the HTTP Total Time metric:

  • A timeline view showing the mean of the metric grouped by test

  • A timeline view grouped by agent, measured at the 98th percentile

  • A box and whiskers widget grouped by agent

All three views report HTTP total time from all agents over a 12-hour span. The top view, "HTTP Server Total Time (by Test)", reports that the average round-trip time from all agents is 120 ms.

The middle view, "HTTP Server Total Time (by Agent)", breaks the test data down by agent, providing more granularity. This view uses the 98th percentile of Total Time, rather than the mean, to show longer round-trip times that still fall within an acceptable level of service availability.

The bottom view, "Box and Whiskers (by Agent)", offers even greater precision. Also grouped by agent, this view shows the maximum, third quartile, median, first quartile, and minimum metric values. The Minneapolis location is reporting consistently below 200 ms, while The Dalles and Virginia locations are reporting below 100 ms.

Based on this analysis, a reasonable baseline for maximum acceptable HTTP total time is 200 ms. Tests reporting a total round-trip time longer than 200 ms should therefore trigger an alert. The next section covers configuring an alert rule using this threshold.

Use the drill-down selection to include multiple tests in the widget. Show individual lines in the timeline view by using the One Chart per Line option for each test or agent, depending on the "group by" selection.

For information on creating dashboards, see Getting Started with Dashboards.
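If you prefer to sanity-check the baseline outside the dashboard, you can apply the same reasoning to raw test results. The sketch below computes the mean and 98th percentile of a list of HTTP Total Time samples; how you obtain the samples (API export, CSV download) is up to you, and the numbers shown are placeholders.

```python
# Sketch: derive a candidate threshold from raw HTTP Total Time samples (in ms).
# The sample values are placeholders; substitute your own exported test data.
from statistics import mean, quantiles

samples_ms = [112, 118, 131, 120, 119, 145, 160, 190, 115, 122, 108, 176]

p98 = quantiles(samples_ms, n=100)[97]   # 98th percentile of the samples
print(f"mean = {mean(samples_ms):.0f} ms, p98 = {p98:.0f} ms")

# A common starting point: round the p98 up to a clean value (here, 200 ms)
# and use that as the alert threshold.
```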

Create a Useful Alert Rule

Custom alert rules allow for more accurate alerting that is more likely to reflect real impacts to service quality. Create a new custom alert rule for your example HTTP server test, using the baseline metric analysis you performed earlier.

To configure a new alert rule, navigate to Alerts > Alert Rules and click Add New Alert Rule.

For this example, configure the alert using the following values. Each field is explained in more detail below.

Alert Type

Choose the kind of test this alert applies to (test layer, test type). For this example: Web, HTTP Server.

Rule Name

Name the rule. For example, "Google HTTP Total Time".

Tests

Select the test to apply this rule to. This example uses "Google HTTP Server Test".

Severity

This field indicates how critical it is to respond to the alert. The default for this field is "Info". For this example, set it to "Minor". For more information on severity, see Alert Severity.

Alert Conditions

The number of agents that must meet the alert rule's conditions in order to trigger an alert. For this example, set this field to: All conditions are met by "any of" "1" "agent" "3" of "3" times in a row. For more details on defining agent thresholds for alert conditions, see Global and Location Alert Conditions.

Note: use "+" to create additional conditions.

Create a condition using the test metric and the baseline value determined from the analysis in the previous section. Set to: "Total Time" ">=" "Static" "200" ms. For more information about alert rule operators and metrics, see: Available Metrics Operators and Units.

Click Create New Alert Rule to save the alert rule.

The alert rule created in this example has a Minor severity and triggers when any agent reports an HTTP Total Time of 200 ms or more 3 times in a row. Because the test is configured to run every 2 minutes, the alert triggers after 6 minutes if any single agent's HTTP Total Time is at or above 200 ms for 3 consecutive rounds.
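You can also create the same rule programmatically. The sketch below is illustrative only: the endpoint path, field names, and condition expression syntax are assumptions and must be checked against the current ThousandEyes API documentation before use.

```python
# Rough sketch: create the example alert rule via the ThousandEyes API.
# Endpoint path, field names, and expression syntax are assumptions --
# verify against the current API documentation before relying on this.
import os
import requests

API_TOKEN = os.environ["TE_API_TOKEN"]          # hypothetical env var for an API bearer token
BASE_URL = "https://api.thousandeyes.com/v7"    # assumed API base URL

rule = {
    "ruleName": "Google HTTP Total Time",
    "alertType": "http-server",                 # assumed identifier for Web / HTTP Server rules
    "severity": "minor",
    "expression": "((totalTime >= 200 ms))",    # assumed expression syntax for the condition
    "minimumSources": 1,                        # any 1 agent ...
    "roundsViolatingRequired": 3,               # ... 3 of ...
    "roundsViolatingOutOf": 3,                  # ... 3 rounds in a row
    "testIds": [12345],                         # hypothetical ID of "Google HTTP Server Test"
}

resp = requests.post(
    f"{BASE_URL}/alerts/rules",
    json=rule,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print("Created rule:", resp.json())
```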

For more information on creating alert rules, see Creating and Editing Alert Rules.

Configure Alert Notifications

Alert notifications can be sent using email as well as to third-party solutions such as AppDynamics, PagerDuty, Splunk, Slack, and ServiceNow. For this guide we'll create an email notification. For information on third-party and custom webhook integrations, see Next Steps.

Send Alerts via Email

  1. Expand the Google HTTP Total Time alert rule you previously created, so that you can edit it.

  2. On the Notifications tab, click the drop-down arrow for the Send emails to field.

  3. Type or select the email addresses of users who should receive ThousandEyes alert notifications.

    For example, a NOC distribution list or an SRE team.

  4. If the email addresses you want to use aren't present in the drop-down list, click Edit external emails to add them.

  5. [Optional] Click Add message to customize the body of the email message users will receive.

    For example, the email body might offer contact information or instructions on how to resolve the issue that activated the alert.

By default, email notifications are sent only when an alert is first triggered. To receive an email when the alert clears, check the checkbox below the Send emails to field. Note that alerts in the dashboard remain active as long as the triggering rule criteria are met.

For more information on alert notifications, see Alert Notifications.

Viewing Active Alerts

Use the active alert list to evaluate your alert thresholds, verify alert severity, and ensure that the proper agents are associated with an alert rule.

You can see a list of alerts in various ways in the ThousandEyes platform:

  • In a dashboard: The Alert List dashboard widget shows all active alerts.

  • On the Alerts > Alert List screen: Use this screen to see detailed reports on specific alerts.

To intentionally trigger an alert using the Google HTTP Server Test example from previous sections, modify the alert condition for total time by setting it to a lower value than what the tests typically report. If the current threshold for total time is 200 ms, try setting it to 10 ms to trigger an alert. You can follow the same steps from Create a Useful Alert Rule to update the total time alert condition.

  1. View active alerts by clicking Alerts > Alert List.

  2. Use the search box to find specific alerts using the Alert ID or Name of the alert. For this example, type the word “google” into the search box.

  3. The listed alert reports the alert rule that was triggered, start time, scope, test name, and severity. Click on the small triangle to the left of the alert rule to expand an active alert. This expanded view shows the agents and related metrics to indicate why the alert was triggered.

  4. Click the stack icon to the left of an agent's name to drill into individual test results for a specific agent. The test results offer a timeline view of the alert activity.
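You can also pull the active alert list programmatically, for example to confirm that the deliberately triggered alert appears. As before, the endpoint path and response fields in this sketch are assumptions to verify against the current ThousandEyes API documentation.

```python
# Sketch: list active alerts and filter by rule name. Endpoint path and
# response keys are assumptions -- check the current API documentation.
import os
import requests

API_TOKEN = os.environ["TE_API_TOKEN"]
BASE_URL = "https://api.thousandeyes.com/v7"

resp = requests.get(
    f"{BASE_URL}/alerts",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()

for alert in resp.json().get("alerts", []):      # response key is an assumption
    if "google" in alert.get("ruleName", "").lower():
        print(alert.get("ruleName"), alert.get("startDate"), alert.get("severity"))
```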

For more information, see Viewing Alerts.

Alert Clearing

An alert is considered cleared when the conditions that triggered it are no longer in effect. For alert rules assigned to multiple locations, the alert clears only when no location still meets the alert conditions, even if the rule requires only a minimum number of locations to trigger.

If an agent stops sending data to the platform after becoming associated with an alert, the platform auto-clears the alert after 12 hours of receiving no data from that agent. To clear an alert manually, unassign the alert rule from the test. Do this as you continue to refine alert rules to match your specific criteria.

For more information on alert clearing, see Alert Clearing.

Analyzing Alert History

By default, the ThousandEyes platform shows the last 90 days of alerts. To see your organization's alert history, go to Alerts > Alert List > Alerts History. Use the search bar to quickly find alerts, and use the time filter to isolate alerts from a particular time period. Filtering alerts this way can help you determine whether your alerts are configured to align with your monitoring goals, and can help you report on alerts associated with an outage or service interruption event.

For more information, see Alerts History.

Tune Your Alert Rule

Alert rule conditions are a powerful way to minimize extra alert noise by ensuring that alerts represent real service impacts. This section demonstrates an alternative to hard-coding a static number of agents, and shows how to configure multiple alert conditions.

Specifying a static number of agents may not be ideal for many real-world situations. Using a percentage is more flexible, especially if you have a large number of agents or you are frequently adding and removing agents.

  1. Go to Alerts > Alert Rules and select the Google HTTP Total Time example you previously created.

  2. Under Alert Conditions, change Any of 1 agent to 5% of agents.

  3. Click Save Changes to update the alert rule.

An alert rule also supports multiple conditions. Using additional conditions can help you minimize extra alert noise by ensuring that alerts represent real service impacts.

In the Google HTTP Total Time example, you can specify HTTP Response Time as a condition in addition to HTTP Total Time:

  1. Click the + button to the right of the Total Time alert condition.

  2. Set the condition to: "Response Time" ">=" "Static" "100" ms.

  3. Click Save Changes to update the alert rule.
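The same tuning can be applied programmatically. As with the earlier API sketches, the endpoint path, field names, and expression syntax below are assumptions to verify against the current ThousandEyes API documentation.

```python
# Sketch: update the example rule to use a percentage of agents and a second
# condition. Field names and expression syntax are assumptions -- verify
# against the current ThousandEyes API documentation before use.
import os
import requests

API_TOKEN = os.environ["TE_API_TOKEN"]
BASE_URL = "https://api.thousandeyes.com/v7"
RULE_ID = 67890                                  # hypothetical ID of "Google HTTP Total Time"

updates = {
    "minimumSourcesPct": 5,                      # 5% of agents instead of a static count
    # Two conditions combined: total time >= 200 ms AND response time >= 100 ms.
    "expression": "((totalTime >= 200 ms) && (responseTime >= 100 ms))",
}

resp = requests.put(
    f"{BASE_URL}/alerts/rules/{RULE_ID}",
    json=updates,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print("Updated rule:", resp.json())
```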

Configuring good tests is also an important step in effective alerting. For more information on configuring tests, see Getting Started with Tests.

Next Steps

Monitoring alerts and managing alert thresholds is an ongoing process. Review alerts and alert rules regularly to make sure that the right alerts are triggered during service interruptions and the right stakeholders are consistently notified. Regularly check the timeline view to confirm that your baselines remain accurate. Being proactive with alerts reduces false positives and increases trust in the alerts from ThousandEyes, encouraging quick and effective service restoration.

For more information on setting up alerts, see the following:

For more information on third-party integrations and custom webhooks:

Continue your getting-started journey:
