Getting Started with Alerts
Alerting is a critical component of the ThousandEyes platform to inform operations teams of performance deviations or problems. From DNS availability to BGP reachability to layer-3 network metrics, ThousandEyes has a wide array of alert triggers. Learn how to use the alerting framework to your advantage by selecting the best alert rule, customizing rule conditions and receiving notifications.
This getting-started guide covers:
- Creating alerts for your most important monitoring use cases.
- Baselining and customizing modular alert rules.
- Configuring notifications and alert integrations.
You should already have at least one periodically running test set up, testing against a specific endpoint or service. If you don’t have this set up, see Getting Started with Tests. Additionally, to get the most out of this guide, we recommend you read the following before you continue:
Tests and alert rules have a many-to-many relationship. You can assign one or more alert rules to a single test; and you can also create a common set of alert rules and then assign them to a single test or to groups of tests.
Alert rules can be managed through the test settings interface or the alert rules interface. To view existing alert rules, go to Alerts > Alert Rules.
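Alert rules can also be retrieved programmatically. As a hedged sketch, the endpoint path and bearer-token authentication below are assumptions based on the ThousandEyes public API (v7); verify them against the current API reference. The example only builds the request so it stays self-contained:

```python
from urllib.request import Request

# Placeholder token; a real call requires an OAuth bearer token from your account settings.
TOKEN = "YOUR_BEARER_TOKEN"

# Assumed v7 endpoint for listing alert rules; confirm in the ThousandEyes API docs.
req = Request(
    "https://api.thousandeyes.com/v7/alerts/rules",
    headers={"Authorization": f"Bearer {TOKEN}"},
    method="GET",
)

print(req.get_method(), req.full_url)
```

Sending the prepared request (for example with `urllib.request.urlopen`) would return the same rule list you see on the Alerts > Alert Rules screen.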
There are a number of ways to manage the conditions under which an alert rule will trigger. All alert rule types include these basic conditions:
- Number of agents required to meet the threshold
- Number of rounds or intervals across which those agents have to meet that threshold
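A minimal sketch of how these two basic conditions combine when a rule is evaluated (the names and structure are illustrative, not the platform's internals):

```python
from dataclasses import dataclass

@dataclass
class AlertCondition:
    min_agents: int   # agents that must breach the threshold in a round
    min_rounds: int   # consecutive rounds in which they must do so

def should_trigger(breaching_agents_per_round: list[int], cond: AlertCondition) -> bool:
    """breaching_agents_per_round[i] = number of agents over the threshold in round i."""
    if len(breaching_agents_per_round) < cond.min_rounds:
        return False
    recent = breaching_agents_per_round[-cond.min_rounds:]
    return all(n >= cond.min_agents for n in recent)

cond = AlertCondition(min_agents=2, min_rounds=3)
print(should_trigger([0, 2, 3, 2], cond))  # True: last 3 rounds each had >= 2 agents breaching
print(should_trigger([3, 3, 1], cond))     # False: the most recent round had only 1
```

Both knobs work together: raising either one makes the rule harder to trigger, which is the main lever for reducing noise.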
Alert notifications can be sent from the ThousandEyes platform via:
- Webhooks: An API-based integration with a platform of your choice. Preset configurations are available for tools such as Webex and Microsoft Teams.
- Built-in alert integrations with third-party platforms like PagerDuty and ServiceNow
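At its core, a webhook notification is an HTTP POST with a JSON body describing the alert. The sketch below builds such a body; the field names are illustrative placeholders, not ThousandEyes' actual webhook payload schema:

```python
import json

def build_webhook_payload(rule_name: str, test_name: str, severity: str,
                          metric: str, value_ms: int) -> str:
    """Assemble an illustrative JSON alert payload for a webhook receiver."""
    return json.dumps({
        "alertRule": rule_name,
        "test": test_name,
        "severity": severity,
        "text": f"{metric} reached {value_ms} ms on {test_name}",
    })

payload = build_webhook_payload(
    "Google HTTP Total Time", "HTTP server test", "Minor", "HTTP Total Time", 250
)
print(payload)
```

A receiving service (Webex, Microsoft Teams, or your own endpoint) would parse this body and post a message or open an incident.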
Alert rules are grouped into categories based on the application or network layer and source of the test. The categories of alert rules are:
- Cloud and Enterprise Agents
- Endpoint Agents
- BGP Routing
- Internet Insights
This guide focuses on alerting on Cloud and Enterprise Agent-based tests.
Go to Alerts > Alert Rules to see your current list of rules. In the list of alert rules, the ones starting with "Default" are default rules that ThousandEyes automatically adds to tests. The Apply To column indicates how many tests the rule is applied to.
Because our example HTTP server test runs on Cloud Agents, for this getting-started guide you'll configure alert rules for this test using the Cloud and Enterprise Agents tab.
Check out the default alert rules associated with this test type. For HTTP server tests, ThousandEyes automatically enables a "Default HTTP Server 2.0" alert rule and a "Default Network 2.0" alert rule. The blue checkbox in the right-hand column indicates that the alert rule is assigned automatically to HTTP server tests.
Alert Default HTTP Server 2.0 Example
Tailoring your alert rule to hit exactly the right conditions ensures you receive the right number of alerts without too many false positives. To get started, use a test to establish a baseline metric.
This guide uses an HTTP server test to monitor total round-trip times for a customer-facing website. You can use the test you created in the Getting Started with Cloud and Enterprise Agent Tests or use an existing one that you have configured in your environment. The best practice is to set up tests using Cloud and Enterprise Agent locations based on where customers or users access the site.
False alerts can result in alert fatigue or people ignoring events. Setting alert rule thresholds limits the conditions that can trigger an alert notification. To determine appropriate thresholds, establish a baseline for important metrics. Evaluating test results is the most accurate way to establish a metrics baseline.
Creating a dashboard is a quick and easy way to review test metrics in order to establish a baseline. Using the HTTP server test from the previous section with a dashboard can help to establish the maximum acceptable round-trip time for website availability. Setting this value too low could produce too many alerts, while setting it too high could result in missed detection of service degradation.
The example dashboard pictured below uses an HTTP server test to determine the total round-trip time from a set of Cloud Agents to google.com. The following dashboard views use the HTTP Total Time metric:
- A timeline view showing the mean of the metric grouped by test
- A timeline view grouped by agents measured by 98th percentile
- A box and whiskers widget grouped by agent
All three views report HTTP total time from all agents over a 12-hour span. The top view, "HTTP Server Total Time (by Test)", reports that the average round-trip time from all agents is 120 ms.
The middle view, "HTTP Server Total Time (by Agent)", breaks the test data down by agent, providing more granularity. This view uses the 98th percentile of Total Time, rather than the mean, to show longer round-trip times that still fall within an acceptable level of service availability.
The bottom view, "Box and Whiskers (by Agent)", offers even greater precision. Also grouped by agent, this view shows the maximum, third quartile, median, first quartile, and minimum metric values. The Minneapolis agent reports consistently below 200 ms, while The Dalles and Virginia locations report below 100 ms.
Based on this analysis, a reasonable baseline for maximum acceptable HTTP total time is 200 ms. Tests reporting a total round-trip time longer than 200 ms, therefore, should trigger an alert. The next section covers configuring an alert rule using this threshold.
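The same baseline analysis can be done offline from exported test results. The sketch below computes the mean and a nearest-rank 98th percentile over hypothetical HTTP Total Time samples (the values are made up for illustration):

```python
import math
from statistics import mean

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile, p in (0, 100]."""
    s = sorted(samples)
    k = math.ceil(p / 100 * len(s)) - 1
    return s[k]

# Hypothetical HTTP Total Time samples (ms) collected from agents over several rounds.
samples = [95, 102, 110, 118, 120, 125, 131, 140, 155, 190]

print(f"mean = {mean(samples):.0f} ms")        # mean = 129 ms
print(f"p98  = {percentile(samples, 98)} ms")  # p98  = 190 ms

# Pick an alert threshold above the observed p98, with some headroom.
threshold = 200
```

Choosing the threshold just above the high percentile, rather than the mean, keeps normal long-tail responses from firing alerts while still catching genuine degradation.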
Use the drill-down selection to include multiple tests in the widget. To show individual lines in the timeline view, use the One Chart per Line option for each test or agent, depending on the "group by" selection.
Custom alert rules allow for more accurate alerting that is more likely to reflect real impacts to service quality. Create a new custom alert rule for your example HTTP server test, using the baseline metric analysis you performed earlier.
To configure a new alert rule, navigate to Alerts > Alert Rules and click Add New Alert Rule.
Alert Rules List
For this example, configure the alert using the following values. Each field is explained in more detail below.
Click Create New Alert Rule to save the alert rule.
The alert rule created in this example has a Minor severity and is triggered when the HTTP Total Time meets or exceeds 200 ms for any agent 3 out of 3 times in a row. Because this test is configured to run every 2 minutes, the alert will trigger after 6 minutes if any single agent's HTTP Total Time reports over 200 ms for 3 consecutive rounds.
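The 3-out-of-3 condition for a single agent can be sketched as a streak check over that agent's per-round measurements (values and helper names are illustrative):

```python
def agent_triggers(times_ms: list[float], threshold: float = 200, consecutive: int = 3) -> bool:
    """True once any `consecutive` rounds in a row all meet or exceed the threshold."""
    streak = 0
    for t in times_ms:
        streak = streak + 1 if t >= threshold else 0
        if streak >= consecutive:
            return True
    return False

print(agent_triggers([150, 210, 230, 205]))  # True: rounds 2-4 are all >= 200 ms
print(agent_triggers([210, 230, 150, 205]))  # False: the streak is broken at 150 ms
```

With a 2-minute test interval, three consecutive breaching rounds corresponds to the roughly 6 minutes of sustained degradation described above.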
Alert notifications can be sent using email as well as to third-party solutions such as AppDynamics, PagerDuty, Splunk, Slack, and ServiceNow. For this guide we'll create an email notification. For information on third-party and custom webhook integrations, see Next Steps.
1. Expand the Google HTTP Total Time alert rule you previously created, so that you can edit it.
2. On the Notifications tab, click the drop-down arrow for the Send emails to field.
3. Type or select the email addresses of users who should receive ThousandEyes alert notifications. For example, a NOC distribution list or an SRE team.
4. If the email addresses you want to use aren't present in the drop-down list, click Edit external emails to add them.
5. (Optional) Click Add message to customize the body of the email message users will receive. For example, the email body might offer contact information or instructions on how to resolve the issue that activated the alert.
Setting alert notifications to email
By default, email notifications are sent only when an alert is first triggered. To receive an email when the alert clears, check the checkbox below the Send emails to field. Note that alerts in the dashboard remain active as long as the triggering rule criteria are met.
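This trigger-and-clear behavior amounts to notifying only on state transitions, not on every breaching round. A small sketch of that logic (purely illustrative, not platform code):

```python
def alert_events(breaching_by_round: list[bool]) -> list[tuple[str, int]]:
    """Yield ('trigger', round) and ('clear', round) events from per-round breach flags."""
    active = False
    events = []
    for i, breaching in enumerate(breaching_by_round):
        if breaching and not active:
            active = True
            events.append(("trigger", i))    # first breaching round: send trigger email
        elif not breaching and active:
            active = False
            events.append(("clear", i))      # first clean round: send clear email, if enabled
    return events

print(alert_events([False, True, True, False, True]))
# [('trigger', 1), ('clear', 3), ('trigger', 4)]
```

Rounds 1 and 2 both breach, but only round 1 produces a notification; the alert stays active on the dashboard until round 3 clears it.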
Alert Clear Email
Use the active alert list to evaluate your alert thresholds, verify alert severity, and ensure that the proper agents are associated with an alert rule.
You can see a list of alerts in various ways in the ThousandEyes platform:
- In a dashboard: The Alert List dashboard widget shows all active alerts.
- On the Alerts > Alert List screen: Use this screen to see detailed reports on specific alerts.
In order to intentionally trigger an alert using the Google HTTP Test example from previous sections, modify the alert condition for total time by setting it to a lower value than what is typically reported in the tests. If the current threshold for total time is 200 ms, try setting it to 10 ms to trigger an alert. You can follow the same steps from Create a Useful Alert Rule to update the total time alert condition.
Active Alert List
1. View active alerts by clicking Alerts > Alert List.
2. Use the search box to find specific alerts by the Name of the alert. For this example, type the word "google" into the search box.
3. The listed alert reports the alert rule that was triggered, start time, scope, test name, and severity. Click the small triangle to the left of the alert rule to expand an active alert. This expanded view shows the agents and related metrics that indicate why the alert was triggered.
4. Click the stack icon to the left of an agent's name to drill into individual test results for a specific agent. The test results offer a timeline view of the alert activity.
An alert is considered cleared when the conditions that triggered it are no longer met. For alert rules assigned to multiple locations, the alert clears only when every location stops meeting the alert conditions. Even if the rule requires only a minimum number of locations to trigger, the alert will not clear until all locations are back within the thresholds.
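The asymmetry between triggering and clearing can be made concrete: triggering needs some minimum number of breaching locations, but clearing needs all of them to recover. A sketch with made-up locations and values:

```python
def alert_should_clear(latest_ms_by_location: dict[str, float], threshold: float = 200) -> bool:
    """An alert clears only when every location is back under the threshold."""
    return all(t < threshold for t in latest_ms_by_location.values())

latest = {"Minneapolis": 180, "Virginia": 250, "The Dalles": 90}
print(alert_should_clear(latest))   # False: Virginia is still breaching

latest["Virginia"] = 120
print(alert_should_clear(latest))   # True: all locations have recovered
```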
If an agent becomes associated with an alert and then stops sending data to the platform, the platform auto-clears the alert after 12 hours of receiving no data from that agent. To clear an alert manually, un-assign the alert rule from the test. Do this as you continue to refine alert rules to match your specific criteria.
By default, the ThousandEyes platform shows the last 90 days of alerts. To see your organization's alert history, go to Alerts > Alert List > Alert History. Use the search bar to quickly find alerts, and use the time filter to isolate alerts from a particular time period. Filtering alerts this way can help you determine whether your alerts are configured to align with your monitoring goals, and can help you report on alerts associated with an outage or service interruption event.
Alert rule conditions are a powerful way to minimize extra alert noise by ensuring that alerts represent real service impacts. This section demonstrates an alternative to hard-coding a static number of agents, and shows how to configure multiple alert conditions.
Specifying a static number of agents may not be ideal for many real-world situations. Using a percentage is more flexible, especially if you have a large number of agents or you are frequently adding and removing agents.
1. Go to Alerts > Alert Rules and select the Google HTTP Total Time example you previously created.
2. Under Alert Conditions, change "Any of 1 agent" to "5% of agents".
3. Click Save Changes to update the alert rule.
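A percentage-based condition implies a concrete agent count that scales as you add or remove agents. How the platform rounds fractional counts is not specified here, so the round-up behavior below is an assumption for illustration:

```python
import math

def agents_required(total_agents: int, percent: float) -> int:
    """Agents implied by a percentage condition (rounding up is an assumption)."""
    return max(1, math.ceil(total_agents * percent / 100))

print(agents_required(10, 5))   # 1: 5% of 10 agents rounds up to 1
print(agents_required(60, 5))   # 3: 5% of 60 agents is exactly 3
```

With a static "1 agent" rule, growing from 10 to 60 agents would leave the trigger just as sensitive to a single outlier; the percentage form keeps sensitivity proportional to fleet size.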
An alert rule also supports multiple conditions. Using additional conditions can help you minimize extra alert noise by ensuring that alerts represent real service impacts.
In the Google HTTP Total Time example, you can specify HTTP Response Time as a condition in addition to HTTP Total Time:
1. Click the + button to the right of the Total Time alert condition.
2. Set the new condition's metric to HTTP Response Time and choose an appropriate threshold.
3. Click Save Changes to update the alert rule.
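Conceptually, each round's metrics are checked against every condition, and the rule fires based on how the conditions combine. Whether they combine as all-of or any-of depends on the rule's configuration, so both modes are sketched here with illustrative values:

```python
def round_breaches(metrics: dict[str, float],
                   conditions: list[tuple[str, float]],
                   require_all: bool = True) -> bool:
    """Check one round's metrics against several (metric, threshold) conditions."""
    results = [metrics.get(name, 0) >= limit for name, limit in conditions]
    return all(results) if require_all else any(results)

conds = [("HTTP Total Time", 200), ("HTTP Response Time", 100)]
sample = {"HTTP Total Time": 250, "HTTP Response Time": 80}

print(round_breaches(sample, conds))                     # False: only one condition is met
print(round_breaches(sample, conds, require_all=False))  # True: at least one condition is met
```

Requiring all conditions is the stricter setting and therefore the noise-reducing one: a slow total time alone will not fire unless the response time is also elevated.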
Monitoring alerts and managing alert thresholds is an ongoing process. Review alerts and alert rules regularly to make sure that the right alerts are triggered during service interruptions and that the right stakeholders are consistently notified. Use the timeline view regularly to make sure your baselines remain accurate. Being proactive with alerts reduces false positives and increases trust in the alerts from ThousandEyes, encouraging quick and effective service restoration.
For more information on setting up alerts, see the following:
For more information on third-party integrations and custom webhooks:
Continue your getting-started journey: