Adaptive Alert Detection

Adaptive alert detection combines probability theory, historical agent behavior, and your sensitivity setting, applied through your alert rule’s global alert condition. Adaptive alerting is designed to deduce when a real, alertable issue occurs based on your test's anomaly records, reducing the number of false positives you receive and increasing your confidence that the alerts you do receive are genuine issues.

It's important to note the difference between an anomaly and an issue. An anomaly is a test result that breaches a threshold of what you consider "normal". But network behavior means that your "normal" might fluctuate depending on many factors, so a breach may not in fact constitute an "issue", that is, a genuine problem that needs attention to bring the test back within normal limits.

Adaptive alerting takes much of the tedium and uncertainty out of setting alert conditions manually, which can be a bit of a lengthy guessing game. Set the conditions too broadly and you could be inundated with noisy alerts that don’t signify true issues. Set them too narrowly, and you might miss being alerted about a bona fide problem. In either case, you could observe your test results over time and fine-tune the alert condition to better approximate the set of conditions that would result in a useful alert, but this heuristic tinkering often takes time, doesn't adjust for network conditions, and can't account for historic patterns.

By contrast, adaptive alerting instantly assesses the presence of an issue by studying anomaly history and recent network behavior, while reducing noise by recognizing when an issue is just one issue, and not several. Moreover, if network characteristics slowly change over time, there's no need to reconfigure the alert rule; it adjusts automatically.

Significantly, adaptive alerting is personalized to each test you run. Your organization’s use of the network is unique, and the tests you run are configured to optimize your organization’s performance. As the name suggests, this method of issue identification adapts to your needs and to the state of the network, sending you only the alerts that matter to you.

Adaptive alert detection is available for Cloud and Enterprise Agent tests only.

How Adaptive Alerting Works

Adaptive alerting differentiates between normal network variations and actual issues. It does this by using a patent-pending statistical model that does the laborious, ongoing test result analysis for you, saving you time and improving your results.

What the Model Considers

This method of alerting breaks down into three distinct steps. These steps repeat round after round for each test and every alert rule applied to it.

  1. Agents identify anomalies.

  2. The model algorithmically calculates the probability of an issue based on the anomalies detected.

  3. If the probability of an issue being present passes a threshold which depends on your sensitivity setting, an alert is triggered.
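
To make these three steps concrete, here is a minimal sketch of the per-round loop in Python. Everything in it is invented for illustration: the canned anomaly counts, the simple probability update, and the 0.8 trigger threshold are stand-ins for the patent-pending model, not ThousandEyes code.

```python
# Toy walkthrough of the three steps for one test + alert rule pairing.
# All numbers and the probability update are illustrative assumptions.

TRIGGER_AT = 0.8            # e.g. a medium-sensitivity trigger threshold (assumed)
ANOMALY_BASELINE = 0.1      # historical anomaly rate for this test (assumed)
AGENTS_PER_ROUND = 10

# Step 1: anomalies reported per round (a canned sequence instead of live agents).
anomalies_per_round = [1, 0, 1, 4, 6, 7, 7, 2, 1, 0]

p_issue = 0.0
for round_no, anomalies in enumerate(anomalies_per_round):
    rate = anomalies / AGENTS_PER_ROUND
    # Step 2: toy probability update - rises when the anomaly rate exceeds the
    # baseline, decays back toward zero when it falls below it.
    p_issue = min(1.0, max(0.0, p_issue + 0.5 * (rate - ANOMALY_BASELINE)))
    # Step 3: trigger an alert once the probability passes the threshold.
    status = "ALERT" if p_issue >= TRIGGER_AT else "ok"
    print(f"round {round_no}: anomalies={anomalies} p_issue={p_issue:.2f} {status}")
```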

The model takes into account three factors that help to ensure that it is identifying real issues that matter to you. These are:

  • Anomaly baseline: The anomaly rate observed over a period of days. This amounts to a built-in margin-of-error mechanism that accounts for the fact that detected anomalies do not always correspond to real issues.

    For example, if the average anomaly rate over the past several days is 20% (i.e., any 2 out of 10 agents continually identify anomalies), this becomes the anomaly baseline for that test. The model uses this baseline to understand how the anomalies it’s seeing compare to the historical average for that test, and adjusts its issue probability accordingly.

    Importantly, the anomaly baseline calculation does not consider whether issues were confirmed in previous rounds as a result of a certain percentage of anomalies being detected. This ensures the calculation of issue probability in the current round is free from any undue bias.

    The model continually adjusts the baseline based on the historical data it receives. If, over time, the observed anomaly rate averages out to 30% instead of 20%, then 30% becomes the new baseline. Similarly, if network anomalies decrease, the baseline adjusts downwards. In this way, your alert mechanism moves in step with the network (see the sketch after this list).

  • Test frequency: Network issues are not typically momentary blips; they usually last for minutes or hours. So your test frequency has a large bearing on whether you (or, indeed, the model) can confidently discern an issue based on the anomalies agents record over different time periods. The model takes test frequency into account and adapts its probability calculations accordingly.

    Thanks to many years of operational data, we know how long the average issue lasts and how often one typically occurs. We leverage this knowledge to determine the probability of an issue persisting across each test interval, from one minute up to an hour. This helps the model keep its issue probabilities consistent regardless of test frequency.

  • Repeated agent anomalies: The model takes into account which agents are flagging anomalies. If, for example, one agent in ten regularly reports an anomaly, this raises the baseline to the point where an anomaly reported by any single agent is not, on its own, considered an issue. However, if the same agent consistently reports an anomaly, this could still indicate an issue for that agent. Therefore, when the same agent reports an anomaly several times in a row, the model factors this into its determination of whether an issue is present.
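
The anomaly baseline described in the first factor above is, in effect, a rolling anomaly rate. The model's actual lookback window and weighting are not published, so the sketch below assumes a plain average over a fixed number of recent rounds, purely for illustration.

```python
from collections import deque

class AnomalyBaseline:
    """Toy rolling anomaly-rate baseline. The real model's lookback window and
    weighting are not published; a plain mean over recent rounds is assumed."""

    def __init__(self, window_rounds: int):
        self.rates = deque(maxlen=window_rounds)   # most recent per-round anomaly rates

    def update(self, anomalous_agents: int, total_agents: int) -> float:
        self.rates.append(anomalous_agents / total_agents)
        return self.baseline()

    def baseline(self) -> float:
        return sum(self.rates) / len(self.rates) if self.rates else 0.0

# If 2 of 10 agents flag anomalies round after round, the baseline settles near 0.20,
# and that level of anomalies is treated as "normal" for this test. If the mix shifts
# to 3 of 10 over time, the baseline drifts toward 0.30 automatically.
baseline = AnomalyBaseline(window_rounds=96)       # e.g. one day of 15-minute rounds
for _ in range(96):
    baseline.update(anomalous_agents=2, total_agents=10)
print(round(baseline.baseline(), 2))               # 0.2
```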

How the Model Uses Probability

Adaptive alerting calculates the probability of an underlying (unobservable) issue based on the anomalies agents do observe. Because not all anomalies amount to an actual issue, the model has to take what it can observe – anomalies – and make a determination from that about what it can’t observe – genuine issues.

You can think of it in terms of expected and unexpected fire alarms in your office building. The alarm (the observable anomaly) may go off at the same time each week to test the system. You do not see a fire (the issue), but the alarm suggests there could be one. However, you have set a mental anomaly baseline of one fire alarm per week being normal; therefore you determine there's no issue and you do not evacuate. When a fire alarm goes off unexpectedly, you again may not see whether there's an actual fire, but you evacuate anyway, because the baseline has been breached and this, you deduce, constitutes an actual issue.

The model calculates in real time the probability of there being an issue in the current round, given the state your test was in during previous rounds, while also accounting for test frequency, repeated agent anomalies, and the anomaly baseline. Once the issue probability is calculated, all that is left is to determine the threshold beyond which to trigger an alert. You do this yourself by setting your sensitivity level.
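
The patent-pending model itself is not published, but the general shape of such a calculation can be illustrated with a simple Bayesian filter over a hidden "issue / no issue" state. In the sketch below, every constant and probability is an assumption chosen only to show how the previous round's state, the test interval, and the anomaly baseline could feed into the current round's issue probability.

```python
def update_issue_probability(
    p_prev: float,          # issue probability after the previous round
    anomaly_rate: float,    # fraction of agents flagging an anomaly this round
    baseline: float,        # historical anomaly rate for this test
    interval_minutes: int,  # test frequency, used for the persistence terms
) -> float:
    """Toy Bayesian update over a hidden "issue / no issue" state.
    All constants below are invented for illustration."""

    # Persistence: the longer the test interval, the more likely an issue has
    # started or ended between rounds (assumed rates, not measured ones).
    p_start = min(0.2, 0.005 * interval_minutes)   # chance an issue begins this round
    p_end = min(0.5, 0.01 * interval_minutes)      # chance an ongoing issue ends

    # Prior for this round, before looking at the new anomalies.
    prior = p_prev * (1 - p_end) + (1 - p_prev) * p_start

    # Likelihood of the observed anomaly rate with and without an underlying issue:
    # near-baseline rates are expected when there is no issue, high rates when there is.
    like_issue = 0.1 + 0.9 * anomaly_rate
    like_no_issue = max(0.05, 1.0 - 2.0 * abs(anomaly_rate - baseline))

    # Bayes' rule: posterior probability of an issue given this round's anomalies.
    evidence_for_issue = prior * like_issue
    return evidence_for_issue / (evidence_for_issue + (1 - prior) * like_no_issue)

# A burst of anomalies pushes the probability up; a return to baseline pulls it down.
p = 0.0
for rate in [0.1, 0.1, 0.6, 0.8, 0.9, 0.9, 0.3, 0.1]:
    p = update_issue_probability(p, rate, baseline=0.1, interval_minutes=15)
    print(f"anomaly rate {rate:.1f} -> issue probability {p:.2f}")
```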

When multiple metric conditions are added to an adaptive alert rule, you can choose whether the alert takes all conditions into account (the logical AND) or if only one is required to trigger an alert (the logical OR).

Adaptive Alert Sensitivity

While the model does a lot of the work for you, you still have control over how sensitive you want the model to be. Setting the sensitivity level to high means that you receive more alerts, whereas setting it to low means you receive fewer alerts.

Sensitivity works by setting an upper and lower probability threshold: an alert is triggered when the issue probability rises above the upper threshold and cleared when it falls below the lower one. For example, the medium sensitivity thresholds might be 80% to trigger an alert and 20% to clear it. That means that once the probability of there being an issue hits 80%, an alert is triggered. The same alert does not clear until the probability of there being an issue drops to 20%. In this way, the probability can fluctuate between 80% and 20% without the alert “flapping” on and off the way manually configured alerts can; instead, the alert remains open until the issue probability has decreased to a level that can more realistically be considered resolved. See Reducing Alert Noise for more information about flappy alerts, and the Adaptive Alert Example to see this working in practice.

High sensitivity probability thresholds are lower than medium thresholds, which in turn are lower than low sensitivity thresholds. For example, if the high probability threshold is 60%, medium might be 80% and low might be 90%. In other words, there is an inverse relationship between the sensitivity level and its probability thresholds (i.e., the highest sensitivity level has the lowest probability threshold). However, as noted above, there is a direct correlation between your sensitivity level and the number of alerts you receive: higher sensitivity means more alerts.
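
A small sketch of this trigger/clear hysteresis is shown below. The thresholds reuse the example figures above (60/80/90% to trigger, 20% to clear) and are illustrative only, not the product's actual values.

```python
# Hysteresis between trigger and clear thresholds keeps one alert open rather
# than flapping. Threshold values mirror the examples above and are assumptions.
SENSITIVITY_THRESHOLDS = {   # sensitivity: (trigger_at, clear_at)
    "high":   (0.60, 0.20),
    "medium": (0.80, 0.20),
    "low":    (0.90, 0.20),
}

def next_alert_state(alert_open: bool, p_issue: float, sensitivity: str = "medium") -> bool:
    trigger_at, clear_at = SENSITIVITY_THRESHOLDS[sensitivity]
    if not alert_open and p_issue >= trigger_at:
        return True          # probability crossed the trigger threshold: open the alert
    if alert_open and p_issue <= clear_at:
        return False         # probability fell to the clear threshold: close the alert
    return alert_open        # anywhere in between: keep the current state (no flapping)

# The alert opens at 0.85 and stays open through 0.75, 0.82, 0.60, and 0.30,
# clearing only once the probability reaches 0.15.
alert = False
for p in [0.50, 0.85, 0.75, 0.82, 0.60, 0.30, 0.15]:
    alert = next_alert_state(alert, p)
    print(f"p_issue={p:.2f} alert={'open' if alert else 'cleared'}")
```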

Adaptive alerting is available for all metrics, including both static and dynamic metrics. The sensitivity setting determines when an alert triggers as a consequence of the anomalies detected in agent data. The default sensitivity setting for adaptive alert rules is medium.

When Used with Quantile Dynamic Baselining

New alerts are configured by default to use both adaptive alerting and quantile dynamic baselining (for those metrics that support dynamic baselining), though you may have some older alerts that are still set to manual alerting or static baselining. You can configure new and old alerts to any combination of alerting method (adaptive/manual) and baselining method (dynamic/static); see Implementing Adaptive Alerting for more information.

There are some things to note when you use adaptive alerting and quantile dynamic baselining together:

  1. You will not see an option to set a threshold beyond which an agent identifies an anomaly (e.g. jitter ≥ 15 ms) because quantile dynamic baselining automatically does that for you.

  2. You do not need to set the sensitivity level for quantile dynamic baselining when using adaptive alerting. This is because the sensitivity setting for adaptive alerting automatically adjusts the anomaly threshold as part of its algorithm.

  • If you are updating an alert from manual to adaptive alerting, any previous sensitivity setting you had for quantile dynamic baselining will remain in place should you choose to go back to the manual setting. However, changing the quantile sensitivity setting does not impact the results of adaptive alerting. For example, if you change your quantile dynamic baseline setting in manual mode from low to high, you will not see any greater frequency of alerts in the adaptive mode unless you change the sensitivity setting on adaptive alerts directly.

  3. Quantile dynamic baselining continues to look back over 24 hours to assess anomaly thresholds. Adaptive alerting looks back over several days to assess the presence of alertable issues.

Adaptive Alert Example

Manual Alert Method

Our manual alert rule configuration allows you to set an alert to trigger after a certain number of agents, over a certain number of test rounds, report test results that exceed a set limit on a specific metric.

For example, you might set an alert to trigger when any three agents experience jitter greater than 15 ms at least four times in a row on a Network Agent-to-Server test, where the test frequency is set to every 15 minutes.

The above approach might work fine in ideal conditions, but networks rarely operate in ideal conditions, nor do their issues conform to rigidly defined parameters. For example, if 10 agents registered jitter at 100 ms, but only for three 15-minute rounds in a row, the above alert would not fire, even though there’s clearly an issue.
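
To see where this rigid rule falls short, here is a toy evaluation of the manual condition above (any three agents exceed 15 ms jitter in at least four consecutive rounds). The helper function is hypothetical and only illustrates the counting logic, not the ThousandEyes rule engine.

```python
def manual_rule_fires(agents_over_limit_per_round: list[int],
                      min_agents: int = 3,
                      consecutive_rounds: int = 4) -> bool:
    """Return True if at least `min_agents` exceed the metric limit in
    `consecutive_rounds` rounds in a row. A toy sketch, not product code."""
    streak = 0
    for count in agents_over_limit_per_round:
        streak = streak + 1 if count >= min_agents else 0
        if streak >= consecutive_rounds:
            return True
    return False

# Ten agents see 100 ms jitter, but only for three rounds in a row: the rule
# never fires, even though this is clearly an issue.
print(manual_rule_fires([0, 10, 10, 10, 0]))    # False
# The same severity sustained for four rounds would fire.
print(manual_rule_fires([0, 10, 10, 10, 10]))   # True
```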

Adaptive Alert Method

With adaptive alerting, the global alert conditions dynamically react to current network conditions to assess whether your metric threshold has been breached by enough agents over enough rounds to constitute an issue.

Suppose we're testing jitter on a Network Agent-to-Server test, as above. Assuming we're using adaptive alerting alongside quantile dynamic baselining (see When Used with Quantile Dynamic Baselining), the threshold above which an anomaly is registered is dynamically determined (that is, it's not set to 15 ms as in the above example, but is adjusted based on test and network behavior). Let's also assume that, looking back over previous days, as the model does, very few anomalies were detected at all, making the anomaly baseline close to 0.

Now, if we look at the last 48 hours, the blue curve shows the number of agents detecting an anomaly during this time. The red curve shows the calculated probability of having an underlying issue, where 1 = 100%.

Notice that between the hours of 9 and 22, there is almost always one agent detecting an anomaly. Because the baseline had been close to 0 before, this new temporal concentration of anomalies is a small but consistent indication of an underlying issue. This is why the issue probability slowly rises to 0.2, or 20%.

Between the hours of 25 and 37, the issue probability increases to almost 100% thanks to a modest but consistent uptick in the number of anomalies contributed.

A medium sensitivity setting with thresholds of 80% and 20% for triggering and clearing would open an alert here from hours 28 to 38. Note that even though the probability of there being an issue periodically dips below 80% between these hours, the alert does not clear until the probability reaches 20%, thus minimizing alert noise (see Reducing Alert Noise for more information).

By contrast, a manual alert rule set to “two anomalies twice in a row” would never have triggered an alert.

Reducing Alert Noise

With the manual method of alerting, network performance that hovers right around the threshold of an alert could trigger multiple alerts as the network oscillates back and forth across the alert threshold. For example, if an alert is set to trigger when jitter is greater than 15 ms, it will only clear once jitter reduces to 15 ms or less. If jitter oscillates between 14 ms and 16 ms, then each time it gets to 16 ms, an alert is triggered, and each time it's 15 ms or below, it's cleared. This just generates noise. In ThousandEyes parlance, these on-again, off-again alerts are called “flappy” alerts.

Adaptive alerting, by contrast, can do one of two things depending on the alert’s recent history. If the baseline (see What the Model Considers) is set high enough, the model might determine that the oscillation is a small enough breach that no alert is sent at all. If the baseline is low enough that an initial alert triggers the first time 15 ms is breached, the alert is more likely to remain active while the network oscillates, and only clear when jitter drops below the level of the oscillations, thus eliminating many unnecessary alerts for the same issue.

Implementing Adaptive Alerting

Adaptive alerting is the default setting for the global alert condition when you create a new alert rule. Here’s how to take advantage of it.

Creating a New Rule Using Adaptive Alerting

  1. Go to Alerts > Alert Rules.

  2. On the Cloud and Enterprise Agent tab, click Add New Alert Rule.

  • The rule panel opens to the Settings tab.

  • Within the Alert Conditions section, find the Alert Detection toggle.

  • The toggle defaults to the Adaptive setting.

  3. Set your Alert Sensitivity using the slider.

  • The default setting is Medium.

  4. Once you have configured the rest of your alert rule, click Create New Alert Rule.

Updating an Existing Rule to Adaptive Alerting

  1. Go to Alerts > Alert Rules.

  2. Find and click the alert rule you wish to update.

  • The rule panel opens to the Settings tab.

  • Within the Alert Conditions section, find the Alert Detection toggle.

  3. Change the Alert Detection setting to Adaptive.

  • For location alert settings that use static metrics, the metrics will remain the same.

  • For location alert settings that use dynamic metrics, notice that the sensitivity settings appear to have been removed. In fact, they have just been subsumed into the Adaptive Alert Sensitivity setting, so you only set the sensitivity level once, and it applies to both conditions.

  4. Click Save Changes.
