Dynamic Baselines

Dynamic baselines are calculation systems that identify deviations from normal values in a dataset, where the distribution of normal values is determined from a set of past values of the data (i.e., past behavior). Dynamic baselines allow you to create alert rules that more accurately reflect the natural fluctuation in your test data.

Types of Dynamic Baselines

There are two types of dynamic baselining you can choose from. The first is quantile dynamic baselining. The second comprises three different calculations that use more classic dynamic baselining methods; these methods are standard deviation, percentage change, and absolute values. Using any of the quantile or classic methods, you can configure alert rules that dynamically determine whether to fire an alert or not, based on the historical data of a specific test.

Dynamic baselines are currently only available for Cloud, Enterprise, and Endpoint Agent alerts.

24-Hour Window for Dynamic Baselines

When you configure an alert rule that uses a dynamic baseline in its alert conditions, the ThousandEyes platform looks back at a 24-hour window to determine the baseline of a specific metric.

Within a 24-hour window, the baseline for the metric is based on at least 24 data points (24 hours * 1-hour test intervals). This ensures that alerts are triggered by a valid deviation from a reliable baseline. For additional accuracy, the classic dynamic baselines are updated every five minutes, and quantile dynamic baselines every 15 minutes, to accommodate any new data from new test rounds.

Inspecting Alerts Triggered by a Dynamic Baseline

When an alert is triggered based on a dynamic-baseline alert rule, you can inspect the dynamic baseline used. In the ThousandEyes platform, go to Alerts > Alert List. On the Active Alerts or Alert History tab, hover over the tooltip to see condition information about the alert rule that generated this alert.

Metrics for Dynamic Baselines

Dynamic baselines are currently supported for the following alert types and metrics.

Note that all metrics default to the same values for all dynamic baseline types except absolute values. Quantile sensitivity (the variable associated with the quantile calculation) defaults to medium sensitivity, standard deviation defaults to 2 standard deviations above the baseline, and percent change defaults to 20% above the baseline; only absolute value defaults to different values depending on the metric.

Cloud and Enterprise Agents

| Alert Type | Metric |
| --- | --- |
| DNS - DNS server | Resolution time |
| Network - Agent-to-agent | Jitter |
| Network - Agent-to-agent | Latency |
| Network - Agent-to-server | Jitter |
| Network - Agent-to-server | Latency |
| Network - Agent-to-server | Proxy jitter |
| Network - Agent-to-server | Proxy latency |
| Voice - RTP stream | Latency |
| Voice - RTP stream | Packet delay variation |
| Voice - SIP server | Connect time |
| Voice - SIP server | DNS time |
| Voice - SIP server | Invite time |
| Voice - SIP server | Options time |
| Voice - SIP server | Register time |
| Voice - SIP server | Response time |
| Voice - SIP server | Total time |
| Voice - SIP server | Wait time |
| Web - FTP server | Connect time |
| Web - FTP server | DNS time |
| Web - FTP server | FTP negotiation time |
| Web - FTP server | Response time |
| Web - FTP server | SSL negotiation time |
| Web - FTP server | Total time |
| Web - FTP server | Transfer time |
| Web - FTP server | Wait time |
| Web - HTTP server | Connect time |
| Web - HTTP server | DNS time |
| Web - HTTP server | Receive time |
| Web - HTTP server | Response time |
| Web - HTTP server | SSL negotiation time |
| Web - HTTP server | Total time |
| Web - HTTP server | Wait time |
| Web - Page load | DOM load time |
| Web - Page load | Page load time |
| Web - Page load | Response time |
| Web - Transaction | Transaction time |

Endpoint Agents

| Alert Type | Metric |
| --- | --- |
| Network - Endpoint end-to-end (server) | Jitter |
| Network - Endpoint end-to-end (server) | Latency |
| Web - Endpoint HTTP-server | Connect time |
| Web - Endpoint HTTP-server | DNS time |
| Web - Endpoint HTTP-server | Receive time |
| Web - Endpoint HTTP-server | Response time |
| Web - Endpoint HTTP-server | SSL negotiation time |
| Web - Endpoint HTTP-server | Throughput |
| Web - Endpoint HTTP-server | Total time |
| Web - Endpoint HTTP-server | Wait time |

Quantile Dynamic Baselining

Quantile dynamic baselines work well for detecting anomalies in datasets that are not normally distributed, that is, datasets that do not form a symmetric bell curve when plotted on a graph. ThousandEyes customers often have such datasets. Quantile calculation takes a set of data points and calculates values such that a given proportion of data points sit below them. For example, when you split a set of data points into four equal parts, the first quartile has 25% of data points below it, the second quartile (the median) has 50% of data points below it, and so on. The calculation depends only on the ordering of the data points and is agnostic to the values the quantiles represent. This is why quantiles suit irregularly shaped data curves, such as the positively skewed distribution curve our customers’ test data often exhibits.

Focusing analysis on the interquartile range (the middle 50% of data points) helps prevent major outliers from overinfluencing where the baseline rests, and reduces the number of false positive alerts.

ThousandEyes employs a version of this quantile method. We define a lower quantile and an upper quantile to “fence in” the most behaviorally sound data (similar to the interquartile range). We then apply a threshold factor, which relates to your sensitivity level (high, medium, or low), to determine the threshold beyond which data points are considered anomalies. The threshold factor makes the “fence” wider or narrower, so you control whether you receive more or fewer alerts.

The formula that makes this possible is:

Alert threshold = (Qupper - Qlower) * k + Qlower + 1

Where Qupper is the higher quantile, Qlower is the lower quantile, and k is the threshold factor.

While the upper and lower quantiles and threshold factors are fixed, the data values they correspond to are constantly changing with the changing dataset, effectively adapting the alerting mechanism to the current shape and behavior of the data. For example, in a wide-ranging dataset, quantile analysis eliminates the higher and lower extremes to determine the baseline and threshold, resulting in alerts that fall outside an otherwise large range of values. In this scenario, small fluctuations outside the threshold get ignored in preference for larger deviations that distinguish true outliers. Meanwhile, in a dataset that is virtually flat, the upper and lower quantiles would essentially equal each other, so even a tiny variation would likely result in an alert, which may create unnecessary noise. To mitigate this, we introduce an additional 1 ms deviation threshold (the "+ 1" in the formula) to keep noise to a minimum.
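As a rough illustration of the formula above, the following Python sketch derives a threshold from a list of measurements. The function name, the 10th and 90th percentile choices, the threshold factor of 2, and the use of the "inclusive" quantile method are all assumptions made for illustration; the quantiles and factors the platform actually uses are not stated here.

```python
import statistics

def quantile_alert_threshold(values_ms, lower_pct=10, upper_pct=90, k=2.0, floor_ms=1.0):
    """Return (Qupper - Qlower) * k + Qlower + floor_ms for a window of values.

    lower_pct, upper_pct, and k are illustrative assumptions, not platform values.
    """
    # quantiles(n=100) returns the 99 cut points between the 100 percentile bins
    cuts = statistics.quantiles(values_ms, n=100, method="inclusive")
    q_lower = cuts[lower_pct - 1]
    q_upper = cuts[upper_pct - 1]
    return (q_upper - q_lower) * k + q_lower + floor_ms

# A virtually flat dataset: both quantiles are equal, so only the 1 ms floor
# separates the threshold from the data.
flat = [20.0] * 288
print(quantile_alert_threshold(flat))  # 21.0

# A wide-ranging dataset (1 ms to 101 ms): the threshold sits well above the data.
wide = list(range(1, 102))
print(quantile_alert_threshold(wide))  # 172.0
```

In the flat case, a measurement must exceed 21 ms before it counts as an anomaly, which is the noise-damping effect of the extra 1 ms.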

We update the values the quantiles represent every 15 minutes, where the window over which they are computed is 24 hours.

Sensitivity Level

Sensitivity is defined as your need to see more or fewer alerts. A high sensitivity level corresponds to your need to see more alerts, whereas a low sensitivity level corresponds to your need to see fewer alerts.

A good way to think about this is in terms of business criticality. If you set an alert against a test that determines how well a very critical part of your business is performing, you may want to see any slight deviation from a normal range and decide for yourself whether the deviation is anything to worry about. You have a high sensitivity level. This would result in you receiving more alerts, though not all of them may be true anomalies.

If you set an alert against a test that monitors a part of your business that is perhaps less urgent in nature should something go wrong, you may wish to minimize noise and see fewer alerts here, but where the alerts are more likely to suggest a true anomaly. You have a low sensitivity level.

Your sensitivity level and the equation’s threshold factor have an inverse relationship. As the threshold factor (k) increases (i.e., threshold allows for only larger deviations), your sensitivity level decreases (i.e., larger deviations equal fewer alerts), and vice versa. See the quantile example scenario for how this works in practice. When you select the sensitivity setting on your alert rule, you are selecting your sensitivity level, not the threshold factor.

Quantile Example Scenario

To simplify an example, let’s set the upper quantile to the 90th percentile (90% of data points sit below it), or Q90, and its value to 100 ms. Set the lower quantile to the 10th percentile, or Q10, and its value to 10 ms. Set the threshold factor to 2. The formula would look like this:

Alert threshold = (Q90 – Q10) * 2 + Q10 + 1

Alert threshold = (100 ms – 10 ms) * 2 + 10 ms + 1 ms

Alert threshold = 191 ms

This means that an alert will fire for any value above 191 ms. Note in the image below that the threshold value is still within the range of values that have appeared over the last 24 hours, but only the most extreme data points exceed it and trigger alerts. Increase the threshold factor (which lowers your sensitivity level), and the threshold moves farther to the right, so even fewer data points exceed it and you receive fewer alerts. Lower the threshold factor (which raises your sensitivity level), and you receive more alerts.
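The arithmetic in this scenario can be reproduced with a short Python sketch. The function name, the sample measurements, and the alternative threshold factors of 1 and 3 are hypothetical; only Q90 = 100 ms, Q10 = 10 ms, and the factor of 2 come from the example.

```python
def alert_threshold(q_upper_ms, q_lower_ms, k, floor_ms=1.0):
    """Apply the quantile formula: (Qupper - Qlower) * k + Qlower + floor."""
    return (q_upper_ms - q_lower_ms) * k + q_lower_ms + floor_ms

threshold = alert_threshold(100.0, 10.0, k=2)
print(threshold)  # 191.0

# Hypothetical measurements: only values strictly above the threshold alert.
samples_ms = [95.0, 150.0, 191.0, 192.0, 250.0]
print([v for v in samples_ms if v > threshold])  # [192.0, 250.0]

# A larger factor (lower sensitivity) pushes the threshold up; a smaller one pulls it down.
for k in (1, 2, 3):
    print(k, alert_threshold(100.0, 10.0, k))  # 101.0, 191.0, 281.0
```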

Classic Dynamic Baselining

Classic dynamic baselines let you choose between three types of deviations from a mean: standard deviation, percentage (relative) change, and absolute value.

  • Standard deviation: This calculation first measures how far your data points typically deviate from the mean. It does this by taking the square root of the variance, which is the average of the squared differences between individual data points and the mean (squaring removes negative values). You then choose how many standard deviations from the mean a value must be before an alert triggers (the default is two).

  • Percentage change: This calculation determines when to trigger an alert based on the relative change from the mean, in percentage terms, such as 20% (the default) above or below the mean.

  • Absolute value: This calculation determines when to trigger an alert based on a static (absolute) value above or below the mean, such as 50 ms (defaults differ depending on metric).
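The three calculations above can be sketched in Python as follows. The function names and sample data are hypothetical, only the upper bound (above the mean) is shown, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say whether a population or sample formula applies.

```python
import statistics

def std_dev_threshold(values_ms, multiplier=2.0):
    """Mean plus a number of standard deviations (population formula assumed)."""
    mean = statistics.fmean(values_ms)
    return mean + multiplier * statistics.pstdev(values_ms)

def percentage_threshold(values_ms, percent=20.0):
    """Mean plus a relative change, expressed as a percentage of the mean."""
    return statistics.fmean(values_ms) * (1 + percent / 100)

def absolute_threshold(values_ms, offset_ms):
    """Mean plus a fixed offset in the metric's own unit."""
    return statistics.fmean(values_ms) + offset_ms

# Hypothetical history: mean 500 ms, standard deviation 10 ms.
history_ms = [490.0, 510.0, 490.0, 510.0]
print(std_dev_threshold(history_ms))         # 500 + 2 * 10
print(percentage_threshold(history_ms))      # 500 * 1.2
print(absolute_threshold(history_ms, 50.0))  # 500 + 50
```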

Classic Example Scenario

Imagine a scenario where an HTTP server test runs every ten minutes. Over the course of 24 hours, an agent in New York runs the test 144 times. The test gathers response times with an average of 500 ms.

Attached to this test is an alert rule that uses a dynamic baseline. Based on the results so far, whether an alert will fire or not for the next test round depends on whether it was configured using standard deviation, percentage change, or an absolute value:

  • Standard deviation - Say that the standard deviation for the last three hours' results is 36 ms. Using the default multiplier of 2, the alert will fire if the next test round returns a response time greater than 500 + (36 * 2) = 572 ms.

  • Percentage change - Using the default percentage change of 20%, the alert will fire if the next test round returns a response time greater than 500 + (500 * 20%) = 600 ms.

  • Absolute value - Using the default absolute value of 50 ms for HTTP tests, the alert will fire if the next test round returns a response time greater than 500 + 50 = 550 ms.

The different options allow you to adapt your alerting framework to better reflect the fluctuation in test results, and ensure that your system isn’t overwhelmed with alerts because of static metric baselines.
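To see how the three options differ for the same test round, the sketch below applies the scenario's numbers (500 ms average, 36 ms standard deviation, and the defaults of 2, 20%, and 50 ms) to a few candidate response times. The names and the candidate values are hypothetical.

```python
MEAN_MS = 500.0  # 24-hour average from the scenario

# Upper thresholds for each classic method, using the scenario's numbers.
thresholds_ms = {
    "standard deviation": MEAN_MS + 36.0 * 2,      # 572 ms
    "percentage change": MEAN_MS * (1 + 20 / 100), # 600 ms
    "absolute value": MEAN_MS + 50.0,              # 550 ms
}

def fired_rules(response_ms):
    """Return the methods whose threshold the response time exceeds."""
    return [name for name, limit in thresholds_ms.items() if response_ms > limit]

print(fired_rules(560.0))  # ['absolute value']
print(fired_rules(580.0))  # ['standard deviation', 'absolute value']
print(fired_rules(610.0))  # ['standard deviation', 'percentage change', 'absolute value']
```

A 560 ms result would fire only the absolute-value rule, which shows how the choice of method changes alert volume for the same data.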

Three-Hour Window for Standard Deviation Dynamic Baselines

When your dynamic baseline is calculated based on a standard deviation, the ThousandEyes platform looks back at a three-hour window. (The average baseline is still based on the 24-hour window.) This ensures that any deviation from the mean is in line with the most recent data points. It also mitigates the risk of an extremely steady standard deviation over a 24-hour period, which would prevent an alert that might be of interest to you from firing.
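A minimal Python sketch of this two-window calculation follows: the mean comes from the last 24 hours and the standard deviation from the last three hours. The function name, the synthetic data, and the use of the population standard deviation are assumptions for illustration.

```python
from datetime import datetime, timedelta
import statistics

def two_window_threshold(samples, now, multiplier=2.0):
    """samples is a list of (timestamp, value_ms) pairs.

    Mean over the last 24 hours, standard deviation over the last 3 hours
    (population formula assumed).
    """
    day = [v for t, v in samples if now - timedelta(hours=24) <= t <= now]
    recent = [v for t, v in samples if now - timedelta(hours=3) <= t <= now]
    return statistics.fmean(day) + multiplier * statistics.pstdev(recent)

# Synthetic data: one sample every 10 minutes for 24 hours (144 samples).
# The most recent 18 samples (3 hours) alternate 490/510 ms; older ones are a flat 500 ms.
now = datetime(2024, 1, 2, 0, 0)
samples = []
for i in range(1, 145):
    value = (490.0 if i % 2 else 510.0) if i <= 18 else 500.0
    samples.append((now - timedelta(minutes=10 * i), value))

print(two_window_threshold(samples, now))  # 500 + 2 * 10 = 520.0
```

With this data, a standard deviation taken over the whole 24 hours would be much smaller (about 3.5 ms), because the long flat stretch dilutes the recent variation.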

Creating Alert Rules with Classic Dynamic Baselines

The image below shows an example alert rule that uses a classic dynamic baseline. The alert rule condition states that if the response time exceeds the last 24 hours' average by more than two times the last three hours' standard deviation, the alert will fire.

Mitigating Noisy Standard Deviation Alerts

Dynamic baseline alerts that are based on standard deviation can be very "noisy" for metrics with a small or very stable average. For example, the standard deviation of latency for a service could be less than 1 ms. A jump from 20 ms to 20.4 ms isn't inherently cause for concern, but if you have configured a sensitive standard deviation alert rule, alerts could fire regularly and increase noise.

When you configure an alert rule with a standard deviation-based dynamic baseline condition, we recommend that you add a second alert condition based on an absolute difference from the average. For example, you can add a condition that says latency > 5 ms above the mean. This ensures that your alert rule fires only if the value is above the standard deviation threshold and also above a certain absolute threshold compared to your average.
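The combined condition can be sketched in Python as follows. The function name and its parameters are hypothetical, and the 0.1 ms standard deviation is an assumed value chosen to match the latency example above.

```python
def should_alert(value_ms, mean_ms, std_ms, multiplier=2.0, min_abs_diff_ms=5.0):
    """Fire only when BOTH conditions hold:
    1. the value exceeds mean + multiplier * standard deviation, and
    2. the value is more than min_abs_diff_ms above the mean.
    """
    exceeds_std = value_ms > mean_ms + multiplier * std_ms
    exceeds_abs = value_ms - mean_ms > min_abs_diff_ms
    return exceeds_std and exceeds_abs

# A very stable metric: mean 20 ms, standard deviation 0.1 ms (assumed).
print(should_alert(20.4, mean_ms=20.0, std_ms=0.1))  # False: 0.4 ms is under the 5 ms floor
print(should_alert(27.0, mean_ms=20.0, std_ms=0.1))  # True: both conditions are met
```

Without the second condition, the 20.4 ms reading would exceed the 20.2 ms standard deviation threshold and fire an alert.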

Alternatively, you could change to a quantile calculation with a low sensitivity rating. This combination is designed to reduce “noise” and trigger only on the most anomalous data. See quantile dynamic baselining for more information.
