Understanding Quality
Quality refers to the overall performance of a network connection as experienced by users, in terms of loss, latency, and jitter. As a metric, it expresses the percentage of time that your network is determined to be within SLA (Service-Level Agreement), including a safety margin. The thresholds for these measures are defined separately for each application category, as described in Application Categories.
Path Quality
Path quality is an aggregation of three network performance measures (loss, latency, and jitter), expressed as a single weighted percentage. The closer any one of these measures gets to its threshold, the greater the risk that network performance will exceed the quality thresholds in the near future. Exceeding a threshold is also known as an SLA (Service-Level Agreement) violation.
By improving path quality, you reduce the chances of providing poor service to end users. Quality is a value between 0 and 1 that WAN Insights displays as a percentage, so a perfect quality score is 1 (100%). Measured quality is 1 minus the estimated probability of experiencing a problem.
Put another way, quality is the statistical probability that the network does NOT experience an SLA violation. For example, a path quality of 97% means that your network is likely to remain within SLA thresholds 97% of the time.
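As a rough illustration of "1 minus the estimated probability of a problem", the sketch below estimates that probability empirically as the fraction of measurement intervals that violated the SLA. This is a simplification under stated assumptions: `quality_from_intervals` is a hypothetical name, and WAN Insights estimates the probability with its own model (which includes a safety margin) rather than by simple counting.

```python
def quality_from_intervals(violations: list[bool]) -> float:
    """Hypothetical estimate: quality = 1 - (fraction of intervals that violated the SLA).

    Each entry is True if that measurement interval violated the SLA.
    """
    if not violations:
        raise ValueError("need at least one interval")
    p_violation = sum(violations) / len(violations)  # empirical probability of a problem
    return 1.0 - p_violation


# 100 intervals, 3 of which violated the SLA
intervals = [True] * 3 + [False] * 97
print(f"{quality_from_intervals(intervals):.0%}")  # prints "97%"
```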
The separate thresholds for loss, latency, and jitter constitute the service-level agreement (SLA), the standard that network performance must meet. Failing to meet these thresholds does not mean the network is “down”; it just means network traffic isn’t moving as well as it should. The SLA sets expectations for quality of service (QoS). These raw performance measures are the building blocks from which the path quality metric is ultimately derived.
However, simply meeting the SLA isn’t enough. We want to be confident that performance will continue to meet these thresholds in the future. Path quality is really a measure of how safely the network stays within the three QoS thresholds.
The path quality metric is shown on the recommendation cards and the site summary views, comparing default (actual) and recommended path quality.
Path quality is sometimes referred to as the quality of experience score, or QoE score. Note that QoE is distinct from the QoS measures shown on the Hourly Quality of Circuits portion of the Recommendation Details Modal.
Current vs. Recommended Path Quality
Path quality is shown in WAN Insights both with and without a site recommendation applied. These two values appear on every screen within WAN Insights, rolled up differently depending on whether you’re viewing a high-level summary or drilling all the way down to an endpoint pair.
Current path quality, or Current quality, is the quality actually being experienced now, also known as the observed path quality. If no site recommendation exists, only the current path quality is shown.
Recommended path quality, or Projected quality, is the quality that could potentially be realized if the WAN Insights recommendation were acted upon.
Quality of Service (QoS)
Quality of service measures network performance at the circuit level, between endpoint pairs. The raw measures are evaluated against defined quality thresholds. Quality of service is measured separately for loss, latency, and jitter, each as a weighted percentage that expresses how close that measure is to its threshold. This weighted percentage is the QoS score.
For example, if the latency threshold is 300 ms and observed latency is 299 ms, that is not good enough for a perfect quality score. Ideally, latency should be well below 300 ms: even when it is below the threshold, being close to the edge can make the user experience suboptimal, leaving room for improvement. In this example, a latency of 300 ms would translate to a QoS score of 50%, a latency of 250 ms to a score of 80%, and a latency of 200 ms to a score of around 95%.
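The exact curve WAN Insights uses to map a raw measure to a QoS score is not documented here. As a hypothetical sketch, a logistic curve centered on the threshold reproduces the first two points of the example exactly (50% at 300 ms, 80% at 250 ms) and gives about 94% at 200 ms, close to the "around 95%" cited above. `latency_qos_score` and `STEEPNESS` are illustrative names, not part of any WAN Insights API.

```python
import math

# Hypothetical curve, not the documented WAN Insights formula.
# Steepness is chosen so that a latency 50 ms below the threshold scores exactly 0.8.
STEEPNESS = math.log(4) / 50  # per millisecond


def latency_qos_score(latency_ms: float, threshold_ms: float = 300.0) -> float:
    """Logistic score in (0, 1): exactly 0.5 at the threshold, approaching 1 well below it."""
    return 1.0 / (1.0 + math.exp(STEEPNESS * (latency_ms - threshold_ms)))


for ms in (300, 299, 250, 200):
    print(ms, f"{latency_qos_score(ms):.0%}")
# prints: 300 50%, 299 51%, 250 80%, 200 94% (one pair per line)
```

A logistic shape captures the idea in the text: being just under the threshold (299 ms) earns barely more than being at it, while the score climbs quickly as latency moves comfortably below the threshold.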
Measurement frequency is described below.
Network probes run on edge routers.
For P2P tunnels, a bidirectional forwarding detection (BFD) probe measures loss, latency, and jitter on every tunnel in the SD-WAN overlay. The default interval is 1 probe per second. This bidirectional probe measures round-trip time (RTT), so congestion on either ingress or egress can potentially impact quality.
For Direct Internet Access (DIA) or Secure Internet Gateway (SIG) tunnels, WAN Insights uses Cloud on Ramp (CoR). CoR sends 10 HTTP pings (1 per second), and then is silent for 20 seconds.
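WAN Insights averages probe results over 10-minute intervals. The sketch below works out how many samples each probe type contributes to one such window, using the default intervals described above. It assumes that the CoR burst spans 10 seconds (10 pings, 1 per second) before the 20-second silence, giving a 30-second cycle, and that windows align with cycle boundaries. The function names are hypothetical.

```python
WINDOW_S = 10 * 60  # 10-minute averaging window, in seconds


def bfd_probes_per_window(interval_s: float = 1.0) -> int:
    """BFD on P2P tunnels: one probe every interval_s seconds (default 1 per second)."""
    return int(WINDOW_S // interval_s)


def cor_pings_per_window(burst: int = 10, gap_s: int = 1, silent_s: int = 20) -> int:
    """Cloud on Ramp: `burst` HTTP pings spaced gap_s apart, then silent_s seconds of silence."""
    cycle_s = burst * gap_s + silent_s  # 10 s of pinging + 20 s of silence = 30 s
    full_cycles = WINDOW_S // cycle_s   # assumes the window aligns with cycle boundaries
    return full_cycles * burst


print(bfd_probes_per_window())  # 600
print(cor_pings_per_window())   # 200
```

Under these assumptions, a DIA or SIG path is averaged over roughly a third as many samples per window as a P2P tunnel.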
WAN Insights performs its calculations on probe results averaged over 10-minute intervals. Every 10 minutes:
The probe results are averaged separately for loss, latency, and jitter.
Each average is then assigned a quality of service (QoS) score. There are separate QoS scores for loss, latency, and jitter.
All the QoS scores are rolled up into a single metric for path quality.
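The three steps above can be sketched for a single 10-minute interval as follows. Every number here is an assumption for illustration: the thresholds stand in for an application category's SLA, the logistic scoring curve and its steepness are hypothetical, and the rollup is shown as a weighted mean with made-up weights because the actual WAN Insights weighting is not documented. `qos_score` and `path_quality` are illustrative names.

```python
import math
from statistics import mean

# Hypothetical values for illustration only. Real thresholds come from the
# application category; the real weights and scoring curve are not documented.
THRESHOLDS = {"loss": 2.0, "latency": 300.0, "jitter": 50.0}  # %, ms, ms
WEIGHTS = {"loss": 0.4, "latency": 0.4, "jitter": 0.2}
K = 6 * math.log(4)  # assumed steepness: a value at 5/6 of its threshold scores 0.8


def qos_score(value: float, threshold: float) -> float:
    """Assumed logistic QoS score: 0.5 at the threshold, approaching 1 well below it."""
    z = min(K * (value / threshold - 1.0), 700.0)  # cap to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(z))


def path_quality(samples: dict[str, list[float]]) -> float:
    """Roll one 10-minute interval of probe samples up into a path quality in (0, 1)."""
    # Step 1: average the probe results separately for each measure.
    averages = {m: mean(vals) for m, vals in samples.items()}
    # Step 2: give each average its own QoS score against its threshold.
    scores = {m: qos_score(avg, THRESHOLDS[m]) for m, avg in averages.items()}
    # Step 3: roll the QoS scores up into one weighted percentage.
    total_weight = sum(WEIGHTS[m] for m in scores)
    return sum(WEIGHTS[m] * s for m, s in scores.items()) / total_weight


interval = {
    "loss": [0.0, 0.0, 0.0],
    "latency": [240.0, 250.0, 260.0],
    "jitter": [10.0, 15.0, 20.0],
}
print(f"path quality: {path_quality(interval):.0%}")
```

A weighted mean is used here only because the text describes path quality as a weighted percentage; a rollup such as the minimum per-measure score would instead let a single poor measure dominate the result.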
You can see QoS breakouts on the Recommendation Details Modal screen under Hourly Quality of Circuits. Meanwhile, the QoS scores for loss, latency, and jitter are rolled up into the aggregated path quality score shown on the recommendation cards and the site summary views.
Quality of Experience (QoE)
The quality of experience (QoE) score is a calculated score based on aggregated quality of service (QoS) measurements for loss, latency, and jitter. The QoE score is the final rollup that informs the Current path quality and Recommended path quality shown on the WAN Insights user interface. Note that the QoE score indicates a level of risk and corresponding potential for improvement, not actual outages.
A low average score could indicate a temporary network outage, or that services are available but do not meet the thresholds. A user who sees a QoE of 5% might mistake it for a throughput figure. Throughput of 5% would be exceedingly poor and would probably constitute a network outage. A QoE of 5% does not mean that only 5% of your packets are successfully transiting the network; it means that network performance almost never meets the thresholds. That constitutes an SLA violation, even if network services are still available.