Using Dashboards to Tell a Story
An effective dashboard is one that you can act on, with clarity and confidence. ThousandEyes Dashboards show customized live views of your data: dashboards allow you to see what’s going on at a glance, coupled with ThousandEyes Alerts to know when immediate action is needed. As you build out dashboards for your team or other groups within your organization, here are some general principles to help guide your thinking.
This article assumes a basic familiarity with ThousandEyes dashboards. Although our dashboard templates allow you to bypass much of this groundwork, understanding the underlying concepts will help you tune those templates and create effective custom dashboards of your own.
To see these design principles in action, try adding a dashboard from one of the ThousandEyes dashboard templates as described in Using the Dashboard Templates. You’ll also need to add some tests so your dashboards have data. Alternatively, you can enable a few of the free pre-configured tests available under Sharing > Shared by ThousandEyes.
Operationalize ThousandEyes by Telling the Right Story
Building a great dashboard requires thoughtful choices. When creating your dashboards, there’s an art to choosing the right data, arranging it visually, and, most importantly, telling the right story.
Start with a problem statement to serve as your dashboard’s focus. Choose actionable metrics and readable widgets. Use an integrated strategy that aligns with your troubleshooting flow and encompasses both dashboards and alerts.
Dashboards and alerts allow you to “operationalize” ThousandEyes in different ways:
Alerts can go to paging and ticketing systems where on-call support engineers can react to them quickly.
You can also use ThousandEyes Alerts to trigger automated remediations in other systems. See our docs page on Event-Driven Ansible for an example.
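As a sketch of what such automated routing might look like, the snippet below maps an incoming alert notification to a remediation action, the kind of decision an Event-Driven Ansible rulebook condition or a webhook receiver would make. The payload field names (`ruleName`, `testName`) and the playbook names are illustrative assumptions, not the exact ThousandEyes webhook schema; check the webhook documentation for your account.

```python
import json

def choose_remediation(payload: str) -> str:
    """Map a hypothetical alert payload to a remediation action."""
    alert = json.loads(payload)
    rule = alert.get("ruleName", "")
    # Route network-layer alerts and application-layer alerts differently.
    if "Packet Loss" in rule:
        return "run-network-failover-playbook"
    if "HTTP" in rule:
        return "restart-app-servers-playbook"
    # Anything unrecognized falls back to a human-reviewed ticket.
    return "open-ticket"

example = json.dumps({"ruleName": "Packet Loss > 5%", "testName": "Portal from EU"})
print(choose_remediation(example))  # -> run-network-failover-playbook
```

In practice the alert rule name (or a dedicated field in the notification) is what carries the routing signal, which is one more reason to name alert rules descriptively.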
Dashboards can be part of regular team meetings and discussions, where the story told by the dashboard can’t be captured in a single alert. They can also show trends so you and your teams can become more proactive.
Use Dashboard Templates to Begin the Story
This article cites examples from ThousandEyes dashboard templates, which are blueprints that you can use as a starting point. The templates aren’t quite the same as the personalized “stories” described in this article.
Think of dashboard templates as precursors to your stories: curated collections of elements that you can adapt, using your imagination, to suit the story you want to tell. The dashboard is not the story itself; what matters is the actions you can now take as a result of the insights on that dashboard.
What Makes for a Good Story?
Dashboard creators sometimes place too much emphasis on “adding lots of widgets”. An effective dashboard is not about how many widgets you use, or how much data can be crammed into a single widget or dashboard.
A dashboard should tell a story, with the best possible use of widgets and metrics to derive effective conclusions.
A good dashboard should:
Tell a story that you can articulate
Address a clearly defined problem
Have an actionable business impact
In other words, the story consists of a general situation where a dashboard can expedite solving a problem, plus an example scenario to test or demonstrate how it works in practice.
Brief Story Examples
Here are a few examples of how dashboards can help your teams stay on top of things:
Periodically, you see things that look like potential outages on a site, service, or device (a URL or a network address) that you’re monitoring from multiple locations. Whenever this happens, you need to know how widespread the problem is, and whether it’s primarily an application-layer problem or a network-layer problem. You have 20 agent-to-server tests, originating from different sites, and you don’t want to start by looking at every test view. To expedite the journey to resolution, you create a single overview dashboard to address all sites.
Your SRE team monitors 30 BGP prefixes on an ongoing basis. Part of your standard operational workflow is discussing a series of dashboards in your weekly team stand-up meetings. Based on what you observe over time, you can establish a baseline that helps you recognize anomalies and deviations from normal, and isolate points of concern.
Dashboards and Alerts Are Complementary
ThousandEyes alerts and ThousandEyes dashboards aren’t the same thing. They complement one another.
Dashboards are more about seeing the current state of the world. Think of a dashboard as a building fire safety audit: making sure the alarm is online and the sprinklers are working. In addition to reviewing the current state, you can use a dashboard to quickly see deviations or changes over time.
Alerts are more about fighting fires than preventing them. Suppose a fire alarm goes off and the sprinkler system activates: now you need to figure out whether it’s a real fire and, if so, where to send the firefighting brigade.
Both dashboards and alerts include some retrospective capabilities.
Dashboards are normally current, showing the most recent test rounds. For history, you can filter up to the past 60 days.
Alert history filters can show up to 90 days.
Both dashboards and alerts can be configured to show degradations as well as complete outages.
You can embed individual dashboard widgets inside another application, as described in Embedding Dashboard Widgets in External Web Sites, to share information with groups outside your organization.
Other Dashboard Design Tips
This section is a collection of best-practice suggestions for your consideration.
Use Visual Symmetry
Where it makes sense, show two widgets side by side by choosing the half-screen option in the widget configuration. For example, one of the story examples earlier involved monitoring the same application server target URL from multiple locations using an HTTP server test, then comparing the service availability with the network availability.
Placing the same type of widget side by side provides symmetry. Less obvious is how to make those widgets congruent, so that the top problems on each side sort to the top of each widget. Then you can scan them to see if anything lines up.
“If your dashboard doesn’t look pretty, no one’s going to use it.”
– ThousandEyes solutions engineer
This design principle relies on adjacency and congruence: put things that are meant to be compared close together, on the same page, and in the same positions, so that the user doesn’t have to waste mental energy on scrolling and context-switching.
Tell the Story in the Name
Use each widget title to tell your users (your customers) what they are supposed to look for, for example:
HTTP Error Phase (TCP = likely network problem; Rest = likely application problem)
The intent is to direct the user to answering two questions:
Is it the network? Check for TCP-related errors first; if network traffic is blocked, you won’t be able to connect to the application at all.
Is it the application? If the network looks normal but you’re seeing HTTP server errors, look for “the rest”, as in “look at everything else”.
Another descriptive dashboard widget title with a very specific action item could be:
Packet Loss - last hour (If > 5% mesg NetOps)
Start with Most Recent Test Round
By default, dashboards show only the most recent rounds of test data, although that depends on your test intervals, the dashboard's time interval, and refresh settings.
Your dashboards might not refresh automatically unless you toggle auto-refresh on in the upper right. You can customize the refresh rate from 2 minutes up to one hour. (Enabling dashboard auto-refresh is not necessary for this discussion.)
Next to the refresh rate is a history setting for the total time range of data shown on the dashboard, measured back from the last refresh. Dashboard history defaults to 24 hours, but you can extend it to 60 days or specify a fixed time interval.
Your test interval should make sense given the dashboard refresh rate. There’s no point in refreshing a dashboard every 2 minutes if the tests on it only run once an hour.
Scroll Down to Drill Down
A fine balance exists between too much scrolling on a single dashboard, and having too many smaller dashboards that become difficult to manage. While you can create a one-page dashboard that doesn’t require scrolling, that dashboard could end up being overly specific and tailored for a small audience.
A dashboard should go from a higher level to a lower level as you scroll down the page. If you see an issue on the top level, under overall health, you can scroll down to dive into more specific data, with the ultimate goal of drilling down into the underlying ThousandEyes tests. These tests, or the results from these tests over time, are the ultimate source for the data shown on the dashboard.
Some designers prefer that all page designs remain short because they feel that scrolling defeats the purpose and that if the dashboard itself is too long, people won’t open it. You can address this concern by front-loading the “summary” information at the top. Users aren’t forced to scroll to get the bigger picture, but they can look further if they want to see more granular data.
Use Filters for Further Refinement
Additionally, dashboard filters allow you to view specific data sources and sets of tests or agents. Filters also make it easy to reuse a dashboard, simply customizing what it reflects based on your selections. You can also:
Apply a temporary filter to a dashboard you’re already viewing
Save this temporary filter for later
Load a saved filter
Load a saved filter from a dashboard that was created using a dashboard template
Fine-Tuning Your Dashboards
In addition to the best practices above, the following items came from our teams who often create dashboards for demonstration purposes, and thus need to communicate effectively to new audiences.
Avoid using the Color Grid widget for time-based metrics
Why: The Color Grid widget uses the same coloring threshold for all cards, which can be misleading if some test targets are always slower than others. For example, a 100ms latency for a particular test target could appear on a dashboard in red, indicating a problem, even when a 100ms latency for that particular test target is within the normal range.
Do this instead:
Use a Number widget to represent metrics involving time.
If you don’t care about color-coding, you can use a Table or Multi-Metric Table widget when displaying time-based metrics.
Use smaller buckets of time to represent your data (e.g., 5 minutes)
Why: A smaller time bucket will likely represent the latest round of test data and thus is more actionable. Metrics tend to become diluted in magnitude when using larger buckets of time. It is always useful to start small, and then look at larger buckets to see historical trends. Some of the dashboard templates have the same graph in larger and smaller buckets side by side.
Do this instead:
Use a Number or Table widget (with the comparison-to-previous-time-span switch turned on) next to a Time Series widget to represent time-based metrics.
For example, suppose you care about latency. The Number widget allows you to see if the latency has spiked over multiple test rounds. And if latency is spiking, the Time Series widget will allow you to see if this sort of latency spike is happening frequently.
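To see why bucket size matters, here is a small, self-contained sketch with made-up latency numbers: a single bad 5-minute round stands out at fine granularity but is heavily diluted in an hourly average.

```python
# A hypothetical hour of latency samples (ms), one per 5-minute round,
# where a single round spikes to 900 ms.
samples = [50] * 12
samples[6] = 900  # one bad 5-minute round

five_min_worst = max(samples)               # what a 5-minute bucket surfaces
hourly_avg = sum(samples) / len(samples)    # what a 1-hour bucket shows

print(five_min_worst)          # 900: the spike is obvious
print(round(hourly_avg, 1))    # 120.8: the spike is diluted
```

Eleven healthy rounds at 50 ms pull the hourly average down to about 121 ms, which a viewer could easily dismiss, while the 5-minute view makes the 900 ms incident unmistakable.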
Use intuitive widgets
Why: Some widget types are harder to interpret. In particular, the Box and Whiskers widget may be less intuitive for a generalist.
Do this instead: Use a Multi-Metric Table. In most cases, the data in a Box and Whiskers widget can be easily represented in a multi-metric table. If you do choose a different widget, make sure it serves your audience’s requirements.
Separate your agents if using maps
Why: Metrics like packet loss can be difficult to interpret on a map widget when multiple agents are in close proximity. The agents aggregate on the map into a single location circle, and the warning colors aggregate too, making the metric less useful on a map.
Do this instead: Separate out the agents you’re including on the dashboard so that the map shows one agent per location, or so that there is parity among the locations shown.
Don’t try to use every available widget
Why: Too many widget types are taxing to interpret, creating cognitive overload for the user.
Do this instead: Often the most convincing stories are built using 2-3 widget types at most (number, color, and time series are quite commonly used).
Stack correlatable metrics for application and network
Why: Quick visual comparison to see if an issue or outage is application- or network-focused.
What to do: Stack the widgets top/bottom or side by side.
Choose ascending or descending sort order by metric
Why: You always want to see the problem site/test first, at the top. In cases where you’re comparing application and network metrics, those sorts will be opposite.
What to do: Use HTTP server tests that include both network and HTTP view layers. In the dashboard, place two widgets for the same set of tests side by side. Use one widget to show network packet loss, sorted in descending order. The other widget can show HTTP server availability, sorted in ascending order.
Limit the amount of data
Why: Too much data (tests or agents) can obscure the top violators. It doesn’t always make sense to see data from every test or agent.
What to do: Limit data using the widget’s “limit to” function when reporting on a large number of tests or agents (for example, a single test that has many agents assigned). Use sorting in tandem to see things like the top 10 violators, based on the selected metric. Try to include sources more likely to show a top violator, based on past issues or patterns of failure.
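The “limit to” plus sort combination amounts to a top-N selection. As an illustration with hypothetical per-agent packet loss percentages:

```python
# Hypothetical packet loss readings (%) from a single test's agents.
# The dashboard's "limit to" option plus descending sort does this
# selection for you; here is the idea in miniature.
readings = {
    "Tokyo": 0.2, "London": 6.5, "Chicago": 0.0, "Sydney": 12.1,
    "Frankfurt": 0.1, "Mumbai": 4.8, "Sao Paulo": 0.3,
}

TOP_N = 3
top_violators = sorted(readings.items(), key=lambda kv: kv[1], reverse=True)[:TOP_N]
print(top_violators)  # [('Sydney', 12.1), ('London', 6.5), ('Mumbai', 4.8)]
```

The healthy agents drop out of view entirely, which is the point: the widget spends its limited space on the locations that need attention.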
Create alerts that reflect Service-Level Agreements (SLA)
Why: This is one way to report on SLA compliance: measure the amount of time there wasn’t an active alert. If the alert inactive time is 99%, the SLA was met 99% of the time.
What to do: Create alerts that trigger when an SLA is violated. Then create a dashboard that reports on the percentage of inactive time of these alerts.
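The arithmetic behind such a report is simple. As a sketch with hypothetical incident durations (and assuming the alert windows don’t overlap; overlapping windows would need to be merged first):

```python
# SLA attainment as the percentage of a reporting window with no active
# SLA-violation alert. Durations below are hypothetical examples.
window = 30 * 24 * 3600          # 30-day reporting period, in seconds

alert_durations = [3600, 1800, 7200]  # three incidents, in seconds

active = sum(alert_durations)         # total time an alert was active
sla_pct = 100 * (window - active) / window
print(round(sla_pct, 3))  # 99.514
```

Roughly 3.5 hours of alert-active time over 30 days still clears a 99% SLA, but would miss a 99.95% target; picking the target before building the dashboard keeps the story actionable.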
Limit BGP monitors in dashboards
Why: Given the sheer volume of BGP monitors, including them all would make the dashboard an infinite scroll.
Do this instead: For BGP dashboards, include data from a maximum of 8-10 BGP monitors.
Include two dashboard widgets side by side, one for path changes, and one for reachability. Sort your data in ascending order for reachability, and descending order for path changes. Showing 8-10 monitors should be enough to conclude whether the issue is closer to the origin ASN or upstream provider.
Choose short test names
Why: Some dashboard widget types have to truncate names due to space limitations, so having the most important part of the name first helps identify the test.
What to do: For long test names, have the most important or unique part of the name first. It is recommended to keep names short; however, every organization is unique and there may be other conditions that require more elaborate naming conventions.
Dashboard Templates in Use
The following examples show some dashboard use cases, based on ThousandEyes dashboard templates. See Using the Dashboard Templates for details.
SaaS Health
Web Server Health
Web Application Health
The Situation: A multinational bank headquartered in the United States has already deployed ThousandEyes network assurance in data centers, ATMs, and consumer branch offices. The bank runs a financial services portal that includes online banking, loan applications and payments, several mobile apps, and a wealth management division.
The Internal Helpdesk team uses the SaaS Health dashboard to monitor HTTP server availability for critical business applications like ADP or Workday. The dashboard is intended to be used with HTTP server tests on cloud applications that the organization doesn’t own.
The Customer Experience team uses the Web Server Health dashboard to monitor the availability of the financial services portal from various locations around the world. It’s similar to SaaS Health except that the HTTP server tests point to their own application servers.
The Portal Application team uses the Web Application Health dashboard to test web page loading for their own financial services portal. They are responsible for actual functionality, not just server availability, and they want to know if all the components are loading on the home page. This time, the team needs to know the full web page load experience, so an HTTP server test is not enough. Therefore, this dashboard is designed for use with page load tests.
Web application health is more complex to test than simple server availability, because web page components can include JavaScript, chatbots, or font servers. For example, a JavaScript change could dramatically increase the page load time from 2 seconds to 30 seconds, or a revenue-generating third-party ad server could fail to load.
Realistically, a team that monitors web application health would also be likely to use an API Health dashboard with API tests that monitor API endpoints critical to their primary web application.