POSIX Extended Regular Expression Syntax
ThousandEyes uses the POSIX Extended Regular Expression (ERE) syntax, as implemented by Unix utilities such as awk and egrep, for content checking in ThousandEyes HTTP Server tests and other Web Layer tests that use the HTTP Server view. The instructions and examples below are intended to help you create regular expressions for the Verify Content feature of HTTP Server tests.
A regular expression is a pattern that is evaluated against a target. The evaluation of a regular expression returns either a match or no match.
Regular expressions contain literals, special operators, and boundaries, which are combined to make patterns. We'll run through a quick synopsis and example of each below.
Standard characters match themselves (literal matches). Additionally, there are special sequences for matching digits, words, and other special characters. You can use a list of characters to match, or specify a class of characters. Each expression matches exactly once unless it is followed by a repeater.
A pattern is a group of characters. A simple pattern can be a single character or a character class, along with special operators, such as repeaters or alternates. The following examples show some of these simple patterns:
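As a rough sketch of such simple patterns, the snippet below uses Python's `re` module as a stand-in for a POSIX ERE engine. This is an assumption: the two agree on these basic constructs but are not identical in general. The sample text is made up for illustration.

```python
import re

# Hypothetical sample content; not taken from any real page.
text = "Order 66 shipped"

# A literal pattern matches itself.
literal = re.search(r"shipped", text) is not None

# A bracket list matches exactly one character from the list.
one_digit = re.search(r"[0-9]", text).group(0)

# A repeater after the list lets it match more than once.
all_digits = re.search(r"[0-9]+", text).group(0)

print(literal, one_digit, all_digits)
```

Without the repeater, the bracket list consumes a single digit; with `+`, it consumes the whole run of digits.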
Patterns can be grouped or nested using parentheses, can contain repeaters, alternates, or negation, and can be tied to a boundary by an anchor:
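A minimal illustration of grouping, nesting, and anchoring, again using Python's `re` module as an assumed approximation of POSIX ERE behavior. The patterns and targets are invented for this sketch.

```python
import re

# Group "ab", repeat the group one or more times, and anchor it to both ends.
repeated_group = re.compile(r"^(ab)+$")

# A nested group: "a" followed by either "b" or "c", repeated.
nested_group = re.compile(r"^(a(b|c))+$")

print(bool(repeated_group.search("ababab")))  # only repetitions of "ab"
print(bool(repeated_group.search("ababa")))   # trailing "a" breaks the pattern
print(bool(nested_group.search("abac")))      # "ab" then "ac"
print(bool(nested_group.search("abad")))      # "ad" is not allowed
```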
The following characters are special within regular expressions:
. [ ] \ ( ) * + ? { } | ^ $
Matching these characters literally requires escaping using the backslash character. For example, the . (dot) character matches any character, by default. To search for a literal dot character, you must escape it:
Similarly, to search for a backslash character, specify two consecutive backslash characters (\\) in your expression.
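The effect of escaping can be checked with a short sketch. Python's `re` module is used here as an assumed stand-in for a POSIX ERE engine, and the strings are hypothetical. Raw strings (`r"..."`) keep Python itself from interpreting the backslashes.

```python
import re

version = "v1.5"
lookalike = "v1x5"

# Unescaped dot: matches any single character, so both strings match.
loose = re.compile(r"1.5")
# Escaped dot: matches only a literal ".".
strict = re.compile(r"1\.5")

loose_hits = [bool(loose.search(s)) for s in (version, lookalike)]
strict_hits = [bool(strict.search(s)) for s in (version, lookalike)]

# Two backslashes in the pattern match one backslash in the target.
path = "C:\\temp"  # the seven-character string C:\temp
backslash_hit = bool(re.search(r"C:\\temp", path))

print(loose_hits, strict_hits, backslash_hit)
```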
Any pattern within a regular expression can be repeated. There are several types of repeat options:
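The sketch below exercises the standard ERE repeaters: `*` (zero or more), `+` (one or more), `?` (zero or one), `{n}` (exactly n), and `{n,m}` (between n and m). It uses Python's `re` module as an assumed approximation of a POSIX ERE engine; the helper name `matches` is hypothetical.

```python
import re

def matches(pattern, target):
    """Return True if the entire target matches the pattern."""
    return re.fullmatch(pattern, target) is not None

star = matches(r"ab*c", "ac")            # * accepts zero "b" characters
plus = matches(r"ab+c", "ac")            # + needs at least one "b"
optional = matches(r"ab?c", "abc")       # ? accepts zero or one "b"
exact = matches(r"ab{2}c", "abbc")       # {2} needs exactly two
bounded = matches(r"ab{2,3}c", "abbbbc") # {2,3} rejects four

print(star, plus, optional, exact, bounded)
```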
A match can be created by not matching a list of characters (for example, [A-Za-z]). Negate the list by placing a ^ (caret) character as the first character inside the brackets, such as:
[^A-Za-z]
The above expression will match any single character that is not in the list of alphabetic characters (upper- and lowercase).
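A short check of a negated list, using Python's `re` module as an assumed stand-in for POSIX ERE. The sample strings are made up.

```python
import re

non_alpha = re.compile(r"[^A-Za-z]")

# Collect every character that falls outside A-Z and a-z.
outside = non_alpha.findall("abc123!")

# A purely alphabetic target gives the negated list nothing to match.
all_alpha = non_alpha.search("abcXYZ")

print(outside, all_alpha)
```

Note that a negated list still has to match one character, so it does not match an empty target.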
Alternation refers to using OR logic. To alternate between two sets of characters, use the '|' (pipe) character.
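A minimal sketch of alternation, with Python's `re` module assumed as an approximation of POSIX ERE. Python and POSIX can differ in which alternative they report when several could match, but the match or no-match outcome shown here is the same. The words are arbitrary examples.

```python
import re

# Either whole word satisfies the pattern.
either = re.compile(r"cat|dog")

# Parentheses limit the alternation to a single position.
grouped = re.compile(r"gr(a|e)y")

print(bool(either.search("hot dog")))     # contains "dog"
print(bool(either.search("bird")))        # contains neither
print(bool(grouped.fullmatch("grey")))    # "e" is one of the alternates
print(bool(grouped.fullmatch("groy")))    # "o" is not
```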
There are two types of boundaries which can be specified in a regular expression: line and buffer boundaries.
Buffer (Input) Boundary
A buffer boundary refers to the entire content retrieved, that is, the entire page retrieved during an HTTP Server test.
Line Boundary
A line boundary refers to content found on an individual line of input. On most systems, a line boundary is marked by a newline character within the input. When working with line boundaries, if you're only interested in a line that exactly matches some text, use the caret (^) character to specify the beginning of a line and the dollar sign ($) character to specify the end of a line.
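The difference between the two boundary types can be sketched in Python, whose `re` module is assumed here as a stand-in for POSIX ERE. By default Python anchors `^` and `$` to the whole buffer; the `re.MULTILINE` flag makes them anchor at each line, which approximates how a line-oriented tool such as egrep behaves. How a given ThousandEyes test treats newlines is not covered by this sketch, and the response body is hypothetical.

```python
import re

# Hypothetical three-line response body.
body = "status: ok\nstatus: ok (cached)\nretry status: ok"

# Line boundaries: ^ and $ anchor at the start and end of each line.
exact_line = re.compile(r"^status: ok$", re.MULTILINE)
line_hits = exact_line.findall(body)

# Buffer boundaries: the same pattern must now span the entire content.
buffer_hit = re.search(r"^status: ok$", body)

print(line_hits, buffer_hit)
```

Only the first line consists of exactly `status: ok`, so it is the only line-level hit, and the buffer-level search finds nothing because the content continues past that text.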
\b matches a word boundary (the beginning and end of a word).
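A brief sketch of `\b`, using Python's `re` module as an assumed stand-in. Note that `\b` is a widely supported extension (GNU grep accepts it, for example) rather than part of the POSIX ERE specification itself. In Python, the pattern must be a raw string, because `"\b"` in an ordinary string is a backspace character.

```python
import re

whole_word = re.compile(r"\bcat\b")

print(bool(whole_word.search("the cat sat")))  # "cat" stands alone
print(bool(whole_word.search("concatenate")))  # "cat" is inside a longer word
```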
Web pages utilizing Unicode characters typically use UTF-8 encoding. UTF-8 encodes each character as a sequence of one to four bytes. To match a UTF-8 encoded character, you need to match the byte representation of the character. For example, the Unicode character ž is UTF-8 encoded as the two-byte sequence C5 BE. To match it with a regular expression, use \xC5\xBE.
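The byte-level match can be reproduced in Python by searching encoded bytes with a bytes pattern. This is only a sketch with Python's `re` module standing in for the ERE engine; the page text is invented.

```python
import re

# Confirm the UTF-8 byte sequence for the character.
char_bytes = "ž".encode("utf-8")
hex_bytes = " ".join(f"{b:02X}" for b in char_bytes)

# Hypothetical page content, encoded as it would arrive over the wire.
page = "Cena za žeton".encode("utf-8")
plain = "Cena za zeton".encode("utf-8")

# A bytes pattern matches the raw byte representation.
pattern = re.compile(rb"\xC5\xBE")

print(hex_bytes, bool(pattern.search(page)), bool(pattern.search(plain)))
```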
Who remembers the BEDMAS acronym from 8th grade math class, signifying order of operations in basic arithmetic? (Bracket, Exponent, Division, Multiplication, Addition, Subtraction). Yes, as with most computing technology, we need to be aware of order of operations, aka operator precedence.
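In ERE, repetition binds more tightly than concatenation, which in turn binds more tightly than alternation, and parentheses override both. The sketch below checks this with Python's `re` module, assumed here as an approximation of POSIX ERE; the helper name `full` is hypothetical.

```python
import re

def full(pattern, target):
    """Return True if the entire target matches the pattern."""
    return re.fullmatch(pattern, target) is not None

# Repetition before concatenation: ab+ means a(b+), not (ab)+.
rep_plain = (full(r"ab+", "abb"), full(r"ab+", "abab"))
rep_grouped = full(r"(ab)+", "abab")

# Concatenation before alternation: ab|cd means (ab)|(cd), not a(b|c)d.
alt_plain = (full(r"ab|cd", "ab"), full(r"ab|cd", "acd"))
alt_grouped = full(r"a(b|c)d", "acd")

print(rep_plain, rep_grouped, alt_plain, alt_grouped)
```

When in doubt, adding parentheses makes the intended grouping explicit.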
Regular expressions are case-sensitive. "This is a Test" is not equivalent to "this is a test".
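Case sensitivity, and one way to tolerate either case with bracket lists, can be sketched as follows. Python's `re` module is assumed as a stand-in for POSIX ERE.

```python
import re

target = "This is a Test"

exact = re.search(r"This is a Test", target) is not None
lower = re.search(r"this is a test", target) is not None

# A bracket list accepts either case for the selected letters.
either = re.search(r"[Tt]his is a [Tt]est", target) is not None

print(exact, lower, either)
```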
Because a ThousandEyes Web Layer test is reading content from web servers, be aware of any HTML markup in the target you're attempting to match with the regular expression. Use the View Source option in your browser to check the content for HTML characters. For example, the following string:
could be represented by the following HTML:
which would NOT match the following regular expressions:
foo bar
foo.bar
foo\sbar
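To illustrate the point, the sketch below assumes a hypothetical case: text that a browser renders as "foo bar" but whose source uses a non-breaking space entity. The HTML fragment is invented for this example, and Python's `re` module stands in for the ERE engine.

```python
import re

# Hypothetical page source; renders as "foo bar" in a browser.
html = "<p>foo&nbsp;bar</p>"

# None of these account for the six-character entity between the words.
naive_patterns = [r"foo bar", r"foo.bar", r"foo\sbar"]
naive_hits = [bool(re.search(p, html)) for p in naive_patterns]

# Matching the markup as it appears in the source does work.
source_hit = bool(re.search(r"foo&nbsp;bar", html))

print(naive_hits, source_hit)
```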
We strongly recommend validating your regular expression prior to creating your test, particularly if assigning a content Alert Rule to the test.
$ curl -s target_web_page_URL | egrep 'pattern'
If the command displays output, the lines matching the pattern are returned. If no output is displayed, there was no match.
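The same line-by-line check can be approximated in Python when egrep is not available. This is a sketch under stated assumptions: the function name `matching_lines` is hypothetical, Python's `re` module is not a POSIX ERE engine (it agrees on basic constructs only), and the page content is a made-up string rather than a live response.

```python
import re

def matching_lines(pattern, body):
    """Return the lines of body that contain a match, as egrep would print them."""
    compiled = re.compile(pattern)
    return [line for line in body.splitlines() if compiled.search(line)]

# Hypothetical page content standing in for the output of curl -s.
body = "<html>\n<title>Welcome</title>\n<p>Login required</p>\n</html>"

found = matching_lines(r"<title>.+</title>", body)
missing = matching_lines(r"Sign in", body)

print(found, missing)
```

An empty list corresponds to egrep printing nothing, which means there was no match.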