Multi-Path Regular Expressions (MPRegex) Language

The Multi-Path Regular Expression engine (MPRegex) is designed to aid in developing systems which extract collections of useful features from streams of data at high speeds.

There are a few key differences between MPRegex and more traditional regular expression implementations:

Atoms

Atoms are the smallest unit of a regular expression. Single characters are atoms, and so are character classes.

In any expression an atom represents a single character matched in the data being analyzed. In the case of an ordinary character, the atom matches one character in the data that is identical to the one character in the expression. In the case of a character class, the atom matches one character in the data that is equivalent to any of the characters in the character class.

Characters

Characters are the most basic element of MPRegex syntax. With the exception of meta characters or characters that have been redefined, each character matches itself in a given pattern.

When a meta character must be used literally in a pattern, it can be included by "escaping" it with the backslash ( \ ) meta character.

When a character that is not in the current character set must be represented in a pattern, it can be written as an ampersand ( & ) followed by the numeric code for the character in either decimal or hexadecimal, ending with a semicolon. For example, the ASCII carriage return character can be written as &13; in decimal or &x0d; in hexadecimal.
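As a rough illustration of the numeric form, the following Python sketch expands such references into the characters they name. It is not part of MPRegex: the function name expand_numeric_refs is hypothetical, and the sketch assumes well-formed references and ignores backslash escaping.

```python
import re

# Matches &<decimal>; or &x<hex>; as described in the text above.
_NUMERIC_REF = re.compile(r"&(x[0-9a-fA-F]+|[0-9]+);")


def expand_numeric_refs(pattern):
    """Replace each numeric character reference with the character it names."""

    def _decode(match):
        body = match.group(1)
        if body.startswith("x"):
            # Hexadecimal form, e.g. &x0d;
            return chr(int(body[1:], 16))
        # Decimal form, e.g. &13;
        return chr(int(body, 10))

    return _NUMERIC_REF.sub(_decode, pattern)


print(repr(expand_numeric_refs("&13;")))
print(repr(expand_numeric_refs("&x0d;&x0a;")))
```

Both calls above yield a carriage return; the second also yields the line feed that follows it.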

In MPRegex you can also represent a character with a friendly mnemonic by using a named character class. Named character classes are covered later, but as a preview, an example might be: `[cr]

Meta Characters

The following is a list of meta-characters in order of precedence, their meanings and uses:

` Backtick is used to invoke special command functions and features such as defining or invoking named functions, expressions, quantifications, character classes, and definitions.

\ Backslash is used to escape meta characters and redefined characters so that they are interpreted literally. (A redefined character is a named character class where the name is one character long; more on this later.) Unlike conventional regular expressions, in MPRegex the backslash character has no other special meanings. It always means: match the next character literally.

& Ampersand is used to insert a numerical character description in either decimal or hexadecimal notation.

[ Open square bracket is used to mark the start of a character class or subclass. It is paired with the ] which marks the end of a character class or subclass.

{ Open curly brace is used to delineate the start of a quantification. It is paired with the } which is used to mark the end of a quantification.

( Open parenthesis is used to delineate the start of an expression or sub-expression. It is paired with the ) which is used to mark the end of an expression or sub-expression.

! Exclamation point (bang) is used to describe an alternate path that should not be followed. (more later)

| The vertical bar (pipe) is used to describe an alternate path that should be followed.

and sometimes, depending on context:

- The hyphen is used inside character class definitions to indicate ranges of characters.

, The comma is used inside quantifications to mark the boundary between the minimum number and the maximum number.

: The colon is used in definitions to mark the end of the name and the start of the expressions that make up the definition.

; The semicolon is used to mark the end of a definition when declaring a definition and the end of a definition name when invoking a definition.

< The left angle bracket (less than) is used in functions to mark the end of the name and the start of the parameters. It is paired with the right angle bracket >.

Character Classes [ ]

Character classes are used to define a single atom that may match more than one character. A character class may be used anywhere a single character can be used and since it is an atom it will only match one character in the data. Some examples of character classes are:

[abcdef] Which represents any one of a, b, c, d, e or f. This can also be written using a hyphen as [a-f]. The hyphen has a special meaning within a character class indicating "all of the characters between". This means that we can write character classes such as:

[0-9] Which represents all the digits 0 through 9.

[0-9a-f] Which represents all of the hexadecimal digits (lower case).

In MPRegex, a great deal of effort has been spent on making sure the syntax is self-consistent so that it remains intuitive. As a result there are a number of things that we can do with MPRegex that cannot be done with more conventional regular expression languages.

For example, in MPRegex we can also use the hyphen at the beginning or end of a character class to represent all characters less than a given character as in [-0] meaning all characters less than 0, or all characters greater than a given character as in [z-] which means all characters greater than z.
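The range rules above can be summarized in a few lines of Python. This is only a sketch: the helper name in_range is hypothetical, and it assumes characters are compared by their numeric code, which the text does not state explicitly.

```python
def in_range(ch, low=None, high=None):
    """Model the hyphen inside a character class.

    [a-f] -> in_range(ch, "a", "f")   inclusive at both ends
    [-0]  -> in_range(ch, high="0")   strictly less than 0
    [z-]  -> in_range(ch, low="z")    strictly greater than z
    """
    if low is None and high is None:
        raise ValueError("a range needs at least one end")
    if low is None:
        return ch < high
    if high is None:
        return ch > low
    return low <= ch <= high


print(in_range("c", "a", "f"))
print(in_range("/", high="0"))
print(in_range("z", low="z"))
```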

Another difference is that the meta characters are ALWAYS meta characters, even inside character classes, so they must be escaped if they are to be used literally. At first this seems inconvenient, but after a while it becomes clear why this is useful. For example, it allows us to build MPRegex classes out of nested subclasses, which can make expressions clearer: [0-9a-f] could be rewritten as [[0-9][a-f]].

Also, where conventional regular expressions use ^ to indicate the beginning of a line and then reuse the ^ in character classes to indicate characters that are excluded, MPRegex uses ! which always means "not included" whether it is used in character classes or in grouped expressions.

As a result we can write [[a-z]![mnop]] to indicate all lower case letters of the alphabet except m, n, o, or p.

It is also possible to write the | character between included groups of characters; however, that is usually not needed because the | is implied. To illustrate, we could rewrite [0-9a-f] as [0-9|a-f] without changing the meaning.
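One way to picture nested classes, exclusion with !, and the implied | is as set operations. The Python sketch below models a character class as a set of characters; the variable names are illustrative, and this says nothing about how an MPRegex engine actually stores classes.

```python
import string

digits = set(string.digits)            # [0-9]
hex_letters = set("abcdef")            # [a-f]
lower = set(string.ascii_lowercase)    # [a-z]

# [0-9a-f], [[0-9][a-f]] and [0-9|a-f] all describe the same union.
hex_digits = digits | hex_letters

# [[a-z]![mnop]] is the lower case letters with m, n, o and p removed.
lower_except_mnop = lower - set("mnop")

print("".join(sorted(hex_digits)))
print("".join(sorted(lower_except_mnop)))
```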

Named Things

You might wonder why we would stop here to talk about naming things (element naming). The reason is that it's time to talk about named character classes; but while we're at it we're going to introduce the idea of named expressions and named quantifications at the same time!

In MPRegex we have the ability to label character classes, expressions, and quantifications with convenient mnemonics. For example, you might create the following mnemonics for common character classes:

We can define `digit[0-9] after which we can use `[digit] where conventional regex syntax might use [[:digit:]]

We can define `lower[a-z] after which we can use `[lower] where conventional syntax might use [[:lower:]]

We can define `upper[A-Z] after which we can use `[upper] where conventional syntax might use [[:upper:]]

Another special feature of element naming is that any single character name can be invoked using just that character for as long as that expression is in scope. (More on expressions when we talk about parentheses.)

In other words, if you wanted to redefine the lower case letter a to be case insensitive you would define the character class `a[aA]. After that any use of the character a would match both lower ( a ) and upper ( A ) cases.
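To make the effect of a redefined character concrete, the Python sketch below translates a pattern of plain characters into a conventional regular expression, substituting a character class wherever a single-character name has been defined. The function translate and the redefs mapping are hypothetical, and the sketch ignores meta characters and scoping entirely.

```python
import re


def translate(pattern, redefs):
    """Build a Python regex where each redefined character becomes a class."""
    parts = []
    for ch in pattern:
        if ch in redefs:
            # A redefined character still matches exactly one character.
            parts.append("[" + re.escape(redefs[ch]) + "]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


# Equivalent in spirit to defining `a[aA] before the pattern.
regex = translate("banana", {"a": "aA"})
print(regex)
print(bool(re.fullmatch(regex, "bAnanA")))
print(bool(re.fullmatch(regex, "Banana")))
```

Only the redefined letter becomes case insensitive; the leading b in the example still has to be lower case.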

Note: It is possible to invoke definitions like this "out of band" so that they don't have to be created within the expressions themselves. This helps to keep the syntax tidy and helps promote reuse. Note also that named expressions have a scope that is limited to the expression within which they are defined. This allows redefined characters and other named expressions to "revert to normal" once they have served their purpose. More on that later.

Parentheses ( )

MPRegex syntax is broken down into expressions and sub-expressions. More precisely, expressions in MPRegex can be made up of atoms and other expressions.

Parentheses define the boundaries for all expressions. In fact, any MPRegex expression has an implied set of matching parentheses surrounding it. The opening parenthesis defines the entry point of an expression and the closing parenthesis defines the exit point of an expression. This allows all expressions and sub-expressions to be more easily treated like modules of code that can be reused in other expressions, related between expressions, or conserved by the implementation's engine if so desired.

Expressions can be named just like character classes. For example we might define a named expression at the beginning of an expression and then use it several times as in the following example:

(Converting `money(yen|dollars|pounds) to `(money) or `(money))

which matches the text

Converting dollars to pounds or yen
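For readers more familiar with conventional engines, the same idea can be approximated in Python by building the pattern from a reusable fragment. This is an analogy rather than MPRegex syntax; the names MONEY and CONVERSION are chosen here only to mirror the example.

```python
import re

# Plays the role of the named expression `money(yen|dollars|pounds).
MONEY = r"(?:yen|dollars|pounds)"

# Each use re-applies the sub-pattern; it does not have to match the same text.
CONVERSION = re.compile(rf"Converting {MONEY} to {MONEY} or {MONEY}")

print(bool(CONVERSION.fullmatch("Converting dollars to pounds or yen")))
print(bool(CONVERSION.fullmatch("Converting euros to pounds or yen")))
```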

It's easy to see how named expressions can clarify MPRegex syntax but they can also be used to compare segments of data. For example the expression:

(`FirstDigit(#)##-####, `<FirstDigit>##-####)

would match

555-1234, 543-9876 and 123-4567, 199-2941

but not 123-4567, 987-5432

To simplify this explanation we've assumed that # has been defined as a wildcard for decimal digits as in `#[0-9]. MPRegex does not define any wildcards by default. However, a set of wildcards is listed later, and those wildcards have been assumed in the examples provided here.
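In conventional regular expressions, the comparison shown above is usually written with a back-reference. The Python sketch below reproduces the three sample inputs; it is an analogy under the assumption that # stands for [0-9], not a description of how MPRegex evaluates the expression.

```python
import re

# The first digit is captured once, then required again after the comma.
PHONE_PAIR = re.compile(
    r"(?P<FirstDigit>[0-9])[0-9]{2}-[0-9]{4}, "
    r"(?P=FirstDigit)[0-9]{2}-[0-9]{4}"
)

samples = ("555-1234, 543-9876", "123-4567, 199-2941", "123-4567, 987-5432")
for sample in samples:
    print(sample, bool(PHONE_PAIR.fullmatch(sample)))
```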

Another application of named expressions is the extraction of data for an application. A discussion of this is beyond the scope of this document, but the gist of it is that the MPRegex engine can be used to identify the index and endex of every expression (named or not) within a given stream of data. The calling application can then extract just the parts that it needs without spending a lot of additional effort on parsing the data stream.
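The idea of reporting where each expression starts and ends has a rough counterpart in Python's re module, where a match object exposes the span of every group. The sketch below is only an analogy: the helper group_spans is hypothetical, and it assumes that "endex" means the position just past the last matched character.

```python
import re


def group_spans(pattern, data):
    """Return {group name: (index, endex)} for the first match, or {} if none."""
    match = re.search(pattern, data)
    if match is None:
        return {}
    return {name: match.span(name) for name in match.groupdict()}


spans = group_spans(r"(?P<area>[0-9]{3})-(?P<line>[0-9]{4})", "call 555-1234 now")
print(spans)

# The caller can slice out just the parts it needs.
data = "call 555-1234 now"
start, end = spans["line"]
print(data[start:end])
```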

Just as in character classes, we can name an expression with a single character. For example, consider the named expression `W(${1.}). Assume that $ has been defined to represent any letter. In that case W can be used to represent any string of letters, perhaps a "word". A four-word sentence might then be written as:

(W W W W.)

For a more useful example, consider `^(&x0d;{0,1}&x0a;) which is an expression that allows us to use the ^ to represent the beginning or end of any line.

What is important about these two examples is that the patterns matched by the redefined character can have variable lengths in the data stream. When we redefine a character using a named character class we still always have a single character match. When we redefine a character using a named expression we can use that character to match any expression.
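The contrast between the two kinds of redefinition can be seen by writing rough equivalents in Python. The names WORD, LINE_BREAK and FOUR_WORDS are illustrative, and the sketch assumes that the quantification in the W example means "one or more".

```python
import re

WORD = r"[a-zA-Z]+"      # like W: one or more letters, so variable length
LINE_BREAK = r"\r?\n"    # like ^: optional carriage return, then line feed

# Counterpart of (W W W W.) with a literal period at the end.
FOUR_WORDS = re.compile(rf"{WORD} {WORD} {WORD} {WORD}\.")

print(bool(FOUR_WORDS.fullmatch("The quick brown fox.")))
print(re.split(LINE_BREAK, "one\r\ntwo\nthree"))
```

A single symbol in the pattern stands for one or two characters in the line break case and for any number of letters in the word case, which is exactly what a named character class cannot do.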

Quantification { }

Path Selection

Named Definitions:;

Wildcards

By default MPRegex does not define any wildcards! It is assumed that each implementation will create a set of wildcards that suit the application. That said, the following default wildcards would probably make sense for many applications and have been used in the examples in this document:

`#[0-9] for decimal digits

`$[a-z|A-Z] for letters

`^(&x0d;{0,1}&x0a;) for new line (start or end of a line)