Regular Expression Parsing (Regex)

I Wrote this to support my article on geolocation of IP addresses which is useful for security professionals when making use of published blacklists or when analysing security logs.


Regular expressions are patterns. These patterns match text.

I will display regex patterns in bold.

The simplest kind of pattern is called a 'literal' and here is an example:

Age

It's not clever but it's very specific. It will only match the word 'Age' including letter case. You might use this to find what is called a 'delimiter' within a structured collection of data. An example set of structured data could be:

name: Bob, Age 28

Bob is considered "data" as is 28, but the rest of the stuff makes up the structure. In particular, spaces and commas, colons are delimiters because they surround the individual parts that we call data. In general, delimiters are consistent throughout an entire set of structured lines while the data is independently variable.

In the example, name is also a delimiter, and it marks the type of data to follow. In this case, because we know what name means, we expect the data to follow is a proper-noun. In the second case, Age describes an integer greater than or equal to zero. It's upper limit is context dependent as we could be talking about people or rocks or butterflies. This level of sophistication is beyond control of regular expression parsing.

Parsing is what we call the process of breaking structured data into delimiters, type declarations and data.

These type declarations (which are also delimiters) and the punctuation like spaces and commas are both literals because they appear literally in the source text.

So our simple regex pattern

name:

could be used to match the first type declaration, including the comma.

Imagine an invisible pointer has moved to point to the space immediately following the literal name: once we apply the regex pattern name: to the string.

If we could then skip past that space, and gather up all letters that follow it (until the next delimiter which happens to be a comma), then we will have extracted the piece of data: Bob

It is this ability to scan and extract data that makes regular expressions very useful.

Of course, we could match Bob with another literal pattern Bob but that's not very useful... unless everyone is called Bob.

So we need to find a regex pattern that matches any letter. At first, let's assume that all the data is in lower case to make this easier.

name: bob, Age 28

We could use a pattern like name: [a-z]+

and our invisible pointer would now be advanced to point exactly on the comma following the word bob. The brackets [] group literals as options into one pattern. In this case the literals are a b c d e f g h i j k l m n o p q r s t u v w x y z but that's boring to type, and inefficient so the shorthand uses the minus-sign to indicate a range from a to z. The + after the last bracket means "Let's allow this multiple times" and it is known as a metacharacter. It would also work for

name: bill, Age 28

but not

name: Bob, Age 28

because the last example uses a capital letter and we need a regex pattern that includes upper case letters.

Regex comments

When supported, try to comment your regex because otherwise it can get hard to read for a new programmer who happens upon it.

The syntax is: (?#Your Comment)

You can use anything after the # except the right parenthesis which closes the comment.

Free-spacing

You will quickly notice that regex gets all bunched up because spaces are taken as literal patterns to use a search criteria. It's possible to switch to a mode where these spaces are not processed and that can make your regex easier to read. This is called free-spacing.

The mode switch is (?x) within a regex pattern, or it could be a library-call or tickbox etc depending on the engine and language in use. When free-spacing is in play, # starts a comment

# This is my comment
(a|b)            # Look for a or b
cat              # followed by the literal word "cat"


To match spaces, use character classes, \s or escape them.

Options

? means the preceding regex token is optional.

() collects tokens into one container so that is applied to all of them.

e.g. 1(23)? matches either 1 or 123.

* means to match the preceding token ZERO or more times.

+ means to match the preceding token ONE or more times.

a{2,5} means to match the letter a twice, three times, four times or five times. Therefore a on it's own would not match, and neither would aaaaaa because there are too many.


Character sets.

Just like in set-theory, you can think of a pattern like [abcdef] as a set. This pattern would match ONE a or ONE b or ONE c but would not match a k or a z.

Ranges can represent one element in the set and this allows you to concatenate them as in:

[a-z0-9A-Z]

and this would match any single number or upper case letter or lower case letter.

A powerful modification is performed with the caret symbol as the first in the set. This negates the match so that

[a-z0-9A-Z] returns everything that is NOT a letter or number. Sometimes it's easier to specify what you don't want to match compared to what you do want to match. In particular, if you have a small finite, well defined idea of what you don't want to match, then everything else is, by definition what you do want to match. It could well be, that there is no easy way to know what you do want to match, or perhaps the set is too large and complex to easily specify.

When inside [ ], there are only four special characters and these are \ ] ^ -

You don't need to escape the other special metacharacters listed below when inside square brackets.

You don't need to escape the caret unless it's at the start of the regex set. So make a habbit of placing it somewhere else if you want it as a literal element.

Similarly, a closing bracket as the first character does not need to be escaped since the regex engine is smart enough to know that the null-set would be useless anyway. So

[]abc] is the same as [ab\]c]

The range-token (The hyphen) needs a parameter to its left and right to make a valid range and therefore you can get away without escaping it if you put is at the beginning or the end as in these two examples:

[-abc]

[abc-]

These tricks/rules lead to some curious behaviour as in the comparison between the patterns

[-^] and [^-]

They are very similar but have incredibly different behaviour. The first case would match any hyphen or caret, while the second would match all characters except a hyphen.

The pattern [^-a] works as expected. It will match any character that is not an a or a hyphen. This is because the 'range' ^-a would be nonsense. Ranges only work on what are called 'ordinals' which is a fancy way to say that they have a position in a well defined sequence.

Metacharacters

'Meta' means 'about', so a metacharacter is a character that means something about characters. It's a character with special meaning. The + in the example above can be thought of as a computing procedure. In this case, it means to find several matches of the preceding pattern. It stops as soon as there are no more matching characters in the parsed data string.

The immediate question that comes to mind is, "How do you match + ?" To do this we can precede it by another metacharacter called an escape-character. This is a backslash. \

... and then you wonder how to literally match the backslash. Well, you simply escape that as well. Therefore, to match a backslash, the literal is actually \\

Luckily, this apparent nonsense is limited to one escape of the escape character. Unluckily I just lied to you because you might be processing a regex pattern using a computing program that passes it down to another level and by doing that would lose the escaping that you prepared. You can get around that by using four backslashes. But that's outside the scope of what I want to show in this article.

A list of metacharacters

  1. [ opening bracket
  2. ] closing bracket
  3. \ backslash
  4. ^ caret
  5. $ dollar
  6. . period
  7. | pipe symbol
  8. ? question mark
  9. * asterisk
  10. + plus
  11. ( opening parentheses
  12. ) closing parentheses



Can you escape any character?

You might think it's harmless to escape a character that is not a metacharacter, but this would cause problems. For example, the pattern \d is shorthand that means a single digit. You could use [0-9] instead, but \d is more compact.

Note - some implementations of regex treat { and } as special characters.

In many regex implementations, there is a special way to escape a huge chunk of data that contains multiple special characters. This is handy because otherwise all those backslashes make the literal hard to read. This trick is performed with \Q and \E which you can think of as "Quote" and "End quote".

Let's assume you want to match ^5$3hg\d+

This pattern would work: \Q^5$3hg\d+\E

.. and that trailing \E is optional since the literal terminates at the first non-match anyway.

This would also work \^5\$3hg\\d\+

but that's harder to read.

But some characters are invisible!

Yes - the tab character is hard to see. So is a carriage return, and a space. So these kinds of characters have special patterns.

\t tab
\r carriage return
\n line feed
\a bell ( ding ding !)
\e the escape character - as in what is generated when you hit the escape key.
\f form-feed
\v vertical tab


and finally, an ascii code can represent any special character -- like the copyright symbol:

\xA9


And then again, there is another shorthand trick that many regex implementations support. Let's say you want to represent Control-H or Control-Y as would be generated by pressing the control key on the keyboard along with either of those letters. The pattern would be

\cH

and

\cY

This is nicer to read than the numeric ascii code.

Unicode

Unicode uses an extra byte to allow for international character sets. For example, the European currency sign € has a code of \u20AC because \u means 'unicode' and 20AC are two hexadecimal 8-bit bytes that represent that character.

There is a big fat set of unicode tables available at http://www.tamasoft.co.jp/en/general-info/unicode.html

These codes let us write symbols like this: ᙏ -- providing the character-set in use allows this. However, if you are parsing data that could come from any source, then if the unicode is part of the data, then you can legitimately search for any of these codes.

Unicode characters inside a character set are allowed as in this example which matches a unicode character or the letter x:

[x\u20AC]

Greedy, Lazy or Eager, and engine types

There are two terms that you may see and these are 'greedy mode' or 'eager mode'. The background to this is a little complex. "Greedy" is applied to any search that continues to the end of the line looking for a best match, even if it found an acceptable match already. "Eager" means that processing stops as soon as a match is found, but BOTH will have to backtrack to try all permutations of the regex pattern.

There are two broad kinds of regex engine. One is called a Regex-Directed engine and the other is a Text-Directed engine [ note: text-directed engines do not backtrack or support lazy operators - see later ].

You can test your regex type by searching through this sample text:

Happy accident

using the pattern Happy|Happy accident

If you get Happy back, then it's a regex that is eager to return the very first string that satisfies any permutation of the supplied search pattern. If you get Happy accident back, it's greedy.

Greedy matches tend to be slower because they potentially process the same characters multiple times.

Why would you want a greedy match?

Sometimes, the first match is a match, but there could be a "better" match later on. In that case, 'greedy' is appropriate.

These regex patterns are greedy:

?
+
*
{}


You can force these to be lazy, ungreedy, reluctant or non-greedy. (All terms mean the same thing). It's done with a question mark after the greedy pattern.

e.g.

".+" matches "def" 123 "pqr"

in

abc "def" 123 "pqr" xyz

but ".+?" matches "def" then "pqr" as you probably intended.

There is an efficiency penalty for using greedy regex, or greedy regex that has been told to be lazy. This is because of the back-tracking under conditions where the lazy regex is forced to expand the search. So we can make a more efficient pattern using negation:

This "[^"]+" says, "find a quote, then give me everything that is not a quote until you find a quote. (At least once). There is no backtracking and in tight loops this will be noticeably more efficient.

Possessive Quantifiers vs Atomic Groups

Greedy and Lazy backtracking comes with a performance cost that is noticeable in tight loops. An implementation-specific quantifier called possessive may be available and allow you to prevent the engine from trying all permutations or eliminate certain matches.

These posessive quantifiers are most useful to speed up failures to match since this is typically when a lot of work might be done to be *really sure* that there is no match, and it's most noticeable with nested quantifiers. If we can be sure that an early failure without the backtracking is appropriate, then some performance improvement is available.

Here are two patterns where the second is identical but possessive and will NOT work as expected because the dot is able to match the trailing quote and backtracking is prevented.

".*"

".*+"

Atomic Grouping

a*+ is the syntax for possessive quantifier and the equivalent for an atomic group is

(?>a*)

... and those brackets define the atomic group so they are mandatory.

(a(?>bc|b)c will match abcc

If you can arrange your alternative patterns in an order such that the first is a super-set of the next which is a super-set of the next and so on, then failing the first one should terminate the search but unless you deliberately prevent the backtracking, the regex engine will try all patterns. Here is an example:

\b(Container|Contain|Cont|Con)\b will NOT match Containers (Because of the word boundaries) but the regex engine tries every pattern, using backtracking. We can make it fail as soon as the first pattern is found not to match Containers by using atomic grouping:

\b(?>Container|Contain|Cont|Con)\b

Shorthand

We already noted that \d is shorthand for [0-9]. There are some extra ones like for a word. But before you use these, you need to understand which characters are included in a word. It's tempting to think that a word is all capital and lower case letters

[A-Za-z] but in most implementations, at least the underscore and digits are included:

[A-Za-z0-9_]

\s is for a certain set of non-printing characters that we call 'white space' and these include tab or a line-break with a character set of [ \t\r\n] But this is just a guide. You need to know your particular engine to know whether vertical tabs or form feeds are included or whether new-line is excluded and so on. When you use unicode sets things are even more complicated because of the wide range of multinational characters. Typically, you can expect things like numeric operators, copyright or trademark symbols , quote marks and currency will be excluded from \w but in the end - read the particular documentation or try it out.

In this example \w+ matches words, and \s matches white space. We can see that the underscore is included in the definition of a word.
In this example \w+ matches words, and \s matches white space. We can see that the underscore is included in the definition of a word.

Negative shorthand

That's an odd concept! What we mean here is that you can use either [^\d] which means "Not a digit or you can use a more compact version \D

  • \D means NOT a digit
  • \W means NOT a word
  • \S means NOT white space

Repetition

Like any usable programming language, regex needs some control constructs to be useful. In most we have loops, like a "for" or a "do" loop. In regex, repetition is attached to a set as follows:

[0-9]+ will match 444 or 3333377 or 8978 etc.

A dot matches almost any single character except newline(s). This implies that you could use [^\n] but that's rather verbose and confusing. Unfortunately, in windows, [\r\n] is considered a newline so it's not really as clear-cut as it could be. You need to consider the platform form which the data was generated, or cater for both possibilities in your regex patterns.

To circumvent this historical line-mode searching and allow the dot to include newlines, there are various switches that you can apply to include newline depending on the regex implementation. Consult the relevant documentation to find this switch.

We can also get tricky. Recall the shorthand notation, and their negation. In the same way that the set of all positive numbers and all negative numbers make up the entire set of integers, The set of all spaces, and not spaces means any character. Therefore [\s\S] is a valid way to construct a 'dot' that matches any character including newline.

Only use DOT when you truly mean ANY character. If there is a delimiter or data with a limited set of possible values, then specify that set properly.

How good is good enough?

Never forget that regex is a programming language. It accepts data, performs algorithms and returns a result. With any code-block, the algorithm has limitations. For example the square root of -6 won't work if the input and output are integers. It will work if the algorithm caters for complex numbers.

With regex, a similar situation exists. Consider a date format. 01-12-2012

We don't know from looking at this whether is means the 1st of December or the 12th of January. So you need to know how the data is formatted as it comes into the regex. In some cases, the data that comes in is from a source that is very strict about the format and if that is true, then your regex can be simple because you can assume that the input data is always well formed.

However, if the data comes from a person and the input routine has no date-verification, then your regex could get a "date" in the form 99-12-0000 and a simple regex would fail to parse-out meaningful results. In regex, you don't want to get into a situation where something that looks like a date (but it not a date) is assumed to be a date.

The same argument holds for other kinds of data like people's age which should never be negative, and is always less than 150. There are similar problems to address for IP addresses.

Example date regex

Let's assume that a program generates a date in the form YYYY-MM-DD or YYYY/MM/DD and these dates are perfectly formed in every case.

You need a regex that will cope with two possible delimiters, either the hyphen or slash.

This will do it, and it's certainly not the only or even the best or most compact.

(\d+-\d+-\d+|\d+/\d+/\d+)

The round brackets contain a collection of alternative regex expressions. Each of those are separated by a pipe symbol. It will not match 2011-4/6 which is reasonable because you would hope that the same delimiter - either slash or hyphen would be used consistently.

Here is the parse-tree for this regex:

Regex parse tree for (\d+-\d+-\d+|\d+/\d+/\d+)
Regex parse tree for (\d+-\d+-\d+|\d+/\d+/\d+)

Double delimiters

Quoted strings are double delimiters and they can complicate regex parsing. Consider trying to extract everything between The first and last quote in the following:

yyy "abc "12"
frg" xxx

It's complicated by having quotes inside quotes, and a newline character.


Anchors

Anchors match position, not characters.

A string may contain newline characters. The length of the string is defined in various ways appropriate to the scripting language in use. For example, in C, a string is terminated with a null character, and therefore, a single string could contain multiple lines as in:

abc\ndef\nghi

^ matches the position to the left of the first character in a string. (But this can be extended in multi-line mode)

$ matches the position to the right of the last character in a string. (But this can be extended in multi-line mode).

Therefore the regex ^123$

matches "123" but not " 123" or "123 " or " 123 ".

There are fixed versions that cannot be extended:

\A - only ever matches the start of the string.

\Z - only ever matches the end of the string. But in many engines, there is a special case. If the last character read into a string is a line-break -- it acts as if the trailing line-break is not there. This is because some languages read in a line and (annoyingly) keep the terminating newline character in the string. It is convenient to allow the regex engine to assume that the newline character at the end of the string should not be there.

\z - Does the same as \Z but only ever matches the real end of the string even if there is a trailing newline.

In most regex engines:

\b - Forces a position match on a word boundary. Recall that the \w regex is shorthand for [a-xA-Z0-9] , the \b anchor forces the regex to NOT match substrings within words. Therefore in the string example below, the search regex \b44\b will match 44 before the :, but NOT the 44 before the underscore _ because the underscore _ is considered part of a word.

klsj44dlkjdskj kj 44_ 44: 4 lkj ldskj lsdj
(It only matches this ^^)


\B - this is the negated version of \b

But the scripting language TCL (and JGsoft) use \b for "Backspace" and \B to mean \

TCL uses \y and \Y

multi-line mode

In editors - like Emacs, ^ and $ always behave as described above, but in line editing tools and programming languages that call regex engines, you can often find a special switch that will extend the ^ and $ anchors to match before and after newline characters as well as the true beginning and end of the string.

(E.g. in Perl, the switch is an m.)

Matching modes

These are implementation-dependent mode switches. You will need to consult the relevant documentation for use. There are more (or fewer) depending on the language used.

/i  switches the mode to case-insensitive.
/s  is single-line mode (The dot matches newlines)
/m  is multi-line mode where ^ and $ match before 
    and after newlines in a string.


As an example, (?is) might turn on case insensitive and single-line mode. If this appears in the middle of a regex pattern then the effect is switched on for the remainder (to the right) of the pattern. (?-i) would turn off the modifier. To make things even more irritatingly complicated, some engines apply the switch to the entire pattern anyway and there is no 'off' switch.

(?i)y will match y or Y

The syntax in the two following example patterns are equivalent. The second (with the colon) is called a modifier-span.

(?i)xxx(?-i)YyYy(?i)xxx  .... would match xXXYyYyXxX and so would
(?i)xxx(?-i:YyYy)xxx



Alternation

Alternation means "this or that" and generally uses the pipe symbol |

When you use braces (), you can group these alternations and apply something like a word-boundary directive to all of the alternatives. E.g. \b(a|b|c|d)\b matches a, b, c or d only if they are whole words while a|b|c|d would match these letters even if they are embedded in words.

Sometimes, when the engine is eager the order of these alternatives is important. Therefore you need to list the most specific match first. e.g.

for the string:

a big brown cat is aa aaa

a|aa|aaa

will only ever be the same as a - that is, it will only ever match single letters of a. So in the example, there are 7 letter a, and the search will pick them one by one.

However,

aaa|aa|a is different.

It will match first, a, then the a in cat then aa, then aaa. This is most likely what you really wanted to do.

An alternative method is to use a list of optional matches like this:

a(a)?(a)?

Which exploits the greedy nature of the question mark.

A slightly different but predictable result is obtained with \b(a|aa|aaa)\b where this will match the first a, the aa and the aaa but not the one embedded in cat.

Backreferences

When you use (), it creates a backreference. It's the kind of thing you do when solving a maze, that is you come to a decision point, store a little information at the reference point to come back to later, and then make a decision. This comes with a performance penalty, and if you do not need backreferencing, then use ?: in the right position -- that is just after the opening parenthesis.

Bob(?:ajob)?

How can we use this?

This regex \b(\w+)\s+\1\b will find words that are repeated as in the following text:

Enter the the dragon.

Let's break it down:

\b    --> operate on word boundary
(\w+) --> match all of a word
\s+   --> skip past one or more whitespace
\1    --> This is where we exploit the backreference. 
 It recalls the stored and previously 
 matched word - in this case the word "the"
\b --> operate on word boundary

Note: You cannot specify backreferences inside a character class. This means that (s)[\1x] will not mean to reference a stored result.

Capture Groups

A group is that which is contained by parentheses.

h(a|o)t will match hat or hot.

You can nest groups:

s(h(o|a)t) will match hat or hot or shat or shot

In this case there are three groups. An implied group(0) which is the entire match, then group(1) which is h something t, and group(1) which is o or a. The way that you access these stored matches is dependent on the language used to drive the particular regex engine so you would need to read the appropriate documentation for the language.

In Python, we can name these groups and that makes it easier to access when there are complex expressions.

(?P<name>group)

Then, later in the regex, you can use un-named references like \1 or the name using the syntax: (?P=name)

Example use: (?P<x>(yyy))(?P=x) matches yyyyyy

Lookaround

Lookaround is look-ahead and look-behind, a.k.a. zero-width-assertions.

Let's assume you are parsing var>123 var<223 and you are looking for var only if it is followed by >. This is positive lookahead and the pattern var(?=>) will do it, without actually matching the greater-than-sign to cleanly isolate the word var. The negative of this will match everything that is not a greater-than-sign. var(?!>) [ It does NOT use a backreference ]

It's ok to use regex inside the lookahead as in var(?==(+|#)) This will find var if it is followed by + or #

Nagative Lookbehind

To find all letter u except those belonging to a q as in queen, you can use lookbehind.

(?<!q)u will find the first u but not match the second in the string: a b c u queen

If we want to find only the u that is preceded by q, then change ! to =. This is positive lookbehind. (?<=q)u

Another example:

Find all words not ending in a K. \b\w+(?<!K)\b

This reads, "On word boundaries: Match strings that are composed of letters, numbers or underscore, but not any that end in K"

NOTE: Forward looks can operate with any regex -- including those with indeterminate termination as in a repetition pattern, but reverse looks are (in many engines) restricted to patterns of fixed length like literals and character-classes. Some regex engines do not support lookbehind.

NOTE: Lookaround terminates the search when it is satisfied, so this makes it abandon any backtracking -- i.e. don't try to reference backtracks in a capturing group that you might have expected to be available after using lookaround to find a match.

Let's assume you want to find only 4-digit long numbers that have a 1,2,3 in them, in order. We could do this by breaking it into all permutations:

123\d|\d123

This is ok, but you can see that to list every permutation is not practical for anything much more complex. How long would this pattern be for 20 digit strings? So you can use lookaround instead.

\d{4} will match all 4-digit numbers. That's one component.

\b\d*123\d*\b Does a nice job of finding all isolated numbers of any length, and containing 123.

Combining these gives: (?=\d{4})\b\d*123\d*\b

It reads, "Lookahead for any 4-digit numbers as a condition to search for 123 preceded or succeeded by any digits."

Although that is more complex than 123\d|\d123 , to do this for 20 digits it is a trivial change.

Find any word between 3 and 8 letters long containing ie or ei

\b(?=\w{3,8}\b)\w(ie|ei)

if - then - else

This is also called ternary operation because it is composed of three parts.

Given the text:

3
axxx
3xxx
avvvv
3vvvv
ahhhhh
3hhhhh
g3htya

Let's try to match the letter a every time, but only match the 3 if it is followed by a string 5 or more long.

(?(?=\w{5})3|a) will do it. The general syntax is (?(?=regex)then|else)

So it will find a every time that the condition \w{5} fails. If \w{5} is satisfied - that is a look-ahead of 5 characters contains A-Z , a-z , or _, then look for a 3

It will find every a, but only the following strings satisfy the requirement for 3: 3vvv, 3hhahhh and 3ghtya


More by this Author


Comments

No comments yet.

    Sign in or sign up and post using a HubPages Network account.

    0 of 8192 characters used
    Post Comment

    No HTML is allowed in comments, but URLs will be hyperlinked. Comments are not for promoting your articles or other sites.


    Click to Rate This Article
    working