Sunday, July 19, 2015

Regular Expression 101

A small task in my new job requires me to match a file name. there are lots of ways to do it, and regex is among one of them. More important, it gives me a reason to finally go through the basics of regex.
This blog contains the basic concepts of regex. If you only want to get your feet wet, this is enough, otherwise please refer to the reference. :)


Twelve special characters
backslash \
caret ^
dollar sign $
dot .
vertical bar |
question mark ?
asterisk *
plus +
opening parenthesis (
closing parenthesis )
opening square bracket [
opening curly brace {

In order to match these special characters, an extra backslash is needed. e.g., 1 \+ 1 = 2 matches 1 + 1 = 2, a\\b matches a\b.

Character classes
Matches only one out of several characters. For example: "gr[ae]y" matches "gray" or "grey" but not "graey" or "greey".

A hyphen can be used to specify a range of characters:
[0-9a-fA-F] matches a single hexadecimal digit.
[0-0a-fxA-FX] matches the letter X or a hexadecimal digit.

A caret indicates does not match.
"q[^x]" matches any character after "q" except "x". For example, "qu". However, it does not match single character "q".


Shorthand character classes
\d matches a single character that is a digit.
\w matches a "word character", which is defined as all alphanumeric characters(0-9 + a-z + A-Z) plus underscore.
\s matches a white space character ([\r\n\t\f], \r: carriage return, \f: form feed, \n: new line).
Capital indicates negated versions.
\D = [^\d]
\W = [^\w]
\S = [^\s]

Non-printable characters
Special characters like \r \n \t can also be used in regex. If the application supports Unicode, use \uFFFF or \x{FFFF} to insert a Unicode character. For example, \u20AC or \x{20AC} matches the euro currency sign.
If the application does not support Unicode, use \xFF to match a specific character by its hexadecimal index in the character set. \xA9 matches the copyright symbol in the Latin-1 character set.

The dot matches any single character
except line break characters. equals[^\n] in Unix or [^\r\n] in Windows.
For example, gr.y matches gray, grey, gr&y, etc.

Anchors
^ matches the position before the first character in the string while $ matches the position after the last character. \b indicates word boundary. It allows you to perform a "whole word only" search. More specific, \b matches before and after an alphanumeric sequence. \B is the negated version of \b.
^a indicates a string starts with "a", e.g.,  "abc".
c$ indicates a string ends with "c", e.g., "abc".
^\s+ matches leading whitespace and \s+$ matches trailing whitespace.
\b4\b matches "4" but not "44" or "a4".

When there are multiple lines in one string, e.g., lineone\nlinetwo, ^ and $ can be used to match each line, rather than the entire string. e.g., $ can match the position before \n and after "e" as well as end of the string. Likewise, ^ can match the start of the string and the position after \n and before "l" in the second line.
In Java, Python PHP as well as Perl, $ matches the position before line break as well as the very end of the string, regardless of whether the character is a line break. For example, "^\d$" matches "1" or "1\n".

\A and \Z are permanent start of string and end of string anchors, they cannot match line breaks. However, JavaScript, XML POSIX or XPath does not support these two tokens. In Python, \Z matches at the very end of the string("^\d$" matches "1\n" or "1"). In other languages such as Java and PHP, \Z matches positions before the final line break ("^\d$" matches "1\n" or "1"), use \z (lower case) to match the absolute end of string ("\A\d\z" only matches "1" but not "1\n" because the end of string is a line-breaking token not a digit).

Alternation
"cat|dog" matches "cat" in "cats and dogs". If the regex is applied again, it matches "dog".

Repetition
? makes the preceding token in the regex optional.
"colou?r" matches "color" or "colour".
* tells the engine to attempt to match the proceeding token zero or more times.
+ matches the proceeding token one or more times.
"<[A-Za-z][A-Za-z0-9]*>" matches an HTML tag without any attributes.
"<[A-Za-z0-9]+>" also matches an HTML tag, but may also match an invalid one such as "<1>".
Curly braces are used to specify an amount of repetition.
"\b[1-9][0-9]{3}\b" matches a number between 1000 and 9999.
"\b[1-9][0-9]{2,4}\b" matches a number between 100 and 99999.

Greedy and Lazy Repetition
The repetition operators are greedy. They expand the match as far as they can, and only give if they have to satisfy the remainder of the regex.
For example, the regex "<.+>" matches "<EM>first</EM>" in "This is a <EM> first</EM> test". Why? Let's say s = "<.+>" and p = "This is a <EM> first</EM> test". Regex engine finds the first "<", which is "<" before "E". Now since "+" repeats the "." one or more times, the engine tries to match "." as much as possible, until the end of p ("<EM> first</EM> test"). Then it tries to match">" in s, which is not a match. Note for now ".+" matches "EM> first</EM> test" in p.
The engine then backtracks by reducing the repetition of "." by one, which matches ".+" to "EM> first</EM> tes", and it is still not a match. The engine then repeatedly reduces "." until it can match ">", which is ""<EM> first</EM>".

So what if we want to match a "<EM>"?
We can make the repetition lazy by adding "?", i.e., "<.+?>". This tells the engine to repeat "." as few times as possible. The minimum is one. Using the above example. After the engine matches "<", it tries to proceed to match ">" with "E", which is not a match, then it repeat "." by one, and matches ">" with "M", still not a match. It continuously repeats "." until it matches ">", which now matches"<.+?>" to "<EM>".

Alternatively, in this case, we can use "<[^>]+>" to match an HTML tag. This regex matches anything but ">" one ore more times after the "<" is matches. The perk of this is to avoid backtracking, which increases the speed of the regex engine.

\Q...\E Escape sequence
\Q...\E sequence can be used to escape a string of characters, matching them as literal characters. The escaped characters are treated as individual literal characters.
For example, "\Q*\d+*\E" matches "*\d+*" since all characters between "\Q" and "\E" are not treated as special characters.
When applying repeating characters such as "+" or "{3}", it will only be applied to the last character. e.g., "\Q*\d+*\E+" matches "*\d+**" in "*\d+**\d+*".


Grouping and Capturing
Place parentheses around multiple tokens to group them together. A quantifier can also be applied. Parentheses create a capturing group. Use special syntax (?:string) to group tokens without creating a capturing group. This is more efficient if you don't plan to use the group's contents. For example[1], considering pattern:
p = "http://stackoverflow.com/questions/tagged/regex".

If we apply regex:
"(http|ftp)://([^/\r\n]+)(/[^\r\n]*)?"

We will have the result:
Match "http://stackoverflow.com/questions/tagged/regex"
Group 1: "http"
Group 2: "stackoverflow.com"
Group 3: "/questions/tagged/regex"

However, if we don't need the protocol, we can use "(?:)" to create a non-capturing group:
regex: "(?:http|ftp)://([^/\r\n]+)(/[^\r\n]*)?"

Match "http://stackoverflow.com/questions/tagged/regex"
Group 1: "stackoverflow.com"
Group 2: "/questions/tagged/regex"


Using backreferences to match the same text again
Backreferences  match the same text as previously matched by a capturing group. For example, "<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1> matches a pair of opening and closing HTML tags, and the text (if there is any) in between. If "<([A-Z][A-Z0-9]*)\b[^>]*>" matches "<b>" (bold text) then "</\1>" matches "</b>".

Backreferences are represented by "\" plus number. The first group starts backreference number one, i.e., "\1", the second group "\2". Non-capturing groups are skipped. Most regex engines support up to 99 capturing groups and double-digit backreferencs. So "\99" is valid if there are 99 capturing groups.

However, if the regex has many groups, it can be cumbersome to track their numbers. Alternatively, we can name our groups. In Python,  (?P<mygroup>[abc]) = (?P=mygroup) is the same as ([abc]) =\1
The HTML tag example above can now be written as "<(?P<tag>[A-Z][A-Z0-9]*)\b[^>]*>.*?</?P=tag>

The same backreference can be reused more than once. "([a-c])x\1x\1" matches "axaxa" or "bxbxb" or "cxcxc".

Unicode properties
"\p{L}" matches a single character that is in the given Unicode category. "L" stands for letter. "\P{L}" matches a single character that is not in the given Unicode category. See details.

Lookaround
The tokens inside the group are matched but not kept, it tends to match the syntax outside the group.
Look ahead:
"c(?=a)" matches any "c" that follows by an "a". Thus it matches "c" in "cat" but not "c" in "click".
"c(?!a)" matches any "c" that does not follow by an "a". Thus it matches "c" in "click" but not "c" in "cat".
Look behind:
"(?<=a)b" matches any "b" that has a proceeding "a", e.g., "b" in "abc".
"(?<!a)b" matches any "b" that has a proceeding character that is not "a", e.g., "cbc" but not "abc".


References
[1] https://stackoverflow.com/questions/3512471/non-capturing-group/3513858#3513858
[2] http://www.regular-expressions.info/


No comments:

Post a Comment