Wednesday, July 22, 2015

Generators in Python

I was trying to construct a two dimensional code and encountered the following problem:

>>> l = list()
>>> l.append([] for i in range(5))
>>> l[0].append(1)
Traceback (most recent call last):
  File "", line 1, in 
AttributeError: 'generator' object has no attribute 'append'

Hmm, I thought I was creating an empty list and append another 5 empty lists. However:

>>> l
[<generator genexpr="" object=""> at 0x102077fc0>]

Recall my conversation with my friend the day before about generators, I decide to dig a little bit deeper.

It turns out I am using the generator expression, and thus instead of creating a list of lists, I created a generator object.

So what is a generator function?
In Python, generator functions allow for more efficient memory use. Consider we need to get the range from 0 to a very large integer (I just realize in Python 3 there is no maximum integer), and we need operate on this range of number (e.g., print them out). We can either construct a list and operate this list, which will consume all of our memory or, we can simply generate the number and operate it at the same time. For example, we can generate the number, print it out, and only remember the current number so that we can do the next operation. Take a look at the following class:

class firstn(object):
    def __init__(self, n):
        self.n = n
        self.num, self.nums = 0, []

    def __iter__(self):
        return self

    # Python 3 compatibility
    def __next__(self):
        return self.next()

    def next(self):
        if self.num < self.n:
            cur, self.num = self.num, self.num+1
            return cur
        else:
            raise StopIteration()

The class constructs an iterator, but it does not store the each element in the memory. In fact, it only stores the current element and the element prior to it. When operating on this iterator, all we need to do is to operate, store the current result and generate the next element. This saves a lot of memory.

However, writing in this way is somehow cumbersome. Python provides an easier way to do it --yield:

>>> def firstn(n):
...     num = 0
...     while num < n:
...         yield num
...         num += 1
... 
>>> firstn(100)
<generator object firstn at 0x10207e0d8>

The firstn(n) function is written in the generator function manner.

We can even skip writing a function by using a generator expression:

>>> (n for n in range(5))
<generator genexpr="" object=""> at 0x10207e048>
>>> [n for n in range(5)]
[0, 1, 2, 3, 4]

As you can see, the only difference between a generator and a list constructor is () versus []. And this is the reason I encountered the problem I mentioned at the beginning of the post.

However, even though there are lots of perks by using generators, do remember such iterator (remember a generator is still an iterator) can only be iterate once. Because the elements are not saved in the memory, after the operation, the elements are gone. You have to create a generator (or call the function if you write a generator function) if you want to perform the operations again.

Last thing, how to solve the above-mentioned problem?

Not this way if you are thinking of it:

>>> l.append([[] for i in range(5)])
>>> l
[[[], [], [], [], []]]

append() appends the item (which is a list of lists you have just constructed) to l, which results a list that has an element of a list of lists...

either extend() or += should work:

>>> l.extend([[] for i in range(5)])
>>> l
[[], [], [], [], []]
>>> l += [[] for i in range(5)]
>>> l
[[], [], [], [], [], [], [], [], [], []]

References:
[1] https://wiki.python.org/moin/Generators
[2] http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python
[3] http://stackoverflow.com/questions/5164642/python-print-a-generator-expression

Sunday, July 19, 2015

Regular Expression 101

A small task in my new job requires me to match a file name. there are lots of ways to do it, and regex is among one of them. More important, it gives me a reason to finally go through the basics of regex.
This blog contains the basic concepts of regex. If you only want to get your feet wet, this is enough, otherwise please refer to the reference. :)


Twelve special characters
backslash \
caret ^
dollar sign $
dot .
vertical bar |
question mark ?
asterisk *
plus +
opening parenthesis (
closing parenthesis )
opening square bracket [
opening curly brace {

In order to match these special characters, an extra backslash is needed. e.g., 1 \+ 1 = 2 matches 1 + 1 = 2, a\\b matches a\b.

Character classes
Matches only one out of several characters. For example: "gr[ae]y" matches "gray" or "grey" but not "graey" or "greey".

A hyphen can be used to specify a range of characters:
[0-9a-fA-F] matches a single hexadecimal digit.
[0-0a-fxA-FX] matches the letter X or a hexadecimal digit.

A caret indicates does not match.
"q[^x]" matches any character after "q" except "x". For example, "qu". However, it does not match single character "q".


Shorthand character classes
\d matches a single character that is a digit.
\w matches a "word character", which is defined as all alphanumeric characters(0-9 + a-z + A-Z) plus underscore.
\s matches a white space character ([\r\n\t\f], \r: carriage return, \f: form feed, \n: new line).
Capital indicates negated versions.
\D = [^\d]
\W = [^\w]
\S = [^\s]

Non-printable characters
Special characters like \r \n \t can also be used in regex. If the application supports Unicode, use \uFFFF or \x{FFFF} to insert a Unicode character. For example, \u20AC or \x{20AC} matches the euro currency sign.
If the application does not support Unicode, use \xFF to match a specific character by its hexadecimal index in the character set. \xA9 matches the copyright symbol in the Latin-1 character set.

The dot matches any single character
except line break characters. equals[^\n] in Unix or [^\r\n] in Windows.
For example, gr.y matches gray, grey, gr&y, etc.

Anchors
^ matches the position before the first character in the string while $ matches the position after the last character. \b indicates word boundary. It allows you to perform a "whole word only" search. More specific, \b matches before and after an alphanumeric sequence. \B is the negated version of \b.
^a indicates a string starts with "a", e.g.,  "abc".
c$ indicates a string ends with "c", e.g., "abc".
^\s+ matches leading whitespace and \s+$ matches trailing whitespace.
\b4\b matches "4" but not "44" or "a4".

When there are multiple lines in one string, e.g., lineone\nlinetwo, ^ and $ can be used to match each line, rather than the entire string. e.g., $ can match the position before \n and after "e" as well as end of the string. Likewise, ^ can match the start of the string and the position after \n and before "l" in the second line.
In Java, Python PHP as well as Perl, $ matches the position before line break as well as the very end of the string, regardless of whether the character is a line break. For example, "^\d$" matches "1" or "1\n".

\A and \Z are permanent start of string and end of string anchors, they cannot match line breaks. However, JavaScript, XML POSIX or XPath does not support these two tokens. In Python, \Z matches at the very end of the string("^\d$" matches "1\n" or "1"). In other languages such as Java and PHP, \Z matches positions before the final line break ("^\d$" matches "1\n" or "1"), use \z (lower case) to match the absolute end of string ("\A\d\z" only matches "1" but not "1\n" because the end of string is a line-breaking token not a digit).

Alternation
"cat|dog" matches "cat" in "cats and dogs". If the regex is applied again, it matches "dog".

Repetition
? makes the preceding token in the regex optional.
"colou?r" matches "color" or "colour".
* tells the engine to attempt to match the proceeding token zero or more times.
+ matches the proceeding token one or more times.
"<[A-Za-z][A-Za-z0-9]*>" matches an HTML tag without any attributes.
"<[A-Za-z0-9]+>" also matches an HTML tag, but may also match an invalid one such as "<1>".
Curly braces are used to specify an amount of repetition.
"\b[1-9][0-9]{3}\b" matches a number between 1000 and 9999.
"\b[1-9][0-9]{2,4}\b" matches a number between 100 and 99999.

Greedy and Lazy Repetition
The repetition operators are greedy. They expand the match as far as they can, and only give if they have to satisfy the remainder of the regex.
For example, the regex "<.+>" matches "<EM>first</EM>" in "This is a <EM> first</EM> test". Why? Let's say s = "<.+>" and p = "This is a <EM> first</EM> test". Regex engine finds the first "<", which is "<" before "E". Now since "+" repeats the "." one or more times, the engine tries to match "." as much as possible, until the end of p ("<EM> first</EM> test"). Then it tries to match">" in s, which is not a match. Note for now ".+" matches "EM> first</EM> test" in p.
The engine then backtracks by reducing the repetition of "." by one, which matches ".+" to "EM> first</EM> tes", and it is still not a match. The engine then repeatedly reduces "." until it can match ">", which is ""<EM> first</EM>".

So what if we want to match a "<EM>"?
We can make the repetition lazy by adding "?", i.e., "<.+?>". This tells the engine to repeat "." as few times as possible. The minimum is one. Using the above example. After the engine matches "<", it tries to proceed to match ">" with "E", which is not a match, then it repeat "." by one, and matches ">" with "M", still not a match. It continuously repeats "." until it matches ">", which now matches"<.+?>" to "<EM>".

Alternatively, in this case, we can use "<[^>]+>" to match an HTML tag. This regex matches anything but ">" one ore more times after the "<" is matches. The perk of this is to avoid backtracking, which increases the speed of the regex engine.

\Q...\E Escape sequence
\Q...\E sequence can be used to escape a string of characters, matching them as literal characters. The escaped characters are treated as individual literal characters.
For example, "\Q*\d+*\E" matches "*\d+*" since all characters between "\Q" and "\E" are not treated as special characters.
When applying repeating characters such as "+" or "{3}", it will only be applied to the last character. e.g., "\Q*\d+*\E+" matches "*\d+**" in "*\d+**\d+*".


Grouping and Capturing
Place parentheses around multiple tokens to group them together. A quantifier can also be applied. Parentheses create a capturing group. Use special syntax (?:string) to group tokens without creating a capturing group. This is more efficient if you don't plan to use the group's contents. For example[1], considering pattern:
p = "http://stackoverflow.com/questions/tagged/regex".

If we apply regex:
"(http|ftp)://([^/\r\n]+)(/[^\r\n]*)?"

We will have the result:
Match "http://stackoverflow.com/questions/tagged/regex"
Group 1: "http"
Group 2: "stackoverflow.com"
Group 3: "/questions/tagged/regex"

However, if we don't need the protocol, we can use "(?:)" to create a non-capturing group:
regex: "(?:http|ftp)://([^/\r\n]+)(/[^\r\n]*)?"

Match "http://stackoverflow.com/questions/tagged/regex"
Group 1: "stackoverflow.com"
Group 2: "/questions/tagged/regex"


Using backreferences to match the same text again
Backreferences  match the same text as previously matched by a capturing group. For example, "<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1> matches a pair of opening and closing HTML tags, and the text (if there is any) in between. If "<([A-Z][A-Z0-9]*)\b[^>]*>" matches "<b>" (bold text) then "</\1>" matches "</b>".

Backreferences are represented by "\" plus number. The first group starts backreference number one, i.e., "\1", the second group "\2". Non-capturing groups are skipped. Most regex engines support up to 99 capturing groups and double-digit backreferencs. So "\99" is valid if there are 99 capturing groups.

However, if the regex has many groups, it can be cumbersome to track their numbers. Alternatively, we can name our groups. In Python,  (?P<mygroup>[abc]) = (?P=mygroup) is the same as ([abc]) =\1
The HTML tag example above can now be written as "<(?P<tag>[A-Z][A-Z0-9]*)\b[^>]*>.*?</?P=tag>

The same backreference can be reused more than once. "([a-c])x\1x\1" matches "axaxa" or "bxbxb" or "cxcxc".

Unicode properties
"\p{L}" matches a single character that is in the given Unicode category. "L" stands for letter. "\P{L}" matches a single character that is not in the given Unicode category. See details.

Lookaround
The tokens inside the group are matched but not kept, it tends to match the syntax outside the group.
Look ahead:
"c(?=a)" matches any "c" that follows by an "a". Thus it matches "c" in "cat" but not "c" in "click".
"c(?!a)" matches any "c" that does not follow by an "a". Thus it matches "c" in "click" but not "c" in "cat".
Look behind:
"(?<=a)b" matches any "b" that has a proceeding "a", e.g., "b" in "abc".
"(?<!a)b" matches any "b" that has a proceeding character that is not "a", e.g., "cbc" but not "abc".


References
[1] https://stackoverflow.com/questions/3512471/non-capturing-group/3513858#3513858
[2] http://www.regular-expressions.info/