Sunday, May 31, 2015

K-means algorithm

K-means is a standard clustering method in machine learning. It aims to partition n observations in d dimension into k clusters in which each observation belongs to the cluster with the nearest mean.

The algorithm provided here uses standard Lloyd's algorithm. It is provided by (not me) Dpark as an example. Dpark is a Python clone of Spark written by a group of enthusiastic developers. Even though Spark now provides PySpark as a Python API for Spark, this project is still an impressive one. Enough credits should be given. :)

Note, this is not a tutorial about Spark or Dpark, for more information about Spark in Pyton, check PySpark doc or Dpark manual (in Chinese).

The original source code is from here.

The Llyord algorithm

  • Input: 
    • a set of n data points
    • initialize randomly k centroids
  • Repeat until convergence:
    • set point i to the nearest centroid based on Euclidean distance
    • update cluster centroid to be the mean of all the points assign to it

The main function

Here are some Spark methods used in the main function, the link on the name of the function directs to PySpark documentation, the second link directs to Dpark src.
textFile(): read file from file system. src
cache(): cache this RDD. src
map(): map the RDD to a new RDD based on the provided function. src
In the K-means implementation:

mappedPoints = points.map(lambda p: (closestCenter(p, centers), (p, 1)))

maps each point to the function closestCenter(p, centers), which assigns the point to the closest center, and return a key-value pair, where the key is the centroid index and the value is the tuple (point, 1), 1 is the count.

reduceByKey(): Merge the values for each key using the input function. src
In the K-means implementation:

mappedPoints.reduceByKey(
                lambda (s1,c1),(s2,c2): (s1+s2,c1+c2)
            )

aggregates the mappedPoints (key-value pair) based on the function (sum(points), sum(count)), and returns (centroid index, (sum of points, sum of counts))

collectAsMap(): return the key-value pair src


if __name__ == '__main__':
    # initialization
    D = 4  # d dimension
    K = 3  # number of clusters
    IT = 10  # number of iterations
    MIN_DIST = 0.01  # threshold for convergence
    # initialize k random centroids
    centers = [Vector([random.random() for j in range(D)]) for i in range(K)]
    # read the data points from file and parse them to vectors
    points = dpark.textFile('kmeans_data.txt').map(parseVector).cache()

    # iteration
    for it in range(IT):
        print 'iteration', it
        # assign each point to the closest centroid
        # return (centroid index, (point, count = 1))
        mappedPoints = points.map(lambda p: (closestCenter(p, centers), (p, 1)))
        # calculate the new center of the points within the cluster
        # reduceByKey() returns (centroid index, (sum of points, sum of counts)
        # then map the output to (centroid index, sum of points/sum of counts)
        # sum of points/sum of counts is the new center of the cluster
        ncenters = mappedPoints.reduceByKey(
                lambda (s1,c1),(s2,c2): (s1+s2,c1+c2)
            ).map(
                lambda (id, (sum, count)): (id, sum/count)
            ).collectAsMap()

        updated = False
        # update the new center
        for i in ncenters:
            # if the distance between the center and the new center is greater
            # than MIN_DIST, update the center to new center
            if centers[i].dist(ncenters[i]) > MIN_DIST:
                centers[i] = ncenters[i]
                updated = True
        # if there is no update, then all points are clustered
        # break the loop
        if not updated:
            break
        print centers

    print 'final', centers




parseVector()

The method first split the line (from the file) by white space then returns a Vector object.

def parseVector(line):
    return Vector(map(float, line.strip().split(' ')))

The Vector class can be found here.

Note: map(function, iterable) from Python
maps every item in an iterable based on the provided function and return a new iterator. In the above function, the splitted line is mapped to float number.



closestCenter()

assign the point to the closest centroid based on the Euclidean distance.

def closestCenter(p, centers):
    bestDist = p.squaredDist(centers[0])
    bestIndex = 0
    for i in range(1, len(centers)):
        d = p.squaredDist(centers[i])
        if d < bestDist:
            bestDist = d
            bestIndex = i
    return bestIndex

squaredDist() is in the Vector class.

Note: zip() from Python
Make an iterator that aggregates each element from each of the iterable. For example, in the squaredDist() method:

 return sum((a-b)*(a-b) for a,b in zip(self.data, o.data))

self.data and o.data are coordinates of points, say (1.0, 2.0, 3.0) and (4.0, 5.0, 6.0), zip function returns an object of ((1.0, 4.0), (2.0, 5.0), (3.0, 6.0))


Wednesday, May 27, 2015

Mysql note

Not all commands need to be in the same line. For example:


mysql> select 
    -> user();
+----------------+
| user()         |
+----------------+
| root@localhost |
+----------------+
1 row in set (0.00 sec)


If you decide not to execute the command in the process of entering it, type \c to cancel it:

mysql> select 
    -> user()
    -> ,now()
    -> \c
mysql> 


The table that summarizes the meaning of each prompt:

PromptMeaning
mysql>Ready for new command.
->Waiting for next line of multiple-line command.
'>Waiting for next line, waiting for completion of a string that began with a single quote (').
">Waiting for next line, waiting for completion of a string that began with a double quote (").
`>Waiting for next line, waiting for completion of an identifier that began with a backtick (`).
/*>Waiting for next line, waiting for completion of a comment that began with /*.


Create a table:

mysql> create table pet (name varchar(20), owner varchar(20),
    -> species varchar(20), sex char(1),birth DATE, death DATE);
Query OK, 0 rows affected (0.02 sec)


Use describe to see the information of the table:



describe pet;
+---------+-------------+------+-----+---------+-------+
| Field   | Type        | Null | Key | Default | Extra |
+---------+-------------+------+-----+---------+-------+
| name    | varchar(20) | YES  |     | NULL    |       |
| owner   | varchar(20) | YES  |     | NULL    |       |
| species | varchar(20) | YES  |     | NULL    |       |
| sex     | char(1)     | YES  |     | NULL    |       |
| birth   | date        | YES  |     | NULL    |       |
| death   | date        | YES  |     | NULL    |       |
+---------+-------------+------+-----+---------+-------+
6 rows in set (0.01 sec)


Load data to table:


mysql> load data local infile '/path/pet.txt' into table pet;
Query OK, 8 rows affected, 9 warnings (0.00 sec)
Records: 8  Deleted: 0  Skipped: 0  Warnings: 9

mysql> select * from pet;
+----------+--------+---------+------+------------+------------+
| name     | owner  | species | sex  | birth      | death      |
+----------+--------+---------+------+------------+------------+
| Fluffy   | Harold | cat     | f    | 1993-02-04 | NULL       |
| Claws    | Gwen   | cat     | m    | 1994-03-17 | 0000-00-00 |
| Buffy    | Harold | dog     | f    | 1989-05-13 | NULL       |
| Fang     | Benny  | dog     | m    | 1990-08-27 | NULL       |
| Bowser   | Diane  | dog     | m    | 1979-08-31 | 1995-07-29 |
| Chirpy   | Gwen   | bird    | f    | 1998-09-11 | 0000-00-00 |
| Whistler | Gwen   | bird    | N    | 1997-12-09 | 0000-00-00 |
| Slim     | Benny  | snake   | m    | 1996-04-29 | NULL       |
+----------+--------+---------+------+------------+------------+
8 rows in set (0.00 sec) 


Insert values:


mysql> insert into pet 
    -> values ('Jiemee', 'Shirley','cat','f', '2000-01-20', '2014-10-28');
Query OK, 1 row affected (0.01 sec)

mysql> select * from pet where name = 'Jiemee';
+--------+---------+---------+------+------------+------------+
| name   | owner   | species | sex  | birth      | death      |
+--------+---------+---------+------+------------+------------+
| Jiemee | Shirley | cat     | f    | 2000-01-20 | 2014-10-28 |
+--------+---------+---------+------+------------+------------+
1 row in set (0.00 sec)



Monday, May 25, 2015

All the Tree Traversal you need (not recursively!) Too

I got this idea of going through Leetcode again to learn Python in a more fun way. The first thing is basic data structures, and we definitely cannot forget binary tree.

Here are all the tree traversal solutions. All solutions are accepted on Leetcode. I only include the recursive version for the preorder traversal, since all others are similar and trivial.

If you need the Java version, see here.

Enjoy. :)


Preorder

The recursive way has two versions, the first one uses two functions while the second one use a nested recursive function inside the global one. See here for my post about the nested function in Python.

def preOrder(root):
    """
    iterative version
    :param root:
    :return:
    """
    rst = []
    if not root:
        return rst
    stack = [root]
    while stack:
        curr = stack.pop()
        if curr.right is not None:
            stack.append(curr.right)
        if curr.left is not None:
            stack.append(curr.left)
        rst.append(curr.val)
    return rst

def preOrderRec(root):
    """
    recursive version
    :param root:
    :return:
    """
    rst = []
    preorderHelper(root, rst)
    return rst

def preorderHelper(root, rst):
    if root is not None:
        rst.append(root.val)
        preorderHelper(root.left, rst)
        preorderHelper(root.right, rst)


def preOrderRec2(root):
    """
    nested inner recursive function
    :param root:
    :return:
    """
    rst = []

    def recursion(root):
        if root:
            rst.append(root.val)
            recursion(root.left)
            recursion(root.right)
    recursion(root)
    return rst


Inorder:


def inOrder(root):
    """
    in order traversal, iterative
    use a stack
    :param root:
    :return:
    """
    rst = []
    if not root:
        return rst
    stack = []
    while root or stack:
        if root:
            stack.append(root)
            root = root.left
        else:
            root = stack.pop()
            rst.append(root.val)
            root = root.right
    return rst



Postorder:
This one is quite different from the Java version. Since Python does not have any peek() method for the list class, I cannot use the way I used in the Java version (Actually, there is a way: use a deque, popleft() first and then appendleft(), but I don't think it's quite efficient).
In Python, I used two stacks. Basically, it looks like this:

  1. append the root to the first stack
  2. pop the root out the first stack and append it to the second stack
  3. append the left child followed by the right child of the root (if any exists)
  4. repeat 2 and 3 until the first stack is empty
  5. pop the nodes in the second stack and append it to the result list

Below the Python version is the Java version using the conventional method, you can compare the two method. :)


def postOrder(root):
    """
    iterative way, using two stacks
    :param root:
    :return:
    """
    rst = []
    if not root:
        return rst
    stack1 = [root]
    stack2 = []
    while stack1:
        curr = stack1.pop()
        if curr.left is not None:
            stack1.append(curr.left)
        if curr.right is not None:
            stack1.append(curr.right)
        stack2.append(curr)
    while stack2:
        rst.append(stack2.pop().val)
    return rst


Java version:

public ArrayList postorderTraversal(TreeNode root) {
        ArrayList rst = new ArrayList();
        if (root == null)
            return rst;
        Stack stack = new Stack();
        TreeNode curr = null;
        TreeNode prev = null;
        stack.push(root);
        while (!stack.empty())
        {
            curr = stack.peek();
            //prev == null, root
            //traverse down
            if (prev == null || prev.left == curr || prev.right == curr)
            {
                if (curr.left != null)
                    stack.push(curr.left);
                //need to traverse to the bottom of the tree first before goes to the right side
                else if (curr.right != null)
                    stack.push(curr.right);
            }
            // here comes the right side
            else if (curr.left == prev)
            {
                if (curr.right != null)
                    stack.push(curr.right);
            }
            //finally we can add nodes
            else
            {
                rst.add(curr.val);
                stack.pop();
            }
            prev = curr;
        }
        return rst;
    }


Level order:


from collections import deque

def levelOrder(root):
    """
    level order traversal
    :param root:
    :return:
    """
    rst = []
    if not root:
        return rst
    q = deque([root])
    while q:
        size = len(q)
        level = []
        for i in range(size):
            curr = q.popleft()
            if curr.left is not None:
                q.append(curr.left)
            if curr.right is not None:
                q.append(curr.right)
            level.append(curr.val)
        rst.append(level)
    return rst



Zigzag level order:

def zigzag(root):
    rst = []
    if not root:
        return rst
    thisLevel = [root]
    reverse = True
    while thisLevel:
        nextLevel = []
        level = []
        while thisLevel:
            curr = thisLevel.pop()
            level.append(curr.val)
            if reverse:
                if curr.left is not None:
                    nextLevel.append(curr.left)
                if curr.right is not None:
                    nextLevel.append(curr.right)
            else:
                if curr.right is not None:
                    nextLevel.append(curr.right)
                if curr.left is not None:
                    nextLevel.append(curr.left)

        thisLevel = nextLevel
        reverse = not reverse
        rst.append(level)
    return rst


Nested function in Python

I was revisiting tree traversal using Python and I realized that Python has this interesting technique, write a function inside another function:


def preOrderRec2(root):
    rst = []
    
    def recursion(root):
        if root:
            rst.append(root.val)
            recursion(root.left)
            recursion(root.right)
    recursion(root)
    return rst


The above example is a tree preorder traversal, inside the global function is the recursive function we use. The advantage of using the inner function is to avoid global exposure of the inner function, since, in the above example,  recursion() will only be called by preOrderRec2(). Some other advantages can be found here.

Unfortunately Java does not allow you to use such great trick. In Java, we will have to write in this way:


public class Solution {
    public List preorderTraversal(TreeNode root) {
        List rst = new ArrayList();
        preOrder(root, rst);
        return rst;
    }
    public void preOrder(TreeNode root, List rst){
        if (root == null)
            return;
        rst.add(root.val);
        preOrder(root.left, rst);
        preOrder(root.right, rst);
    }
}


Friday, May 22, 2015

Singly linked list using Python, iterator in a Python object

The singly linked list is nothing fancy, basically it is a list of nodes appended one after another. Each node has a pointer points to its successor.

class _ListNode:

    def __init__(self, data):
        self.data = data
        self.next = None


Since any node can only be accessed by its predecessor, traversing the list to search for a node takes O(n) time. If we keep tracking the current tail of the node, adding a new node only requires us to append the new node to the tail of the list, thus takes O(1) time. Deleting a node, on the other hand, requires traversing the list to find the node to delete, thus takes O(n) time.

Here is the source code.

class _ListNode:

    def __init__(self, data):
        self.data = data
        self.next = None

class LinkedList:
    def __init__(self):
        self._head = _ListNode(-1) #sentinel node
        self._curr = self._head
        self._size = 0

    def __len__(self):
        return self._size

    def __add__(self, data):
        """
        append a new node
        """
        self._curr.next = _ListNode(data)
        self._curr = self._curr.next
        self._size += 1

    def __contains__(self, item):
        """
        check if a given item is in the list
        :param item:
        :return:
        """
        curNode = self._head.next
        while curNode is not None and curNode.data != item:
            curNode = curNode.next
        return curNode is not None

    def __delattr__(self, item):
        """
        delete a node if exists in the list
        :param item:
        :return:
        """
        curNode = self._head.next
        while curNode is not None \
                and curNode.next is not None \
                and curNode.next.data != item:
            curNode = curNode.next
        if curNode.next is None:
            print('No such item in the linked list!')
            return False
        else:
            curNode.next = curNode.next.next
            self._size -= 1
            return True

    def __iter__(self):
        curNode = self._head.next
        assert curNode is not None, 'Empty list!'
        while curNode is not None:
            data = curNode.data
            #print(curNode.data)
            curNode = curNode.next
            yield data

One more thing I learned today about Python is the iterator of an object. When we write a data structure using Python, we sometimes write __iter__() function. This function will return an iterator of the object. So next time when you write:

for item in object:

The compiler will call __iter__() and return the items in the object based on your __iter__() function.
Another useful places for this function is that you can initialize a list. For example, if I want to create a list based on the items in the above linked list, I can simply do:

tolist = list(linkedlist)


Thursday, May 21, 2015

Data Structures in Python compared with Java: HashMap, collision resolution

Here comes my favorite data structure in Java, the HashMap. In Python, a similar implementation is called Dictionary. Both uses hash table implementation. I don't want to go through the concept of hash table, but feel free to Wiki it.

The main difference between Java's HashMap and Python's dictionary is the collision resolution. In short, when you insert a key-value pair into the table, a hash function will "transfer" the key to a hash value which will then be mod by the length of the table and get the position of the table to insert this key-value pair. However, in some cases, especially when your hash function is NOT A GOOD ONE, multiple keys will be hashed to the same spot. So the question is, how to solve this problem?

First of all, is to write a good hash function. Now consider you are not a super genius mathematician, you want to find other ways to make up for it. And here comes the collision resolution.

Java
In Java, the HashMap implementation uses what is called separate chaining. The hash table itself is an array of linked list. When multiple keys are hashed to the same slot, they are appended to their predecessor. See the code snippet from JDK 7 implementation.

public V put(K key, V value) {
        if (table == EMPTY_TABLE) {
            inflateTable(threshold);
        }
        if (key == null)
            return putForNullKey(value);
        int hash = hash(key);
        int i = indexFor(hash, table.length);
        for (Entry e = table[i]; e != null; e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }

        modCount++;
        addEntry(hash, key, value, i);
        return null;
    }


Now here comes the searching problem, what if all keys are hashed to the same slot? Then we will have the worst case, linear search time. However, it is shown that when the load factor is less than 2, the average searching time is O(1)[1]. In Java, the load factor is 0.75, so, not too bad.

Ok, definition:
load factor = number of key-value pairs in the table / capacity of the table

If you are interested where the above code comes from and how Java implements HashMap, see here.


Python
Python developers decided to use another method: double hashing. This is a probing method. When collision occurs, we need to probe and find another empty slot for the key-value pair. There are several different ways of probing, including linear probing, quadratic probing and double hashing. In double hashing, when collision occurs, the key is hashed by a second hashing function and the result is used as a constant factor to find another slot:

slot = (hash1(key) + i * hash2(key)) % length(table)

i indicates the i-th probe. For example, if 'a' is hashed to slot 1 and it is occupied, and hash2('a') = 37, then slot = (1 + 1 * 37) % length(table). Assuming this time it's 3, but 3 is also occupied, then i = 2, and we calculate another slot.

The advantage of double hashing is that multiple keys that are hashed to the same slot will not have the same probing sequence, thus reduces what is called "clustering" problem. In short, the chance of each slot get selected is more uniformly distributed across the table compare to linear and quadratic probing.

However, this means hash1 should not equal hash2. It is shown that when load factor is between 1/2 to 2/3, using double hash allows hashing operations to be on average O(1)[1].

I don't have the CPhython dict implementation here, but I have my own code. ;D


import math

UNUSED = None
DEFAULT_INITIAL_CAPACITY = 1 << 4  # initial size of the table


class HashMap:

    loadFactor = 2/3

    def __init__(self, capacity=DEFAULT_INITIAL_CAPACITY, load_factor=loadFactor):
        self._table = [None] * capacity
        self._size = 0
        self.loadFactor = load_factor
        self._threshold = math.ceil(len(self._table) * load_factor)

    def __len__(self):
        """
        return the size of the hashmap
        :return:
        """
        return self._size

    def __contains__(self, key):
        """
        if the hashmap contains a given key
        :param key:
        :return:True if contains the value
        """

        return self._findSlot(key, True) is not None

    def add(self, key, value):
        """
        add a key-value pair, if the key already exists, update the value
        :param key:
        :param value:
        :return:
        """
        if key in self:
            slot = self._findSlot(key, True)
            self._table[slot] = _MapEntry(key, value)
            return False
        else:
            slot = self._findSlot(key, False)
            self._table[slot] = _MapEntry(key, value)
            self._size += 1
            if self._size >= self._threshold:
                self._resize()
            return True

    def valueOf(self, key):
        """
        given a key, return the value if the key exists in the table
        :param key:
        :return:
        """
        slot = self._findSlot(key, True)
        assert slot is not None, "Map does not contain such key."
        return self._table[slot].value


    def remove(self, key):
        """
        given a key, remove the key-value pair if the key exists in the table
        :param key:
        :return:
        """
        slot = self._findSlot(key, False)
        assert slot is not None, "Map does not contain such key."
        entry = self._table[slot]
        self._table.remove(entry)
        return entry.value

    def __iter__(self):
        return self

    def _findSlot(self, key, exist):
        """
        Given a key, find a spot in the table
        :param key:
        :param exist:boolean value indicates if the given key is assumed
        to be in the table or not
        :return:if exist, return the slot where the key-value pair is stored if
        the key exists in the table, otherwise return None
        if not exist, return an unused slot
        """
        slot = self._hash1(key)
        step = self._hash2(key)

        M = len(self._table)
        while self._table[slot] is not UNUSED:
            if exist and \
                    (self._table[slot].key == key):
                return slot
            else:
                slot = (slot + step) % M
        if not exist:
            return slot

    def _resize(self):
        """
        resize the table when the size reaches threshold
        :return:
        """
        origTable = self._table
        newSize = len(self._table) * 2 + 1
        self._table = [None] * newSize

        self._size = 0
        self._threshold = math.ceil(len(self._table) * self.loadFactor)

        for entry in origTable:
            if entry is not UNUSED:
                slot = self._findSlot(entry.key, False)
                self._table[slot] = entry
                self._size += 1

    def _hash1(self, key):
        """
        main hash for mapping keys to the table
        :param key:
        :return:
        """
        return abs(hash(key)) % len(self._table)

    def _hash2(self, key):
        """
        second hash for double hashing probes
        :param key:
        :return:
        """
        return 1 + abs(hash(key)) % (len(self._table) - 2)


class _MapEntry:

    def __init__(self, key, value):
        self.key = key
        self.value = value



Reference:
[1] Rance D. Necaise, "Data Structures and Algorithms Using Python". Wiley, 2010.



*********************************************************************

"I'm no fool, no, I'm not a follower
I don't take things as they come, if they bring me down
Life can be cruel, if you're a dreamer
I just wanna have some fun, don't tell me what can't be done

You know you like it but it drives you insane
You know you like it but it drives you insane
You know you like it but you're scared of the shame
What you want, what you gonna do?
You know you like it but it drives you insane
Follow me 'cause you know that you wanna feel the same
You know you like it but it drives you insane
What you want, what you gonna do?"


Tuesday, May 19, 2015

Installing MySQL on Mac: guide for dummies

This tutorial provides instructions to install mysql using Mac mini vault's Mac - script. For dummies and newbies like me, it is extremely convenient.

Before you start, please read the README file provided by MMV.

1. Copy the following command:

bash <(curl -Ls http://git.io/eUx7rg)

This command will automatically MySQL and install it. It will ask if you want to load MySQL on boot. Since I am installing in it on my laptop, I didn't choose that.

2. After MySQL is installed, close the terminal and reopen it again. Now you are able to operate MySQL on terminal.

3. Log in
If you are using your computer as the host, enter the following command:

mysql -u root -p

Otherwise, use this command:

mysql -h host -u username -p

You probably will have to ask your admin for the hostname.
Enter the password in the file that is located on your Desktop with the name "MYSQL_PASSWORD". It was generated automatically when you were installing MySQL.
If you are successfully logged in, you should be able to see this:


NOTE: 
If you get this error message: 
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2)

That usually means your MySQL server is not started. Go to step 5, start your server and try again.


4. Changing password 
MMV's Mac script will automatically generate a random password. Most of time it's hard to remember. If you want to change the password to a personalized one, after you log on to the server, do this: 


SET PASSWORD FOR 'root'@'localhost' = PASSWORD('SHIRLEY_IS_AWESOME');

Put your personalized password within the ' ' unless you want to set 'SHIRLEY_IS_AWESOME' as your password. ;)

If you happen to forget your password, MMV has this command to help you reset the password.Don't forget to read the README file.

bash <(curl -Ls http://git.io/9xqEnQ)

5. Start and Stop sql server (if it is your computer)
If you have selected to load MySQL on boot, then MySQL server will automatically start every time you start your computer. I don't use it very often, so I didn't select such option. Thus, I need to manually turn the server on.
Go to System Preferences:



Find MySQL: 

Open it, select Start MySQL Server:

Enter your admin password. If you see this green "running" signal, you are ready to go!


There are other more professional ways to start and stop the server, see here


Friday, May 15, 2015

Matrix using Python


My second post regarding data structure using Python is about 2 dimension array and matrix.

For 2D array, naturally we can think of using nested lists to implement it, here is the class: 



class Array2d:
    def __init__(self, numRows, numCols):
        self._theRows = [None] * numRows
        for i in range (numRows):
            self._theRows[i] = [None] * numCols
    #return the number of rows
    def numRows(self):
        return len(self._theRows)

    #return number of columns
    def numCols(self):
        return len(self._theRows[0])

    def clear(self):
        for row in range(self.numRows()):
            row.clear()

    #initialize all 2d array to a value
    def initialize(self, value):
        for row in range(self.numRows()):
            for col in range(self.numCols()):
                self[row, col] = value

 
    def __getitem__(self, key):
        assert len(key) == 2, "Invalid number of coordinates"
        row = key[0]
        col = key[1]
        assert 0 <= row < self.numRows() \
            and 0 <= col < self.numCols(), \
            "Array subscript out of range."
        return self._theRows[row][col]

    def __setitem__(self, key, value):
        assert len(key) == 2, "Invalid number of coordinates"
        row = key[0]
        col = key[1]
        assert 0 <= row < self.numRows() \
            and 0 <= col < self.numCols(), \
            "Array subscript out of range."
        self._theRows[row][col] = value




The matrix class is implemented from the above Array2d class:


from src import Array2D

class Matrix:

    def __init__(self, numRows, numCols,initialValue):
        self._matrix = Array2D.Array2d(numRows, numCols)
        self._matrix.initialize(initialValue)

    def numRows(self):
        return self._matrix.numRows()

    def numCols(self):
        return self._matrix.numCols()

    def __getitem__(self, key):
        return self._matrix.__getitem__(key)

    def __setitem__(self, key, value):
        return self._matrix.__setitem__(key, value)

    def scaleBy(self,scalar):
        for row in range(self.numRows()):
            for col in range(self.numCols()):
                self[row, col] *= scalar

    def transpose(self):
        trans = Matrix(self.numCols(),self.numRows(), 1)
        for r in range(self.numCols()):
            for c in range(self.numRows()):
                trans[r, c] = self[c, r]
        return trans

    def __add__(self, matrixB):
        assert matrixB.numRows() == self.numRows() \
            and matrixB.numCols() == self.numCols(), \
            "Matrix sizes not compatible for the add operation"
        sumMatrix = Matrix(self.numRows(),self.numCols(),0)
        for r in range(self.numRows()):
            for c in range(self.numCols()):
                sumMatrix[r, c] = self[r, c] + matrixB[r, c]
        return sumMatrix

    def __sub__(self, matrixB):
        #indention after '\'
        assert matrixB.numRows() == self.numRows() \
            and matrixB.numCols() == self.numCols(), \
            "Matrix sizes not compatible for the subtraction operation"
        difference = Matrix(self.numRows(),self.numCols(),0)
        for r in range(self.numRows()):
            for c in range(self.numCols()):
                difference[r, c] = self[r, c] - matrixB[r, c]# = difference.__getitem__(r, c)
        return difference

    def __mul__(self, matrixB):
        assert matrixB.numRows() == self.numCols() \
            and matrixB.numCols() == self.numRows(), \
            "Matrix sizes not compatible for the subtraction operation"
        product = Matrix(self.numRows(), matrixB.numCols(), 0)
        for r in range(self.numRows()):
            for c in range(matrixB.numCols()):
                for n in range(self.numCols()):
                    product[r, c] += self[r, n] * matrixB[n, c]
        return product

    def print(self):
        for r in range(self.numRows()):
            for c in range(self.numCols()):
                print(self[r, c], end=' ')
            print()


This blog is more about the details of writing Python than the actual implementation of the two classes. Here are some of the notes:

1. Modules
Module is a familiar name in Python. Officially, it defines as:
A module is a file containing Python definitions and statements.a way to put definitions in a file and use them in a script or in an interactive instance of the interpreter. 
It would be easier to think it as just files you write your code.  There is no module in Java upon Java 8 ("In Java 9, "modules", a kind of collection of packages, are planned as part of Project jigsaw these were earlier called "superpackages" and originally planned for Java 7"), but Java uses "package" for similar purposes.

To create an object from a class written in a module, we have to call that class from the module. e.g.,

matrixA = Matrix.Matrix(rows, cols, 1)


This definitely is not a good naming strategy. I was following a tutorial in which they defined a class inside the module, I thought it was the same as in Java, apparently I was not right. However, there is an easier way, which is just writing functions inside the file without defining "class Matrix", I will remember next time. :)

2. __setitem__(self, key, value), __getitem__(self, key) and methods to emulating container types
Theres two methods are part of the methods that can be used to emulate containers (e.g., dictionary, list, etc. ). __getitem__ () can be called by self[key]. 
One easy way to set a value is to call:

self[key] = value

3. Explicit and implicit line joining
It is better for the length of a line of code to be within 80 characters, so when a line is too long, we need to separate it to two lines. In Python, two or more physical lines may be joined into logical lines using backslash ('\'). Note a line ending in a backslash cannot carry a comment. Indention is preferred after the first line.


assert matrixB.numRows() == self.numRows() \
            and matrixB.numCols() == self.numCols(), \
            "Matrix sizes not compatible for the subtraction operation"


Implicit line joining is used in parentheses, square brackets or curly braces and can carry comments in the end of the line.
Java does not require explicit line joining.

4. Comparison
 This is actually pretty cool. Finally we can write comparison the way we write on paper:

0 <= row < self.numRows()


Apparently Java does not allow us to write it in such way.


The tester code:

from src import Matrix

def main():
    if __name__ == "__main__":
        rows = 3
        cols = 2
        matrixA = Matrix.Matrix(rows, cols, 1)
        matrixB = Matrix.Matrix(cols, rows, 1)
        matrixC = Matrix.Matrix(rows, cols, 1)
        matrixA[0, 1] = 2
        matrixA[2, 1] = 3
        matrixB[1, 1] = 3
        matrixB[0, 2] = 5
        matrixC[0, 1] = 3
        matrixC[0, 1] = 1

        print("Matrix A", end='\n')
        matrixA.print()

        print("\nMatrix B", end='\n')
        matrixB.print()

        print("\nMatrix C", end='\n')
        matrixC.print()

        sumM = matrixA.__add__(matrixC)
        print("\nSum of A and C", end='\n')
        sumM.print()

        difM = matrixA.__sub__(matrixC)
        print("\nDifference between A and C", end='\n')
        difM.print()

        prodM = matrixA.__mul__(matrixB)
        print("\nProduct of A and B", end='\n')
        prodM.print()

        print("\nTranspose of A", end='\n')
        matrixA.transpose().print()


main()

More
I was going through the fancy Numpy's implementation of matrix, then I found the build in function __new__(). I thought that __init__() serves as the constructor role, apparently I was only partially right: they do together. __new__() is responsible of creating a new instance, it can return an instance of the class, while __init__() takes charge of customizing it. If __new__() returns an instance of the class, then the new instance's __init__() method will be invoked. Since __new__() and __init__() work together to construct new objects,  NO non - None value may be returned by __init__(). 

More2
I wanted to check Numpy's matrix dot product implementation, since my brutal force one requires O(n^3), apparently it turns out to be a more complicated subject. See here for the Stackoverflow discussion and here for the Numpy implementation. I am going to stop here since the ALTAS is apparently beyond my concern. :)


Tuesday, May 12, 2015

XML syntax

I have always been annoyed by the imperfections on webpages, e.g., weird characters. One of the websites I used a lot have the following defects:
while the other one doesn't:
I am always interested in figuring out what have caused this problem.

Last night, I discovered a new feature on my browser: view source (forgive my ignorance). So naturally I looked the source code of both websites. Apparently I didn't figure out the reason is because of my lack of knowledge on the XML:

XML has a feature called entity reference: in order to avoid errors generated from parsing special characters in XML, replace such characters with entity reference: e.g., '<' -> '&lt'
There are 5 pre-defined entity references:
source: http://www.w3schools.com/xml/xml_syntax.asp
So the answer for the previous question is: the first website parses the entity reference instead of the character itself. I believe a simple code should fix this problem, which I hope they will do it.

I don't have much knowledge on the XML, but here is some useful resources:

W3school
W3C

I probably will dive a little bit on the topic, e.g., the namespace: http://www.w3.org/1999/xhtml/

P.S. All copyright reserved to the websites which I got the snapshots from. Nevertheless, they are great websites. :)



Tuesday, May 5, 2015

Map ADT using Python

I start to go through Data Structure and Algorithm Using Python to get familiar on how to write classes using Python. The first thing is my favorite map. This structure is not the fancy hashmap, it's the simplest key - value pair structure. This map is a list implementation.
Basic structure:

a private mapEntry class which is used to store the key - value pair.

Note:
private classes, methods are identified by an underscore, e.g., _add(self, value).

add method to add key - value pairs
valueOf method to find the value of a given key, assume the key exists
remove method to remove the key - value pair given the key, assume the key exists
_iter_ method that returns an iterator of the key value pairs
a helper method findPosition that will find the position given the key, return None if no such key


__author__ = 'shirleyyoung'

class Map:
    # creates an empty map instance
    #like constructor, self: this object
    def __init__(self):
        #create empty list
        self._entryList=list()

    #when call map.len(), it will call __len__(map) and get the length of the object
    def __len__(self):
        return len(self._entryList)

    #determine if the map contains the given key
    def __contains__(self, key):
        ndx = self._findPosition(key)
        return ndx is not None

    #adds a new entry to the map if the key does not exist
    #otherwise replace the old value with the new value
    def add(self, key, value):
        ndx = self._findPosition(key)
        if ndx is not None:
            self._entryList[ndx].value = value
            return False
        else:
            entry = _MapEntry(key,value)
            self._entryList.append(entry)
            return True

    def valueOf(self, key):
        ndx = self._findPosition(key)
        assert ndx is not None, "Invalid map key"
        return self._entryList[ndx].value

    def remove(self, key):
        ndx = self._fndPosition(key)
        assert ndx is not None, "Invalid map key."
        self._entryList.pop(ndx)

    #return an iterator for traversing the keys in the map
    def __iter__(self):
        return iter(self._entryList)

    #find the index position of a category
    #if key is not found, none is returned
    def _findPosition(self, key):
        for i in range(len(self)):
            if self._entryList[i].key == key:
                return i
        return None

#private class
class _MapEntry:
    def __init__(self, key, value):
        self.key = key
        self.value = value

map = Map()
map.add(1, 1)
map.add(2, 2)
map.add(3, 3)
for entry in iter(map):
    print(entry.key)
print(len(map))
print(map.valueOf(1))

If you are interested in how Python implements dictionary, click here.

*********************************************************************************
I become interested in viewing source code since the time I was looking for jobs. And besides the one I mentioned in my previous couple blogs, I found two more interesting website.

GrepCode: this one provides Java src as well as interesting projects written in Java, e.g., Hadoop, Android.
searchcode: this one is neater, and provides code/projects written in various languages from C++ to Python. They also provides code from multiple sources, e.g., Github, Google Code, BitBucket (I didn't know this one before), etc. Interestingly, they also have an API.
Codatlas: this delicate website is the one I referred in couple of my previous posts. They claim to be IDE based, which is pretty impressive. One interesting feature is that they allow you to upload your own code. However, most of their projects are from Github, it could be better if they provide more projects from more sources.