Thursday, May 21, 2015

Data Structures in Python compared with Java: HashMap, collision resolution

Here comes my favorite data structure in Java, the HashMap. In Python, a similar implementation is called Dictionary. Both uses hash table implementation. I don't want to go through the concept of hash table, but feel free to Wiki it.

The main difference between Java's HashMap and Python's dictionary is the collision resolution. In short, when you insert a key-value pair into the table, a hash function will "transfer" the key to a hash value which will then be mod by the length of the table and get the position of the table to insert this key-value pair. However, in some cases, especially when your hash function is NOT A GOOD ONE, multiple keys will be hashed to the same spot. So the question is, how to solve this problem?

First of all, is to write a good hash function. Now consider you are not a super genius mathematician, you want to find other ways to make up for it. And here comes the collision resolution.

Java
In Java, the HashMap implementation uses what is called separate chaining. The hash table itself is an array of linked list. When multiple keys are hashed to the same slot, they are appended to their predecessor. See the code snippet from JDK 7 implementation.

public V put(K key, V value) {
        if (table == EMPTY_TABLE) {
            inflateTable(threshold);
        }
        if (key == null)
            return putForNullKey(value);
        int hash = hash(key);
        int i = indexFor(hash, table.length);
        for (Entry e = table[i]; e != null; e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }

        modCount++;
        addEntry(hash, key, value, i);
        return null;
    }


Now here comes the searching problem, what if all keys are hashed to the same slot? Then we will have the worst case, linear search time. However, it is shown that when the load factor is less than 2, the average searching time is O(1)[1]. In Java, the load factor is 0.75, so, not too bad.

Ok, definition:
load factor = number of key-value pairs in the table / capacity of the table

If you are interested where the above code comes from and how Java implements HashMap, see here.


Python
Python developers decided to use another method: double hashing. This is a probing method. When collision occurs, we need to probe and find another empty slot for the key-value pair. There are several different ways of probing, including linear probing, quadratic probing and double hashing. In double hashing, when collision occurs, the key is hashed by a second hashing function and the result is used as a constant factor to find another slot:

slot = (hash1(key) + i * hash2(key)) % length(table)

i indicates the i-th probe. For example, if 'a' is hashed to slot 1 and it is occupied, and hash2('a') = 37, then slot = (1 + 1 * 37) % length(table). Assuming this time it's 3, but 3 is also occupied, then i = 2, and we calculate another slot.

The advantage of double hashing is that multiple keys that are hashed to the same slot will not have the same probing sequence, thus reduces what is called "clustering" problem. In short, the chance of each slot get selected is more uniformly distributed across the table compare to linear and quadratic probing.

However, this means hash1 should not equal hash2. It is shown that when load factor is between 1/2 to 2/3, using double hash allows hashing operations to be on average O(1)[1].

I don't have the CPhython dict implementation here, but I have my own code. ;D


import math

UNUSED = None
DEFAULT_INITIAL_CAPACITY = 1 << 4  # initial size of the table


class HashMap:

    loadFactor = 2/3

    def __init__(self, capacity=DEFAULT_INITIAL_CAPACITY, load_factor=loadFactor):
        self._table = [None] * capacity
        self._size = 0
        self.loadFactor = load_factor
        self._threshold = math.ceil(len(self._table) * load_factor)

    def __len__(self):
        """
        return the size of the hashmap
        :return:
        """
        return self._size

    def __contains__(self, key):
        """
        if the hashmap contains a given key
        :param key:
        :return:True if contains the value
        """

        return self._findSlot(key, True) is not None

    def add(self, key, value):
        """
        add a key-value pair, if the key already exists, update the value
        :param key:
        :param value:
        :return:
        """
        if key in self:
            slot = self._findSlot(key, True)
            self._table[slot] = _MapEntry(key, value)
            return False
        else:
            slot = self._findSlot(key, False)
            self._table[slot] = _MapEntry(key, value)
            self._size += 1
            if self._size >= self._threshold:
                self._resize()
            return True

    def valueOf(self, key):
        """
        given a key, return the value if the key exists in the table
        :param key:
        :return:
        """
        slot = self._findSlot(key, True)
        assert slot is not None, "Map does not contain such key."
        return self._table[slot].value


    def remove(self, key):
        """
        given a key, remove the key-value pair if the key exists in the table
        :param key:
        :return:
        """
        slot = self._findSlot(key, False)
        assert slot is not None, "Map does not contain such key."
        entry = self._table[slot]
        self._table.remove(entry)
        return entry.value

    def __iter__(self):
        return self

    def _findSlot(self, key, exist):
        """
        Given a key, find a spot in the table
        :param key:
        :param exist:boolean value indicates if the given key is assumed
        to be in the table or not
        :return:if exist, return the slot where the key-value pair is stored if
        the key exists in the table, otherwise return None
        if not exist, return an unused slot
        """
        slot = self._hash1(key)
        step = self._hash2(key)

        M = len(self._table)
        while self._table[slot] is not UNUSED:
            if exist and \
                    (self._table[slot].key == key):
                return slot
            else:
                slot = (slot + step) % M
        if not exist:
            return slot

    def _resize(self):
        """
        resize the table when the size reaches threshold
        :return:
        """
        origTable = self._table
        newSize = len(self._table) * 2 + 1
        self._table = [None] * newSize

        self._size = 0
        self._threshold = math.ceil(len(self._table) * self.loadFactor)

        for entry in origTable:
            if entry is not UNUSED:
                slot = self._findSlot(entry.key, False)
                self._table[slot] = entry
                self._size += 1

    def _hash1(self, key):
        """
        main hash for mapping keys to the table
        :param key:
        :return:
        """
        return abs(hash(key)) % len(self._table)

    def _hash2(self, key):
        """
        second hash for double hashing probes
        :param key:
        :return:
        """
        return 1 + abs(hash(key)) % (len(self._table) - 2)


class _MapEntry:

    def __init__(self, key, value):
        self.key = key
        self.value = value



Reference:
[1] Rance D. Necaise, "Data Structures and Algorithms Using Python". Wiley, 2010.



*********************************************************************

"I'm no fool, no, I'm not a follower
I don't take things as they come, if they bring me down
Life can be cruel, if you're a dreamer
I just wanna have some fun, don't tell me what can't be done

You know you like it but it drives you insane
You know you like it but it drives you insane
You know you like it but you're scared of the shame
What you want, what you gonna do?
You know you like it but it drives you insane
Follow me 'cause you know that you wanna feel the same
You know you like it but it drives you insane
What you want, what you gonna do?"


1 comment:

  1. The development of artificial intelligence (AI) has propelled more programming architects, information scientists, and different experts to investigate the plausibility of a vocation in machine learning. Notwithstanding, a few newcomers will in general spotlight a lot on hypothesis and insufficient on commonsense application. IEEE final year projects on machine learning In case you will succeed, you have to begin building machine learning projects in the near future.

    Projects assist you with improving your applied ML skills rapidly while allowing you to investigate an intriguing point. Furthermore, you can include projects into your portfolio, making it simpler to get a vocation, discover cool profession openings, and Final Year Project Centers in Chennai even arrange a more significant compensation.


    Data analytics is the study of dissecting crude data so as to make decisions about that data. Data analytics advances and procedures are generally utilized in business ventures to empower associations to settle on progressively Python Training in Chennai educated business choices. In the present worldwide commercial center, it isn't sufficient to assemble data and do the math; you should realize how to apply that data to genuine situations such that will affect conduct. In the program you will initially gain proficiency with the specialized skills, including R and Python dialects most usually utilized in data analytics programming and usage; Python Training in Chennai at that point center around the commonsense application, in view of genuine business issues in a scope of industry segments, for example, wellbeing, promoting and account.

    ReplyDelete