The main difference between Java's HashMap and Python's dictionary is the collision resolution. In short, when you insert a key-value pair into the table, a hash function will "transfer" the key to a hash value which will then be mod by the length of the table and get the position of the table to insert this key-value pair. However, in some cases, especially when your hash function is NOT A GOOD ONE, multiple keys will be hashed to the same spot. So the question is, how to solve this problem?
First of all, is to write a good hash function. Now consider you are not a super genius mathematician, you want to find other ways to make up for it. And here comes the collision resolution.
Java
In Java, the HashMap implementation uses what is called separate chaining. The hash table itself is an array of linked list. When multiple keys are hashed to the same slot, they are appended to their predecessor. See the code snippet from JDK 7 implementation.
public V put(K key, V value) { if (table == EMPTY_TABLE) { inflateTable(threshold); } if (key == null) return putForNullKey(value); int hash = hash(key); int i = indexFor(hash, table.length); for (Entrye = table[i]; e != null; e = e.next) { Object k; if (e.hash == hash && ((k = e.key) == key || key.equals(k))) { V oldValue = e.value; e.value = value; e.recordAccess(this); return oldValue; } } modCount++; addEntry(hash, key, value, i); return null; }
Now here comes the searching problem, what if all keys are hashed to the same slot? Then we will have the worst case, linear search time. However, it is shown that when the load factor is less than 2, the average searching time is O(1)[1]. In Java, the load factor is 0.75, so, not too bad.
Ok, definition:
load factor = number of key-value pairs in the table / capacity of the table
If you are interested where the above code comes from and how Java implements HashMap, see here.
Python
Python developers decided to use another method: double hashing. This is a probing method. When collision occurs, we need to probe and find another empty slot for the key-value pair. There are several different ways of probing, including linear probing, quadratic probing and double hashing. In double hashing, when collision occurs, the key is hashed by a second hashing function and the result is used as a constant factor to find another slot:
slot = (hash1(key) + i * hash2(key)) % length(table)
i indicates the i-th probe. For example, if 'a' is hashed to slot 1 and it is occupied, and hash2('a') = 37, then slot = (1 + 1 * 37) % length(table). Assuming this time it's 3, but 3 is also occupied, then i = 2, and we calculate another slot.
The advantage of double hashing is that multiple keys that are hashed to the same slot will not have the same probing sequence, thus reduces what is called "clustering" problem. In short, the chance of each slot get selected is more uniformly distributed across the table compare to linear and quadratic probing.
However, this means hash1 should not equal hash2. It is shown that when load factor is between 1/2 to 2/3, using double hash allows hashing operations to be on average O(1)[1].
I don't have the CPhython dict implementation here, but I have my own code. ;D
import math UNUSED = None DEFAULT_INITIAL_CAPACITY = 1 << 4 # initial size of the table class HashMap: loadFactor = 2/3 def __init__(self, capacity=DEFAULT_INITIAL_CAPACITY, load_factor=loadFactor): self._table = [None] * capacity self._size = 0 self.loadFactor = load_factor self._threshold = math.ceil(len(self._table) * load_factor) def __len__(self): """ return the size of the hashmap :return: """ return self._size def __contains__(self, key): """ if the hashmap contains a given key :param key: :return:True if contains the value """ return self._findSlot(key, True) is not None def add(self, key, value): """ add a key-value pair, if the key already exists, update the value :param key: :param value: :return: """ if key in self: slot = self._findSlot(key, True) self._table[slot] = _MapEntry(key, value) return False else: slot = self._findSlot(key, False) self._table[slot] = _MapEntry(key, value) self._size += 1 if self._size >= self._threshold: self._resize() return True def valueOf(self, key): """ given a key, return the value if the key exists in the table :param key: :return: """ slot = self._findSlot(key, True) assert slot is not None, "Map does not contain such key." return self._table[slot].value def remove(self, key): """ given a key, remove the key-value pair if the key exists in the table :param key: :return: """ slot = self._findSlot(key, False) assert slot is not None, "Map does not contain such key." entry = self._table[slot] self._table.remove(entry) return entry.value def __iter__(self): return self def _findSlot(self, key, exist): """ Given a key, find a spot in the table :param key: :param exist:boolean value indicates if the given key is assumed to be in the table or not :return:if exist, return the slot where the key-value pair is stored if the key exists in the table, otherwise return None if not exist, return an unused slot """ slot = self._hash1(key) step = self._hash2(key) M = len(self._table) while self._table[slot] is not UNUSED: if exist and \ (self._table[slot].key == key): return slot else: slot = (slot + step) % M if not exist: return slot def _resize(self): """ resize the table when the size reaches threshold :return: """ origTable = self._table newSize = len(self._table) * 2 + 1 self._table = [None] * newSize self._size = 0 self._threshold = math.ceil(len(self._table) * self.loadFactor) for entry in origTable: if entry is not UNUSED: slot = self._findSlot(entry.key, False) self._table[slot] = entry self._size += 1 def _hash1(self, key): """ main hash for mapping keys to the table :param key: :return: """ return abs(hash(key)) % len(self._table) def _hash2(self, key): """ second hash for double hashing probes :param key: :return: """ return 1 + abs(hash(key)) % (len(self._table) - 2) class _MapEntry: def __init__(self, key, value): self.key = key self.value = value
Reference:
[1] Rance D. Necaise, "Data Structures and Algorithms Using Python". Wiley, 2010.
*********************************************************************
"I'm no fool, no, I'm not a follower
I don't take things as they come, if they bring me down
Life can be cruel, if you're a dreamer
I just wanna have some fun, don't tell me what can't be done
You know you like it but it drives you insane
You know you like it but it drives you insane
You know you like it but you're scared of the shame
What you want, what you gonna do?
You know you like it but it drives you insane
Follow me 'cause you know that you wanna feel the same
You know you like it but it drives you insane
What you want, what you gonna do?"
The development of artificial intelligence (AI) has propelled more programming architects, information scientists, and different experts to investigate the plausibility of a vocation in machine learning. Notwithstanding, a few newcomers will in general spotlight a lot on hypothesis and insufficient on commonsense application. IEEE final year projects on machine learning In case you will succeed, you have to begin building machine learning projects in the near future.
ReplyDeleteProjects assist you with improving your applied ML skills rapidly while allowing you to investigate an intriguing point. Furthermore, you can include projects into your portfolio, making it simpler to get a vocation, discover cool profession openings, and Final Year Project Centers in Chennai even arrange a more significant compensation.
Data analytics is the study of dissecting crude data so as to make decisions about that data. Data analytics advances and procedures are generally utilized in business ventures to empower associations to settle on progressively Python Training in Chennai educated business choices. In the present worldwide commercial center, it isn't sufficient to assemble data and do the math; you should realize how to apply that data to genuine situations such that will affect conduct. In the program you will initially gain proficiency with the specialized skills, including R and Python dialects most usually utilized in data analytics programming and usage; Python Training in Chennai at that point center around the commonsense application, in view of genuine business issues in a scope of industry segments, for example, wellbeing, promoting and account.