Tuesday, March 24, 2015

Duplicate documents


You have a billion URLs, each pointing to a huge page. How do you detect duplicate documents?
A quick and easy way to find duplicates without comparing every pair of pages is to use a hash table. The key is the page content and the value is its URL. If a page's content already exists in the table, the page is a duplicate and can be discarded.
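As a rough single-machine sketch of this idea (the class and method names are my own, not from the original problem, and the page content is assumed to fit in memory):

```java
import java.util.HashMap;
import java.util.Map;

// Naive single-machine version: key the table on the page content itself.
public class NaiveDuplicateDetector {
    // page content -> URL of the first page seen with that content
    private final Map<String, String> firstSeen = new HashMap<>();

    /** Returns true if this page's content was already seen under some URL. */
    public boolean isDuplicate(String url, String pageContent) {
        // putIfAbsent returns the existing URL if the content is already a key
        return firstSeen.putIfAbsent(pageContent, url) != null;
    }
}
```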

Since each page is huge, the next step is to store a hash of the page instead of the page itself. Any reasonable hashing method works; if the hash value is a 4-byte integer, then a billion pages need about 4 GB of memory for the hashes, which a single machine can handle.
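A sketch of that refinement, using String.hashCode() as a stand-in for whatever 32-bit hash you prefer (a real system would re-check the actual content on hash collisions):

```java
import java.util.HashMap;
import java.util.Map;

// Store a 4-byte hash of the content instead of the content itself.
public class HashedDuplicateDetector {
    // 32-bit content hash -> URL of the first page with that hash
    private final Map<Integer, String> firstSeen = new HashMap<>();

    public boolean isDuplicate(String url, String pageContent) {
        int h = pageContent.hashCode(); // 4 bytes per page instead of the whole page
        return firstSeen.putIfAbsent(h, url) != null;
    }
}
```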

However, we cannot forget that each page is associated with a URL. A URL can contain up to around 2,000 characters (Internet Explorer allows 2,083, but let's keep it simple); in Java each character takes 2 bytes, so that is about 4 KB in the worst case. On average a URL probably does not exceed 100 characters, so roughly 200 bytes. With a billion of them, that is about 200 GB, which is too much to store on a single machine.
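The back-of-the-envelope numbers above can be checked in a few lines (the 100-character average is the assumption from this paragraph, not a measured figure):

```java
public class MemoryEstimate {
    public static void main(String[] args) {
        long urls = 1_000_000_000L;
        long hashBytes = urls * 4L;         // one 4-byte hash per page  -> ~4 GB
        long urlBytes  = urls * 100L * 2L;  // ~100 chars * 2 bytes/char -> ~200 GB
        System.out.println("hashes: ~" + hashBytes / 1_000_000_000L + " GB");
        System.out.println("URLs:   ~" + urlBytes  / 1_000_000_000L + " GB");
    }
}
```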

Now comes the distributed system. Assume we have n machines and each machine can hold x hash values, so we need n·x ≥ 1 billion (assuming no duplicates). Given a hash value v, v % n tells us which machine to store or look up the value on, since v % n always falls between 0 and n - 1, and v / n gives the position within that machine.
For example, if n = 10 and v = 13, we go to the machine with index 3 (indices start from 0) and store the value at position 1.
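A small sketch of that partitioning rule (assuming v is already a non-negative hash value; the class and method names are mine):

```java
public final class HashPartitioner {
    private final int n; // number of machines

    public HashPartitioner(int n) { this.n = n; }

    // v % n always falls in 0..n-1, so it picks the machine.
    public int machineIndex(long v) { return (int) (v % n); }

    // v / n picks the slot within that machine.
    public long slotOnMachine(long v) { return v / n; }

    public static void main(String[] args) {
        HashPartitioner p = new HashPartitioner(10);
        // v = 13, n = 10: machine 3, slot 1, matching the worked example above.
        System.out.println(p.machineIndex(13) + ", " + p.slotOnMachine(13));
    }
}
```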

Note that since we are using multiple machines, we will need to deal with network latency.

