Monday, March 23, 2015

Min Hash

My friend threw me a question, "How do you measure the similarity between two tweets?", and asked me to take a look at MinHash. I didn't connect the two together until I started to go over MinHash today. It is kind of fun, and considering NLP is one of my interests, I can't say it's not helpful.

There are quite a few highly mathematical explanations of MinHash. If you want to dig deeper into the math, go through the following links:

http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
http://www.cs.cmu.edu/~guyb/realworld/slidesS13/minhash.pdf
http://blog.cluster-text.com/tag/minhash/

However, I would highly recommend this blog (http://matthewcasperson.blogspot.com/2013/11/minhash-for-dummies.html), which uses rather plain language and is really easy to understand.

Measuring the similarity
When my friend first threw me that question, I told him: well, consider the two tweets as two vectors of words with dimension d in the vector space, and measure the distance between them. This is not a wrong approach, and it works fine with tweets. However, here are the problems:

1. How to measure the distance?
You can name any distance measure off the top of your head (Euclidean, cosine, edit distance, etc.). However, here we are comparing two pieces of text; how do we pick a sensible one?

2. d
Yeah, that is the "curse of dimensionality". A tweet is at most 140 characters, so d wouldn't exceed 140 (still high enough). However, if we now want to check whether my paper is plagiarized, since the shortest paper I have heard of is 3000 words (a letter, which I have never succeeded in writing), we are dealing with d in the range of 3000 to 10000. Besides requiring lots of calculation, it is highly possible we won't find any similarity at all.

Shingles
So here comes the first concept: the shingle. Each shingle contains a fixed number of words, and we will have n - k + 1 shingles, where n is the number of words in the text and k is the shingle length.

So the next question is, how long should a shingle be? As a rule of thumb,

k should be picked large enough that the probability of any given shingle appearing in any given document is low.
where k is the length of the shingles. In general, k = 5 works well for a normal-length e-mail, while k = 9 is acceptable for a research paper.

But what exactly is a shingle? Here is an example I learned from the blog mentioned above:

Given the text "Dora is a stupid lovely happy dog and Jesse is a smart ugly boring cat", the shingles for k = 5 would be:

Dora is a stupid lovely
is a stupid lovely happy
a stupid lovely happy dog
stupid lovely happy dog and
lovely happy dog and Jesse
happy dog and Jesse is
dog and Jesse is a
and Jesse is a smart
Jesse is a smart ugly
is a smart ugly boring
a smart ugly boring cat

As you can see, when we break the text down into shingles, the meanings of some pieces deviate from the original text. This is kind of interesting!

And this is the set of words we are going to use to measure the similarity.
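To make this concrete, here is a minimal sketch of how the shingles could be generated (assuming simple whitespace tokenization; the class and method names are my own, not from the blog):

import java.util.*;

public class Shingler {
    // Return all k-word shingles of the text (n - k + 1 of them),
    // splitting on whitespace.
    public static List<String> shingles(String text, int k) {
        String[] words = text.split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + k <= words.length; i++)
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        return result;
    }

    public static void main(String[] args) {
        String text = "Dora is a stupid lovely happy dog and Jesse is a smart ugly boring cat";
        for (String s : shingles(text, 5))
            System.out.println(s);
    }
}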

But we still haven't answered those two questions!

Random shingle selection and MinHash
If I write a 10,000-word paper, even if I break it down into shingles, I will still have ... 9996 shingles, which doesn't help a lot. Alternatively, we can randomly select 200 shingles out of all the shingles we have, and use that as our new set. That shrinks our set to only 2% of the original one, which saves a lot of work. And that's the reason we need to make the shingle length long enough when we choose it: to ensure randomness.

In Java, a string is an object, so storing 200 strings is still not a great idea. That's where hash functions come to the rescue: a hash value is an integer, so it takes constant space. From the blog, the procedure looks like this:


  1. Break down the text to shingles;
  2. Calculate the hash value for each shingle;
  3. Store the minimum hash value among all hash values;
  4. Repeat steps 2 and 3, with a different hash function each time, for the desired number of hash values, e.g. 200.

Since hashing is another way of random selection (uniform probability for small and large numbers), hashing 200 times (using 200 different hash functions) is the same as randomly selecting 200 shingles.

But I am not a cryptography expert; in fact, I know at most 2 hash functions, and even if I steal one from Java, I still need... 197 more...

Well, one way to do it is:
you XOR the value returned by String.hashCode() with 199 random numbers to generate the 199 other hash code values.
To see why this works, follow the explanation linked from the blog.
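As a minimal sketch of that trick (the class and method names are my own invention):

import java.util.Random;

public class HashFamily {
    private final int[] salts; // one random number per extra hash function

    public HashFamily(int numHashes, long seed) {
        Random r = new Random(seed);
        salts = new int[numHashes - 1]; // function 0 is plain hashCode()
        for (int i = 0; i < salts.length; i++)
            salts[i] = r.nextInt();
    }

    // The i-th hash of s: String.hashCode() itself for i == 0,
    // otherwise hashCode() XORed with the (i - 1)-th random number.
    public int hash(String s, int i) {
        int h = s.hashCode();
        return i == 0 ? h : h ^ salts[i - 1];
    }
}

The code snippet at the end of this post uses a slightly different variant (a circular shift XORed with a random number), but the idea is the same.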

Locality Sensitive Hashing (LSH)
If we only need to compare the similarity between two items, we are good now. However, consider a typical clustering task where we need to measure similarities among lots of items, say nearest-neighbor search.

Here is where MinHash meets LSH. LSH requires a hash function that divides the input into a large number of buckets, so that similar items land in the same bucket. The idea of LSH based on MinHash is to divide the hash signatures into b bands, and hash the columns in each band with a basic hash function. If text a and text b have the same values in a band, they will be hashed into the same bucket for that band. So if we have 200 hash signatures in total and divide them into 10 bands, each band contains 20 of the signature values per document, and those 20 values are hashed together into one bucket value. If any two documents land in the same bucket in any band, they are considered candidates and need further comparison.

Source: http://matthewcasperson.blogspot.com/2013/11/minhash-for-dummies.html

I stole the above figure from the blog I mentioned. In its example, we need to compare documents one and three further, since they have the same hash values. If, in band two, documents one and four also had the same hash values, we would need to compare documents one and four too.
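To make the banding concrete, here is a small sketch (the signature layout and all names are my own assumptions, not from the blog):

import java.util.*;

public class BandedLsh {
    // signatures: one row of numBands * rowsPerBand MinHash values per document.
    // Returns the document pairs that share a bucket in at least one band;
    // only these candidate pairs need a full similarity check.
    public static Set<String> candidatePairs(long[][] signatures, int numBands) {
        int rowsPerBand = signatures[0].length / numBands;
        Set<String> candidates = new TreeSet<>();
        for (int b = 0; b < numBands; b++) {
            Map<Integer, List<Integer>> buckets = new HashMap<>();
            for (int doc = 0; doc < signatures.length; doc++) {
                long[] band = Arrays.copyOfRange(signatures[doc],
                        b * rowsPerBand, (b + 1) * rowsPerBand);
                int bucket = Arrays.hashCode(band); // the "basic hash function"
                buckets.computeIfAbsent(bucket, k -> new ArrayList<>()).add(doc);
            }
            for (List<Integer> docs : buckets.values())
                for (int i = 0; i < docs.size(); i++)
                    for (int j = i + 1; j < docs.size(); j++)
                        candidates.add(docs.get(i) + "-" + docs.get(j));
        }
        return candidates;
    }
}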

Wait, how to measure the similarity???
Here comes the most important question. The similarity is measured by the so-called Jaccard similarity:

J(A, B) = |A ∩ B| / |A ∪ B|

Using MinHash, this becomes the number of equal elements between hash(A) and hash(B), divided by the total number of hash values.
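For reference, the exact Jaccard similarity is easy to compute directly; a minimal sketch, useful as ground truth for the MinHash estimate:

import java.util.*;

public class Jaccard {
    // Exact Jaccard similarity: |A ∩ B| / |A ∪ B|.
    public static double similarity(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        // Convention: two empty sets are identical.
        return union.isEmpty() ? 1.0 : (double) intersection.size() / union.size();
    }
}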

Code snippet
The following code is an example. For simplicity, it only compares the similarity between two sets of strings, so I didn't write the shingles part or the LSH part.

import java.util.*;

public class MinHash {
    // numHash: number of hash functions to use.
    // In this small case, we set it to size of text1 + size of text2 in the test case.
    public double similarity(Set<String> text1, Set<String> text2, int numHash) {
        long[][] minHashValues = new long[2][numHash];
        Arrays.fill(minHashValues[0], Long.MAX_VALUE);
        Arrays.fill(minHashValues[1], Long.MAX_VALUE);
        Random r = new Random(63689);
        int similarity = 0;
        for (int i = 0; i < numHash; i++) {
            int a = r.nextInt() + 1; // random number for the i-th hash function
            for (String s : text1)
                minHashValues[0][i] = Math.min(minHashValues[0][i], getHash(s.hashCode(), a, i));
            for (String s : text2)
                minHashValues[1][i] = Math.min(minHashValues[1][i], getHash(s.hashCode(), a, i));
            // The fraction of hash functions on which the two minimums agree
            // estimates the Jaccard similarity.
            if (minHashValues[0][i] == minHashValues[1][i]) {
                similarity++;
            }
        }
        return (double) similarity / numHash;
    }

    // Derive the i-th hash function from the string's hashCode() via a circular
    // shift XORed with a random number:
    // http://en.wikipedia.org/wiki/Circular_shift
    // http://stackoverflow.com/questions/5844084/java-circular-shift-using-bitwise-operations
    private long getHash(int value, int random, int shift) {
        // The first hash function is String.hashCode() itself:
        // http://www.codatlas.com/github.com/openjdk-mirror/jdk7u-jdk/master/src/share/classes/java/lang/String.java?keyword=String&line=1494
        if (shift == 0)
            return value;
        int rst = (value >>> shift) | (value << (Integer.SIZE - shift));
        return rst ^ random;
    }
}
// MinHashTester.java (a public class must live in its own file)
public class MinHashTester {

    public static void main(String[] args) {
        Set<String> text1 = new HashSet<>();
        text1.add("Dora");
        text1.add("is");
        text1.add("a");
        text1.add("stupid");
        text1.add("lovely");
        text1.add("happy");
        text1.add("puppy");

        // text2 is identical to text1.
        Set<String> text2 = new HashSet<>();
        text2.add("Dora");
        text2.add("is");
        text2.add("a");
        text2.add("stupid");
        text2.add("lovely");
        text2.add("happy");
        text2.add("puppy");

        // text3 only partially overlaps text1.
        Set<String> text3 = new HashSet<>();
        text3.add("Dora");
        text3.add("the");
        text3.add("happy");
        text3.add("puppy");
        text3.add("loves");
        text3.add("Shirley");

        // text4 contains the same words as text1, added in a different order.
        Set<String> text4 = new HashSet<>();
        text4.add("Dora");
        text4.add("stupid");
        text4.add("is");
        text4.add("lovely");
        text4.add("happy");
        text4.add("a");
        text4.add("puppy");

        MinHash mh = new MinHash();
        System.out.println(String.format("%.3f", mh.similarity(text1, text2, text1.size() + text2.size())));
        System.out.println(String.format("%.3f", mh.similarity(text1, text3, text1.size() + text3.size())));
        System.out.println(String.format("%.3f", mh.similarity(text1, text4, text1.size() + text4.size())));
    }
}

Of the 3 tests, the first and third both print 1.000 (text2 and text4 contain exactly the same words as text1), while the second prints a lower value, since text1 and text3 only partially overlap.

Apparently the order doesn't affect the similarity, since we only store the minimum hash value over all strings in a set. Note that in reality we need to deal with shingles, not the bag of words shown in the code, because order DOES matter.

Moreover, I am always interested in hashing, even though I have never taken any classes on cryptography. Here is how Java implements hashing in HashMap.hash() and String.hashCode().
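From memory (this is the JDK 6/7-era source, quoted from memory, so treat it as a sketch): String.hashCode() is the classic 31-based polynomial, and HashMap applies an extra bit-spreading step on top of it.

// String.hashCode(): h = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1],
// cached in the hash field after the first call.
public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        for (int i = 0; i < value.length; i++)
            h = 31 * h + value[i];
        hash = h;
    }
    return h;
}

// HashMap's supplemental hash, which spreads the high bits downward
// so that power-of-two table indexing still sees them.
static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}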


