Monday, January 12, 2015

Knuth - Morris - Pratt Algorithm: Let's give it a fancy name, pattern match

The initiative of studying this algorithm is because of this problem from LeetCode:
Implement strStr().
Returns the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.
It is an easy level problem, yet I spend my whole night figuring out this algorithm. And yes, I missed the Golden Globe (Congratulations to Eddie Redmayne, who portrayed one of my favorite physicists Stephen Hawking).

This algorithm is indeed a very interesting one. At first glance, it is very hard to understand, and require O(m) extra space, but if you really take some effort and try to understand it, you will realize how neat it can make the whole method.

Partial Match Table
Ok, start from Partial Match Table / Failure Function (PMT).  There are lots of resources to explain it, some of them are very fancy and complicated, I put the Wikipedia link here if you want some official explanation. Mine will be simpler. The partial match table matches the longest prefix and suffix that are equal of all substrings in a string. The prefix and suffix do not include the substring.
For example, consider string ="abcabdabc" will form a PMT as shown in the following table.


For an empty substring, the length is not defined, so we choose -1. Another reason to choose -1 is that when we use this table later in the pattern match, it will be easier for index incrementation. For a substring with length equals to 1, there is no prefix or suffix that doesn't include the whole substring, so the length = 0. For substring with length equal to 2 and 3, no equal prefix and suffix is found, so lengths still equal to 0. And we move on....

So how to calculate the PMT? Take a look at the table again. For "ab", since the length of equal prefix  & suffix = 0 for the previous substring "a", we compare the first character and the last character, since they are not equal, the length is still 0. Same for "abc", now it comes to "abca", since the the first and last character are equal, the length is 1. Or it is 0 + 1. Then it comes to "abcab", we already know from the last substring that the first character equals the last character (of the last substring), now we add one more character, we compare the second character with the last character (of this substring), the length is 2, or 1 + 1, You see the trend here? Every time, we compare the next character after the longest prefix (that equals to the suffix) with the last character, if the longest prefix = 0, we compare the first character with the last one, if the longest prefix = 1, we compare the second character, which has the index 1, with the last character. However, if the characters compared are not equal, the longest prefix = 0, i.e., :


PMT[i] = (ptrn.charAt(i - 1) == ptrn.charAt(PMT[i - 1])) ? (PMT[i - 1] + 1) : 0;

Yeah, as you know, it's DP.

What is the table used for? 

Considering a text string = "abcabcabdabcabcabdabdabc" (I am making it larger to look clearer), how to find the previous pattern string we showed above? You can say we compare each character in the pattern with the text string, if any character doesn't match, we slide the pattern string down to the next character in the text string that matches the first character in the pattern string. Yes, we can do that way, but it will take O(mn) time where m is the length of the pattern string and n is the length of the text string. So, can we do better? 

Look at the above figure, now we are at index 5 and the characters don't match. Since we already know that "abcab" has the longest prefix and suffix = 2, that means"ab" matches with "ab", now if we compare the character at index 5 in the text string with the character at index 2 (from the PMT table), they match, so next time we start from the characters at index 6 in the text string and index 3 at pattern string. If they don't match, say the following string, then from the table, the longest prefix of substring "ab" is 0, so unfortunately, we need to start over.



Using this algorithm allows us to trim off lots of unnecessary comparisons, the complexity is O(m + n) compared to the brutal force implementation. However, when the length of the string increases and mismatch increases, the complexity also increases.

public class KMP {
 private int[] getPMT(String ptrn) {
  int ptrnLen = ptrn.length();
  //partial match table
  int[] PMT = new int[ptrnLen + 1];
  PMT[0] = -1;
  PMT[1] = 0;
  for (int i = 2; i <= ptrnLen; i++) {
   PMT[i] = (ptrn.charAt(i - 1) == ptrn.charAt(PMT[i - 1])) ? (PMT[i - 1] + 1) : 0;
   System.out.println(i + ": " + PMT[i]);
  }
  return PMT;
  
     
    
 }
 public List searchSubString(String text, String ptrn) {
  if (text == null || ptrn == null)
   throw new NullPointerException("Null String(s)!");
  List rst = new ArrayList ();
  if (ptrn.length() == 0) {
   rst.add(0);
   return rst;
  }
  if (text.length() == 0 || text.length() < ptrn.length()) {
    return rst;
  }
  
  int indexT = 0;
  int indexP = 0;
  int ptrnLen = ptrn.length();
  int txtLen = text.length();
  int[] PMT = getPMT(ptrn);
  while (indexT < txtLen) {
   while (indexP >= 0 && text.charAt(indexT) != ptrn.charAt(indexP)) {
    indexP = PMT[indexP];
   }
   indexP++;
   indexT++;
   if (indexP == ptrnLen) {
    rst.add(indexT - ptrnLen);
    indexP = PMT[indexP];
   }
  }
  return rst;
 }


Source code can be found here: https://github.com/shirleyyoung0812/Knuth-Morris-Pratt-Algorithm.git

No comments:

Post a Comment