Monday, February 9, 2015

Repeated DNA Sequences

The tricky part of this problem is that we cannot simply use a set to store or substrings: it will exceed memory limit. A smarter way is to use integers instead of strings. I use the rolling hash method, which is also used in the implement strStr() problem (see that problem for detail).


public class RepeatedDNA {
    private static final Map dna = new HashMap ();
    static {
        dna.put('A', 0);
        dna.put('C', 1);
        dna.put('G', 2);
        dna.put('T', 3);
    }
    private final int base = 29;
    public List findRepeatedDnaSequences(String s) {
        if (s == null)
            throw new NullPointerException("Null String!");
        //avoid duplicate sequence
        List rst = new ArrayList ();
        if (s.length() == 0)
            return rst;
        Set sequence = new HashSet ();
        long hashS = 0;
        for (int i = 0; i  < s.length(); i++) {
            if (i > 9) {
                hashS -= (long)Math.pow(base, 9) * dna.get(s.charAt(i - 10));
            }
            hashS = hashS * base + dna.get(s.charAt(i));
            if (i > 8 && !sequence.add(hashS)) {
                if (!rst.contains(s.substring(i - 9, i + 1)))
                    rst.add(s.substring(i - 9, i + 1));
            }
                
        }
        return rst;
    }
}

No comments:

Post a Comment