Saturday, March 7, 2015

Check if a string is in valid UTF-8 format

After being given a clear definition of the UTF-8 format (ex: 1-byte: 0b0xxxxxxx, 2-bytes: ....), you are asked to write a function that validates whether the input is valid UTF-8. The input will be a string/byte array; the output should be yes/no.


This is a Google interview question for interns. After I failed to get an answer from Stack Overflow, I posted my own answer.

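Well, I am grateful for the comments and the answer. First of all, I have to agree that this is "another stupid interview question". It is true that a Java String is already a sequence of decoded Unicode (UTF-16) characters, so converting it to UTF-8 essentially always succeeds. One way to check, given a string:
import java.io.UnsupportedEncodingException;

public static boolean isUTF8(String s){
    try{
        // In practice this call does not fail on the string's content;
        // the exception below concerns only the charset name.
        byte[] bytes = s.getBytes("UTF-8");
    }catch(UnsupportedEncodingException e){
        // Only thrown if the JVM does not support the "UTF-8" charset,
        // which every Java platform is required to provide.
        e.printStackTrace();
        System.exit(-1);
    }
    return true;
}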
However, since every printable string is already in Unicode form, and the exception only concerns whether the charset name is supported, I never got a chance to see an error.
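To actually see an error, the check has to start from raw bytes rather than from a String. Below is a minimal sketch of that idea using the JDK's CharsetDecoder configured to report malformed input instead of replacing it. The class name StrictUtf8 and the method name isValidUtf8 are my own choices for illustration and are not part of the original answer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // Hypothetical helper: true only if the bytes decode as UTF-8 without errors.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

This leans on the library rather than on the bit patterns, so it is probably not what the interviewer wants, but it is handy for cross-checking a hand-written validator.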
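Second, if given a byte array, each byte will always be in the range -2^7 (0b10000000) to 2^7-1 (0b01111111), so no single byte is out of range. That alone does not make the array valid UTF-8, though: the bit pattern of each lead byte, and of the continuation bytes that follow it, still has to be checked.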
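For the byte-array case, here is a rough sketch of a validator that walks the bit patterns directly. It is not the answer from my original post. The class name Utf8Validator is made up for illustration, and the checks for overlong encodings, surrogates and code points above U+10FFFF are my assumption about what "valid" should include (they follow the UTF-8 standard, but the interview definition may have covered only the lead and continuation patterns).

```java
class Utf8Validator {
    // Hypothetical helper: validates UTF-8 by inspecting lead and continuation bytes.
    static boolean isValidUtf8(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;      // treat the signed Java byte as 0..255
            int extra;                   // number of continuation bytes expected
            int codePoint;               // payload bits collected so far
            int min;                     // smallest code point allowed for this length
            if ((b & 0x80) == 0x00) {           // 0xxxxxxx
                extra = 0; codePoint = b; min = 0x0;
            } else if ((b & 0xE0) == 0xC0) {    // 110xxxxx
                extra = 1; codePoint = b & 0x1F; min = 0x80;
            } else if ((b & 0xF0) == 0xE0) {    // 1110xxxx
                extra = 2; codePoint = b & 0x0F; min = 0x800;
            } else if ((b & 0xF8) == 0xF0) {    // 11110xxx
                extra = 3; codePoint = b & 0x07; min = 0x10000;
            } else {
                return false;            // stray continuation byte or invalid lead
            }
            if (i + extra >= data.length) {
                return false;            // sequence is cut off at the end of the array
            }
            for (int k = 1; k <= extra; k++) {
                int c = data[i + k] & 0xFF;
                if ((c & 0xC0) != 0x80) {
                    return false;        // continuation bytes must look like 10xxxxxx
                }
                codePoint = (codePoint << 6) | (c & 0x3F);
            }
            // Assumed extra checks: overlong forms, surrogates, out-of-range values.
            if (codePoint < min || codePoint > 0x10FFFF
                    || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
                return false;
            }
            i += extra + 1;
        }
        return true;
    }
}
```

The `& 0xFF` mask is what deals with the signed range mentioned above: without it, any byte with the high bit set is negative and the pattern comparisons become easy to get wrong.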


Here is my initial post and answer.
