Thursday, October 27, 2016

UTF-8 Validation

A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:
  1. For a 1-byte character, the first bit is 0, followed by its Unicode code.
  2. For an n-byte character, the first n bits are all 1s and the (n+1)th bit is 0, followed by n-1 bytes whose most significant 2 bits are 10.
This is how the UTF-8 encoding would work:
   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
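
Read row by row, the table is also a recipe for encoding a code point. The sketch below follows the four rows literally; the class name Utf8EncodeSketch and the method encode are made up for illustration, and the sketch assumes nothing beyond the table (for example, it does not exclude the surrogate range D800-DFFF, which the table does not mention).

```java
class Utf8EncodeSketch {
    // Encodes one code point into UTF-8 byte values, one branch per table row.
    static int[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("code point out of range: " + cp);
        }
        if (cp <= 0x7F) {
            // 0xxxxxxx
            return new int[] { cp };
        }
        if (cp <= 0x7FF) {
            // 110xxxxx 10xxxxxx
            return new int[] { 0xC0 | (cp >> 6), 0x80 | (cp & 0x3F) };
        }
        if (cp <= 0xFFFF) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            return new int[] {
                0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)
            };
        }
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new int[] {
            0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)
        };
    }

    public static void main(String[] args) {
        // U+0142 falls in the second row, so it takes two bytes.
        System.out.println(java.util.Arrays.toString(encode(0x142))); // [197, 130]
    }
}
```

The two bytes produced for U+0142 are 11000101 10000010 in binary, the 110xxxxx 10xxxxxx pattern from the second row.
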
Given an array of integers representing the data, return whether it is a valid UTF-8 encoding.
Note:
The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data. This means each integer represents only 1 byte of data.
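
As a small illustration of this note, the low 8 bits of an integer can be isolated with a mask and printed as an octet. This is only a sketch; the class name OctetView and the method toOctet are hypothetical names chosen for the example.

```java
class OctetView {
    // Keeps only the least significant 8 bits and renders them as an 8-character bit string.
    static String toOctet(int value) {
        String bits = Integer.toBinaryString(value & 0xFF);
        // Left-pad with zeros so every byte shows all 8 bits.
        return String.format("%8s", bits).replace(' ', '0');
    }

    public static void main(String[] args) {
        System.out.println(toOctet(197)); // 11000101
        System.out.println(toOctet(1));   // 00000001
        // 453 is 256 + 197, so its low 8 bits are the same as those of 197.
        System.out.println(toOctet(453)); // 11000101
    }
}
```
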
Example 1:
data = [197, 130, 1], which represents the octet sequence: 11000101 10000010 00000001.

Return true.
It is a valid UTF-8 encoding for a 2-byte character followed by a 1-byte character.
Example 2:
data = [235, 140, 4], which represents the octet sequence: 11101011 10001100 00000100.

Return false.
The first 3 bits are all 1s and the 4th bit is 0, which means it is a 3-byte character.
The next byte is a continuation byte which starts with 10 and that's correct.
But the second continuation byte does not start with 10, so it is invalid.

There are many ways to solve this. Mine is a very straightforward one: convert each input integer to an 8-bit binary string, then walk through the strings and check each character against the encoding rules.

public boolean validUtf8(int[] data) {
    if (data.length == 0) {
        return false;
    }
    int len = data.length;
    String[] bytes = new String[len];
    for (int i = 0; i < len; i++) {
        // Only the least significant 8 bits of each integer hold data
        int num = data[i] & 0xFF;
        bytes[i] = Integer.toBinaryString(num);
        // Left-pad with zeros to a full 8 bits
        while (bytes[i].length() < 8) {
            bytes[i] = "0" + bytes[i];
        }
    }

    int pos = 0;
    int leftLength = len;
    while (pos < len && leftLength > 0) {
        String currByte = bytes[pos];
        int n = 0;
        if (currByte.charAt(0) == '0') {
            // 1-byte character
            n = 1;
            pos++;
        } else {
            // Count the leading 1s: that is the character's length in bytes
            while (n < currByte.length() && currByte.charAt(n) == '1') {
                n++;
            }
            // n == 1 is a stray continuation byte; n > 4 exceeds the 4-byte limit
            if (n == 1 || n > 4) {
                return false;
            }
            // The bytes left in the array must number at least n
            if (leftLength < n) {
                return false;
            }
            int curr = 1;
            pos++;
            // Each of the next n - 1 bytes must start with 10
            while (curr < n) {
                if (!"10".equals(bytes[pos].substring(0, 2))) {
                    return false;
                }
                curr++;
                pos++;
            }
        }
        leftLength -= n;
    }
    return true;
}
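
For comparison, the same rules can also be checked with bit masks instead of binary strings. The following is a minimal sketch of that alternative rather than the method above; the class name Utf8BitwiseCheck is made up for illustration, and it assumes an empty array counts as valid, which differs from the method above.

```java
class Utf8BitwiseCheck {
    static boolean validUtf8(int[] data) {
        // Number of continuation bytes (10xxxxxx) still expected for the current character.
        int remaining = 0;
        for (int value : data) {
            int b = value & 0xFF; // only the low 8 bits carry data
            if (remaining > 0) {
                if ((b & 0xC0) != 0x80) {
                    return false; // expected a continuation byte
                }
                remaining--;
            } else if ((b & 0x80) == 0x00) {
                // 0xxxxxxx: 1-byte character, nothing more to expect
            } else if ((b & 0xE0) == 0xC0) {
                remaining = 1; // 110xxxxx
            } else if ((b & 0xF0) == 0xE0) {
                remaining = 2; // 1110xxxx
            } else if ((b & 0xF8) == 0xF0) {
                remaining = 3; // 11110xxx
            } else {
                return false; // stray continuation byte, or five or more leading 1s
            }
        }
        // Assumption: an empty array is treated as valid (remaining stays 0).
        return remaining == 0;
    }

    public static void main(String[] args) {
        System.out.println(validUtf8(new int[] {197, 130, 1})); // true  (Example 1)
        System.out.println(validUtf8(new int[] {235, 140, 4})); // false (Example 2)
    }
}
```

This version makes a single pass and keeps only one counter, so it avoids allocating a string per byte.
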

