393. UTF-8 Validation - Leetcode Solution

Code Implementation

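class Solution:
    def validUtf8(self, data):
        # Number of continuation bytes still expected for the current character.
        n_bytes = 0
        for num in data:
            byte = num & 0xFF  # keep only the low 8 bits
            if n_bytes == 0:
                # Leading byte: its prefix determines the character length.
                if (byte >> 5) == 0b110:      # 110xxxxx: 2-byte character
                    n_bytes = 1
                elif (byte >> 4) == 0b1110:   # 1110xxxx: 3-byte character
                    n_bytes = 2
                elif (byte >> 3) == 0b11110:  # 11110xxx: 4-byte character
                    n_bytes = 3
                elif (byte >> 7):             # 10xxxxxx or 11111xxx: not a valid start
                    return False
            else:
                # Continuation byte: must look like 10xxxxxx.
                if (byte >> 6) != 0b10:
                    return False
                n_bytes -= 1
        # Valid only if the last character was not cut short.
        return n_bytes == 0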
      
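class Solution {
public:
    bool validUtf8(vector<int>& data) {
        // Number of continuation bytes still expected for the current character.
        int n_bytes = 0;
        for (int num : data) {
            unsigned char byte = num & 0xFF;  // keep only the low 8 bits
            if (n_bytes == 0) {
                // Leading byte: its prefix determines the character length.
                if ((byte >> 5) == 0b110)         // 110xxxxx: 2-byte character
                    n_bytes = 1;
                else if ((byte >> 4) == 0b1110)   // 1110xxxx: 3-byte character
                    n_bytes = 2;
                else if ((byte >> 3) == 0b11110)  // 11110xxx: 4-byte character
                    n_bytes = 3;
                else if ((byte >> 7))             // 10xxxxxx or 11111xxx: not a valid start
                    return false;
            } else {
                // Continuation byte: must look like 10xxxxxx.
                if ((byte >> 6) != 0b10)
                    return false;
                n_bytes--;
            }
        }
        // Valid only if the last character was not cut short.
        return n_bytes == 0;
    }
};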
      
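class Solution {
    public boolean validUtf8(int[] data) {
        // Number of continuation bytes still expected for the current character.
        int n_bytes = 0;
        for (int num : data) {
            int b = num & 0xFF;  // keep only the low 8 bits
            if (n_bytes == 0) {
                // Leading byte: its prefix determines the character length.
                if ((b >> 5) == 0b110) {            // 110xxxxx: 2-byte character
                    n_bytes = 1;
                } else if ((b >> 4) == 0b1110) {    // 1110xxxx: 3-byte character
                    n_bytes = 2;
                } else if ((b >> 3) == 0b11110) {   // 11110xxx: 4-byte character
                    n_bytes = 3;
                } else if ((b >> 7) != 0) {         // 10xxxxxx or 11111xxx: not a valid start
                    return false;
                }
            } else {
                // Continuation byte: must look like 10xxxxxx.
                if ((b >> 6) != 0b10) {
                    return false;
                }
                n_bytes--;
            }
        }
        // Valid only if the last character was not cut short.
        return n_bytes == 0;
    }
}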
      
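var validUtf8 = function(data) {
    // Number of continuation bytes still expected for the current character.
    let n_bytes = 0;
    for (const num of data) {
        const byte = num & 0xFF;  // keep only the low 8 bits
        if (n_bytes === 0) {
            // Leading byte: its prefix determines the character length.
            if ((byte >> 5) === 0b110) {            // 110xxxxx: 2-byte character
                n_bytes = 1;
            } else if ((byte >> 4) === 0b1110) {    // 1110xxxx: 3-byte character
                n_bytes = 2;
            } else if ((byte >> 3) === 0b11110) {   // 11110xxx: 4-byte character
                n_bytes = 3;
            } else if ((byte >> 7) !== 0) {         // 10xxxxxx or 11111xxx: not a valid start
                return false;
            }
        } else {
            // Continuation byte: must look like 10xxxxxx.
            if ((byte >> 6) !== 0b10) {
                return false;
            }
            n_bytes--;
        }
    }
    // Valid only if the last character was not cut short.
    return n_bytes === 0;
};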
      

Problem Description

You are given an array of integers called data, where each integer represents one byte (8 bits). Your task is to determine if the given sequence of bytes forms a valid UTF-8 encoding.

The UTF-8 encoding rules are as follows:

  • For a 1-byte character, the first bit is a 0, followed by its 7-bit code.
  • For an n-byte character (where n is from 2 to 4), the first byte starts with n 1's followed by a 0, and each of the following n-1 bytes starts with "10".

You must check whether the input data is a valid sequence according to these rules. Each element of data is guaranteed to be between 0 and 255.

Note: You must not reuse bytes for multiple characters, and each character must be formed from consecutive bytes only.
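
To make the rules concrete, here is a small sketch that labels a single byte by its leading bits. The helper name classify_byte and its return labels are hypothetical, chosen only for this illustration; it works on the binary string of the byte rather than on bit shifts so the prefixes are easy to read.

```python
# Hypothetical helper for illustration; not part of the solution above.
def classify_byte(byte):
    """Return a label for one byte based on its leading bits."""
    bits = format(byte & 0xFF, "08b")  # e.g. 197 -> "11000101"
    if bits.startswith("0"):
        return "single"          # 0xxxxxxx: 1-byte character
    if bits.startswith("10"):
        return "continuation"    # 10xxxxxx: continuation byte
    if bits.startswith("110"):
        return "lead-2"          # 110xxxxx: starts a 2-byte character
    if bits.startswith("1110"):
        return "lead-3"          # 1110xxxx: starts a 3-byte character
    if bits.startswith("11110"):
        return "lead-4"          # 11110xxx: starts a 4-byte character
    return "invalid"             # 11111xxx: five or more leading 1's


# Print each sample value with its bit pattern and label.
for value in (1, 130, 197, 226, 240, 255):
    print(value, format(value, "08b"), classify_byte(value))
```

Every possible byte value falls into exactly one of these six categories, which is why a validator only needs to look at the top few bits of each byte.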

Thought Process

When approaching this problem, the first instinct is to check each byte and try to match it to the UTF-8 encoding rules. For each byte, we need to determine whether it is the start of a new character or a continuation byte. Recognizing the start and continuation bytes is crucial.

A brute-force approach might try to parse every possible grouping of bytes, but that is both inefficient and unnecessary: a leading byte fully determines the length of its character, so there is only ever one grouping to check. We can therefore process the array from left to right and keep track of how many continuation bytes we still expect after seeing a leading byte.

By counting the number of leading 1's in the first byte, we know how many bytes the character should have. Then, we ensure the following bytes start with "10". If at any point the pattern is broken, or we run out of bytes, the encoding is invalid.
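
The counting idea can be sketched as follows. The function names count_leading_ones and continuation_bytes_expected are assumptions made for this example, and returning None for an invalid leading byte is just one possible convention.

```python
def count_leading_ones(byte):
    """Count consecutive 1 bits, starting from the most significant bit of an 8-bit value."""
    count = 0
    mask = 0b10000000
    while mask and (byte & mask):
        count += 1
        mask >>= 1
    return count


def continuation_bytes_expected(lead_byte):
    """Return how many continuation bytes must follow lead_byte, or None if it cannot start a character."""
    ones = count_leading_ones(lead_byte)
    if ones == 0:
        return 0          # single-byte character
    if 2 <= ones <= 4:
        return ones - 1   # n leading 1's means n - 1 continuation bytes
    return None           # 1 leading 1 (a continuation byte) or more than 4


print(continuation_bytes_expected(197))  # 197 is 11000101
print(continuation_bytes_expected(130))  # 130 is 10000010
```

The shift-based comparisons in the code implementation reach the same decision without an explicit loop, since only four prefixes are valid for a leading byte.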

Solution Approach

The solution proceeds as follows:

  1. Initialize a counter: Use a variable (e.g., n_bytes) to keep track of how many continuation bytes are expected.
  2. Process each byte: For each byte in data:
    • If n_bytes == 0, this is a leading byte. Count the number of leading 1's to determine the total number of bytes for this character.
    • If the leading byte has:
      • 0 leading 1's: it's a single-byte character.
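      • 2-4 leading 1's: it's the start of a multi-byte character. Set n_bytes to the number of leading 1's minus one (the number of continuation bytes still needed).
      • Exactly 1 leading 1 (a continuation byte where a leading byte was expected) or more than 4 leading 1's: invalid, so return false.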
    • If n_bytes > 0, this byte must be a continuation byte, i.e., it must start with "10". If not, return false. Decrement n_bytes.
  3. Final check: After all bytes are processed, n_bytes must be zero. If not, the sequence is incomplete/invalid.

Bitwise operations are used to efficiently check the leading bits of each byte.

Example Walkthrough

Let's consider data = [197, 130, 1].

  1. First byte: 197 (binary: 11000101)
    • Starts with "110", so this is the start of a 2-byte character.
    • Set n_bytes = 1 (expecting 1 continuation byte).
  2. Second byte: 130 (binary: 10000010)
    • Starts with "10", so it's a valid continuation byte.
    • Decrement n_bytes to 0.
  3. Third byte: 1 (binary: 00000001)
    • Starts with "0", so it's a valid single-byte character.
    • n_bytes remains 0.
  4. End of input: n_bytes is 0, so the sequence is valid.
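
The walkthrough can be reproduced with a traced variant of the same algorithm. This is only a sketch: trace_utf8 and the wording of its step messages are made up for this example, while the bit checks mirror the Python solution above.

```python
def trace_utf8(data):
    """Validate data and return (is_valid, steps), where steps describes each byte."""
    steps = []
    n_bytes = 0  # continuation bytes still expected
    for num in data:
        byte = num & 0xFF
        bits = format(byte, "08b")
        if n_bytes == 0:
            if (byte >> 5) == 0b110:
                n_bytes = 1
            elif (byte >> 4) == 0b1110:
                n_bytes = 2
            elif (byte >> 3) == 0b11110:
                n_bytes = 3
            elif byte >> 7:
                steps.append(f"{num} ({bits}): cannot start a character")
                return False, steps
            steps.append(f"{num} ({bits}): starts a character, expecting {n_bytes} more")
        else:
            if (byte >> 6) != 0b10:
                steps.append(f"{num} ({bits}): expected a continuation byte")
                return False, steps
            n_bytes -= 1
            steps.append(f"{num} ({bits}): continuation byte, expecting {n_bytes} more")
    return n_bytes == 0, steps


valid, steps = trace_utf8([197, 130, 1])
for line in steps:
    print(line)
print("valid:", valid)
```

Calling trace_utf8([197]) returns False for the validity flag even though no individual byte is rejected, because the final check finds that one continuation byte is still outstanding.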

Time and Space Complexity

Brute-force approach: If you tried all possible groupings, the time complexity would be exponential, which is infeasible for large inputs.

Optimized approach: The above algorithm processes each byte exactly once, performing constant-time operations for each. Thus, the time complexity is O(N), where N is the length of data.

The space complexity is O(1) because only a few integer variables are used, regardless of the input size.
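
One way to see the constant-space property is that the same loop works on a stream of bytes that is never stored in full. The sketch below assumes the input arrives as a Python iterable; valid_utf8_stream and repeated_pattern are hypothetical names used only here.

```python
def valid_utf8_stream(byte_iter):
    """Validate bytes from any iterable, keeping a single counter as state."""
    n_bytes = 0
    for num in byte_iter:
        byte = num & 0xFF
        if n_bytes == 0:
            if (byte >> 5) == 0b110:
                n_bytes = 1
            elif (byte >> 4) == 0b1110:
                n_bytes = 2
            elif (byte >> 3) == 0b11110:
                n_bytes = 3
            elif byte >> 7:
                return False
        else:
            if (byte >> 6) != 0b10:
                return False
            n_bytes -= 1
    return n_bytes == 0


def repeated_pattern(times):
    """Yield the bytes 197, 130, 1 repeatedly without building a list."""
    for _ in range(times):
        yield from (197, 130, 1)


print(valid_utf8_stream(repeated_pattern(100_000)))
```

The generator produces 300,000 bytes one at a time, yet the validator holds only n_bytes and the current byte at any moment.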

Summary

The key to solving the UTF-8 Validation problem is understanding the encoding rules and using bitwise operations to check each byte's leading bits. By keeping track of how many continuation bytes are still expected, we can validate the input in a single pass with O(N) time and O(1) extra space.