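class Solution:
    def validUtf8(self, data):
        n_bytes = 0  # continuation bytes still expected
        for num in data:
            byte = num & 0xFF  # keep only the low 8 bits
            if n_bytes == 0:
                # Leading byte: its prefix decides the character length.
                if (byte >> 5) == 0b110:
                    n_bytes = 1
                elif (byte >> 4) == 0b1110:
                    n_bytes = 2
                elif (byte >> 3) == 0b11110:
                    n_bytes = 3
                elif (byte >> 7):
                    # 10xxxxxx or 11111xxx cannot start a character.
                    return False
            else:
                # A continuation byte must look like 10xxxxxx.
                if (byte >> 6) != 0b10:
                    return False
                n_bytes -= 1
        return n_bytes == 0
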
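#include <vector>
using std::vector;

class Solution {
public:
    bool validUtf8(vector<int>& data) {
        int n_bytes = 0;  // continuation bytes still expected
        for (int num : data) {
            unsigned char byte = num & 0xFF;  // keep only the low 8 bits
            if (n_bytes == 0) {
                // Leading byte: its prefix decides the character length.
                if ((byte >> 5) == 0b110)
                    n_bytes = 1;
                else if ((byte >> 4) == 0b1110)
                    n_bytes = 2;
                else if ((byte >> 3) == 0b11110)
                    n_bytes = 3;
                else if ((byte >> 7))
                    return false;  // 10xxxxxx or 11111xxx cannot start a character
            } else {
                // A continuation byte must look like 10xxxxxx.
                if ((byte >> 6) != 0b10)
                    return false;
                n_bytes--;
            }
        }
        return n_bytes == 0;
    }
};
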
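class Solution {
    public boolean validUtf8(int[] data) {
        int n_bytes = 0;  // continuation bytes still expected
        for (int num : data) {
            int b = num & 0xFF;  // keep only the low 8 bits
            if (n_bytes == 0) {
                // Leading byte: its prefix decides the character length.
                if ((b >> 5) == 0b110) {
                    n_bytes = 1;
                } else if ((b >> 4) == 0b1110) {
                    n_bytes = 2;
                } else if ((b >> 3) == 0b11110) {
                    n_bytes = 3;
                } else if ((b >> 7) != 0) {
                    return false;  // 10xxxxxx or 11111xxx cannot start a character
                }
            } else {
                // A continuation byte must look like 10xxxxxx.
                if ((b >> 6) != 0b10) {
                    return false;
                }
                n_bytes--;
            }
        }
        return n_bytes == 0;
    }
}
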
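var validUtf8 = function(data) {
    let n_bytes = 0;  // continuation bytes still expected
    for (let num of data) {
        let byte = num & 0xFF;  // keep only the low 8 bits
        if (n_bytes === 0) {
            // Leading byte: its prefix decides the character length.
            if ((byte >> 5) === 0b110) {
                n_bytes = 1;
            } else if ((byte >> 4) === 0b1110) {
                n_bytes = 2;
            } else if ((byte >> 3) === 0b11110) {
                n_bytes = 3;
            } else if ((byte >> 7) !== 0) {
                return false;  // 10xxxxxx or 11111xxx cannot start a character
            }
        } else {
            // A continuation byte must look like 10xxxxxx.
            if ((byte >> 6) !== 0b10) {
                return false;
            }
            n_bytes--;
        }
    }
    return n_bytes === 0;
};
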
You are given an array of integers called data, where each integer represents one byte (8 bits). Your task is to determine if the given sequence of bytes forms a valid UTF-8 encoding.
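The UTF-8 encoding rules are as follows:
- A character is 1 to 4 bytes long.
- A 1-byte character has 0 as its first bit (0xxxxxxx).
- An n-byte character (n from 2 to 4) starts with a byte whose first n bits are 1's followed by a 0 (110xxxxx, 1110xxxx, or 11110xxx). Each of the next n - 1 bytes starts with the bits 10 (10xxxxxx).
Return true if data is a valid sequence according to these rules. Each element of data is guaranteed to be between 0 and 255.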
Note: You must not reuse bytes for multiple characters, and each character must be formed from consecutive bytes only.
When approaching this problem, the first instinct is to check each byte and try to match it to the UTF-8 encoding rules. For each byte, we need to determine whether it is the start of a new character or a continuation byte. Recognizing the start and continuation bytes is crucial.
A brute-force approach might involve trying to parse all possible groupings of bytes, but this quickly becomes inefficient and complex. Instead, we can process the array from left to right and keep track of how many continuation bytes we expect after seeing a leading byte.
By counting the number of leading 1's in the first byte, we know how many bytes the character should have. Then, we ensure the following bytes start with "10". If at any point the pattern is broken, or we run out of bytes, the encoding is invalid.
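To make the "count the leading 1's" idea concrete, here is a minimal Python sketch. The helper name count_leading_ones is hypothetical and does not appear in the solutions above; it only illustrates how the first byte reveals the character length.

```python
def count_leading_ones(byte: int) -> int:
    """Return how many consecutive 1 bits start the low 8 bits of `byte`.

    Hypothetical helper for illustration; not part of the solutions above.
    """
    byte &= 0xFF          # only the least significant 8 bits carry data
    mask = 0b10000000     # start at the most significant bit
    count = 0
    while mask and (byte & mask):
        count += 1
        mask >>= 1        # move one bit to the right
    return count


# Print the bit pattern and leading-1 count for a few sample bytes.
for value in (1, 197, 130, 240):
    print(f"{value:3d} = {value:08b} -> {count_leading_ones(value)} leading 1's")
```

Under this sketch, a count of 0 means a single-byte character, 2 to 4 means the leading byte of a character of that many bytes, 1 means a continuation byte, and 5 or more can never appear in valid input.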
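The solution proceeds as follows:
1. Initialize a counter (n_bytes) to keep track of how many continuation bytes are expected.
2. For each byte in data:
   - If n_bytes == 0, this is a leading byte. Count the number of leading 1's to determine the total number of bytes for this character.
     - Set n_bytes accordingly (number of continuation bytes needed). A byte starting with 0 is a complete single-byte character, so n_bytes stays 0.
     - If the byte matches no valid leading pattern (it starts with "10" or with five or more 1's), return false.
   - If n_bytes > 0, this byte must be a continuation byte, i.e., it must start with "10". If not, return false. Decrement n_bytes.
3. After processing all bytes, n_bytes must be zero. If not, the sequence is incomplete/invalid.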
Bitwise operations are used to efficiently check the leading bits of each byte.
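The same prefix checks can also be written with masks instead of shifts. The sketch below is one possible formulation in Python, not the form used in the solutions above; the function names expected_continuations and is_continuation are made up for this example.

```python
from typing import Optional


def expected_continuations(byte: int) -> Optional[int]:
    """Classify a leading byte by masking off its prefix bits.

    Returns the number of continuation bytes that must follow, or None
    if the byte cannot start a character. Hypothetical helper.
    """
    byte &= 0xFF
    if (byte & 0b10000000) == 0b00000000:   # 0xxxxxxx: 1-byte character
        return 0
    if (byte & 0b11100000) == 0b11000000:   # 110xxxxx: 2-byte character
        return 1
    if (byte & 0b11110000) == 0b11100000:   # 1110xxxx: 3-byte character
        return 2
    if (byte & 0b11111000) == 0b11110000:   # 11110xxx: 4-byte character
        return 3
    return None                             # 10xxxxxx or 11111xxx


def is_continuation(byte: int) -> bool:
    """True when the byte has the form 10xxxxxx."""
    return (byte & 0b11000000) == 0b10000000
```

Masking with 0b11100000 and comparing to 0b11000000 tests the same three bits as (byte >> 5) == 0b110; which form to use is a matter of style.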
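Let's consider data = [197, 130, 1].
- 197 is 11000101 in binary. It starts with 110, so it is the leading byte of a 2-byte character. Set n_bytes = 1 (expecting 1 continuation byte).
- 130 is 10000010 in binary. It starts with 10, so it is a valid continuation byte. Decrement n_bytes to 0.
- 1 is 00000001 in binary. It starts with 0, so it is a complete 1-byte character. n_bytes remains 0.
- At the end of the array, n_bytes is 0, so the sequence is valid.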
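The walkthrough can be reproduced with a small tracing variant of the Python solution. This is only a sketch: the function trace_utf8 and the second input [235, 140, 4] are added here for illustration and are not part of the original solutions.

```python
def trace_utf8(data):
    """Validate `data` like validUtf8, recording the state after each byte.

    Returns (is_valid, steps), where each step is a pair of the byte's
    8-bit binary string and n_bytes after processing it. Bytes that
    cause a rejection are not recorded. Hypothetical helper.
    """
    steps = []
    n_bytes = 0
    for num in data:
        byte = num & 0xFF
        if n_bytes == 0:
            if (byte >> 5) == 0b110:
                n_bytes = 1
            elif (byte >> 4) == 0b1110:
                n_bytes = 2
            elif (byte >> 3) == 0b11110:
                n_bytes = 3
            elif byte >> 7:
                return False, steps      # invalid leading byte
        else:
            if (byte >> 6) != 0b10:
                return False, steps      # expected 10xxxxxx
            n_bytes -= 1
        steps.append((f"{byte:08b}", n_bytes))
    return n_bytes == 0, steps


valid, steps = trace_utf8([197, 130, 1])
for bits, remaining in steps:
    print(bits, "-> n_bytes =", remaining)
print("valid:", valid)
```

For the illustrative input [235, 140, 4], the first byte (11101011) announces a 3-byte character, the second is a valid continuation, and the third (00000100) does not start with 10, so the trace stops and reports the sequence as invalid.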
Brute-force approach: If you tried all possible groupings, the time complexity would be exponential, which is infeasible for large inputs.
Optimized approach: The above algorithm processes each byte exactly once, performing constant-time operations for each. Thus, the time complexity is O(N), where N is the length of data.
The space complexity is O(1) because only a few integer variables are used, regardless of the input size.
The key to solving the UTF-8 Validation problem is understanding the encoding rules and using bitwise operations to efficiently check each byte's validity. By keeping track of how many continuation bytes are expected, we can process the input in a single pass, making the solution both efficient and elegant. This approach avoids unnecessary complexity and leverages the structure of UTF-8 encoding for a straightforward validation algorithm.