DIY: UTF-8 Validation

Solve the interview question "UTF-8 Validation" in this lesson.

Problem description

Given an integer array data, return whether it is a valid UTF-8 encoding.

A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:

  • For a 1-byte character, the first bit of the byte is 0, followed by its Unicode code.
  • For an n-byte character, the first n bits of the first byte are all 1s and the (n + 1)th bit is 0. It is followed by n - 1 continuation bytes, each with 10 as its two most significant bits.

This is how the UTF-8 encoding represents characters in specific ranges:

Char. number range (hexadecimal)   UTF-8 octet sequence (binary)
0000 0000 - 0000 007F              0xxxxxxx
0000 0080 - 0000 07FF              110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF              1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF              11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
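
The bit patterns in the table can be checked with bit masks. The sketch below is one possible way to classify a single byte; the helper names sequence_length and is_continuation are hypothetical and not part of the exercise.

```cpp
#include <cstdint>

// Hypothetical helper (an assumption, not part of the exercise): returns the
// total number of bytes in the character that starts with `byte`, or 0 if
// `byte` cannot start a character (a 10xxxxxx continuation byte, or a prefix
// of five or more 1s).
int sequence_length(std::uint8_t byte) {
    if ((byte & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((byte & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((byte & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((byte & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Hypothetical helper: true if `byte` has the continuation pattern 10xxxxxx.
bool is_continuation(std::uint8_t byte) {
    return (byte & 0xC0) == 0x80;
}
```

Each mask keeps the prefix bits plus the 0 that terminates the prefix, so 110xxxxx is tested with the mask 0xE0 (three bits) rather than 0xC0 (two bits).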

Note: The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data. This means each integer represents only 1 byte of data.
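
As a small illustration of this note, the low 8 bits of an integer can be extracted with a mask of 0xFF. The function name low_byte is a hypothetical choice for this sketch.

```cpp
// Hypothetical helper: keeps only the least significant 8 bits of `value`,
// which is the single byte of data that the integer represents.
int low_byte(int value) {
    return value & 0xFF;
}
```

For example, 454 is 256 + 198, so low_byte(454) yields 198, the same byte as low_byte(198).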


The input will be a vector of integers, data. The following are two example inputs to the function:

// Example - 1
data = [198, 150, 9, 8]

// Example - 2
data = [255, 129, 129, 129, 129, 129, 129, 129]


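For the above inputs, the outputs will be:

// Example - 1
true   // 198 (11000110) starts a 2-byte character, 150 (10010110) continues it; 9 and 8 are 1-byte characters

// Example - 2
false  // 255 (11111111) matches none of the allowed first-byte patterns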

Coding exercise

For this coding exercise, you have to implement the valid_utf8(data) function, where data is a vector of integers. The function returns true if data is a valid UTF-8 encoding and false otherwise.
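
One possible approach, shown here as a minimal sketch rather than a reference solution, is to scan the vector once while tracking how many continuation bytes are still expected. The signature taking const std::vector<int>& is an assumption. The sketch checks only the rules stated in this lesson; it does not reject overlong encodings, surrogates, or code points above 10FFFF, which a full UTF-8 validator would also refuse.

```cpp
#include <vector>

// Sketch of valid_utf8 under the rules listed in this lesson only.
bool valid_utf8(const std::vector<int>& data) {
    int remaining = 0;  // continuation bytes still expected for the current character
    for (int value : data) {
        int byte = value & 0xFF;  // only the least significant 8 bits carry data
        if (remaining > 0) {
            // Inside a multi-byte character: the byte must look like 10xxxxxx.
            if ((byte & 0xC0) != 0x80) return false;
            --remaining;
        } else if ((byte & 0x80) == 0x00) {
            remaining = 0;  // 0xxxxxxx: 1-byte character
        } else if ((byte & 0xE0) == 0xC0) {
            remaining = 1;  // 110xxxxx: 2-byte character
        } else if ((byte & 0xF0) == 0xE0) {
            remaining = 2;  // 1110xxxx: 3-byte character
        } else if ((byte & 0xF8) == 0xF0) {
            remaining = 3;  // 11110xxx: 4-byte character
        } else {
            return false;   // 10xxxxxx as a first byte, or five or more leading 1s
        }
    }
    // The input must not end in the middle of a multi-byte character.
    return remaining == 0;
}
```

This runs in a single pass with constant extra space. The final check matters: an input such as [198] starts a 2-byte character but never supplies its continuation byte, so it is rejected.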
