# Background
In the world of computers, data is represented using bits, each of which can take only two values: 1 (high) and 0 (low). Therefore, our high-level data needs to be *encoded* as low-level data so that our machines can understand, manipulate, and communicate it.

Traditionally, the most common encoding system has been ASCII. **ASCII** encodes the English alphabet, numbers, symbols, and some special characters. With time, the need to incorporate more languages, symbols, characters, scripts, and even emoticons arose, and the need for a new encoding system became pressing.

The Unicode Consortium was incorporated in January 1991 in the state of California, four years after the concept of a new character encoding, to be called **Unicode**, was broached in discussions started by engineers from Xerox (Joe Becker) and Apple (Lee Collins and Mark Davis). Fast-forward to the 21st century: the two most popular ways to encode data are UTF-8 and UTF-16. Below is a graph showing how UTF-8 has grown in popularity since 2006.

# Features

- **Backward Compatibility**: The first 128 code points, ranging from `0x0000` to `0x007F`, map directly onto the ASCII code point range and are encoded as the same single bytes. This means that wherever ASCII was used, UTF-8 can replace it without any hassle.
- **Fallback and auto-detection**: There is software that supports extended ASCII encodings, which are eight-bit or larger encodings that include the standard seven-bit ASCII characters plus additional characters. These do not map directly onto UTF-8 code points. When a UTF-8 decoder detects extended ASCII, it can fall back or replace the 8-bit bytes with the appropriate code point.
- **Libraries**: When writing code, if the need to input/output non-ASCII data arises, you will need UTF-8 support. Fortunately, there are libraries that support UTF-8, such as ICU for C, C++, and Java.
- **Self-Synchronization**: In the table of byte layouts, notice that the leading byte of a multi-byte sequence starts with `11` while the continuation bytes start with `10`. This helps to separate character code points and avoid mistaking one character for another: if the stream of bits starts mid-sequence, an incorrect character will not be decoded.
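
The following is a minimal Python sketch of three of the features above: ASCII compatibility, fallback, and self-synchronization. The helper names `decode_with_fallback` and `next_boundary` are hypothetical, and Latin-1 is only an assumed example of an extended ASCII encoding; real software may detect and handle legacy encodings differently.

```python
# Backward compatibility: pure ASCII text produces identical bytes in UTF-8.
text = "Hello"
assert text.encode("ascii") == text.encode("utf-8")


def decode_with_fallback(data: bytes, fallback: str = "latin-1") -> str:
    """Try UTF-8 first; if the bytes are not valid UTF-8, decode them
    with an assumed legacy 8-bit encoding instead (hypothetical helper)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)


def next_boundary(data: bytes, start: int) -> int:
    """Return the index of the first byte at or after `start` that is NOT
    a continuation byte (pattern 10xx xxxx), i.e. the next character start."""
    i = start
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return i


# Fallback: 0xE9 is "é" in Latin-1 but an incomplete sequence in UTF-8.
legacy_bytes = "café".encode("latin-1")
print(decode_with_fallback(legacy_bytes))

# Self-synchronization: start reading in the middle of the 3-byte "€".
stream = "a€b".encode("utf-8")      # 1 byte + 3 bytes + 1 byte
resume_at = next_boundary(stream, 2)  # index 2 is a continuation byte
print(stream[resume_at:].decode("utf-8"))
```

Checking the top two bits with `& 0xC0` is enough to tell a continuation byte from any leading byte, which is why a reader dropped into the middle of a stream needs to skip at most three bytes to find the next character.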


# What is UTF-8?

UTF-8 efficiently encodes data, supports multiple languages, is compatible with ASCII, has auto-detection features, and facilitates error handling through self-synchronization.

| Number of Bytes | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|
| 1 | `0xxx xxxx` | | | |
| 2 | `110x xxxx` | `10xx xxxx` | | |
| 3 | `1110 xxxx` | `10xx xxxx` | `10xx xxxx` | |
| 4 | `1111 0xxx` | `10xx xxxx` | `10xx xxxx` | `10xx xxxx` |
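
To make the byte patterns concrete, here is a small Python sketch that fills the `x` positions by hand. The function name `encode_code_point` is hypothetical. The range limits (`0x7F`, `0x7FF`, `0xFFFF`, `0x10FFFF`) are not stated in the table itself; they follow from counting the available `x` bits in each row (7, 11, 16, and 21) together with the Unicode upper limit. In practice you would simply call `str.encode("utf-8")`.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point by filling in the bit patterns
    from the table (illustrative helper, not a library API)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:        # 1 byte:  0xxx xxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110x xxxx, 10xx xxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 bytes: 1110 xxxx, 10xx xxxx, 10xx xxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 1111 0xxx, 10xx xxxx, 10xx xxxx, 10xx xxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare the hand-rolled result with Python's built-in encoder.
for ch in ["A", "é", "€", "😀"]:
    manual = encode_code_point(ord(ch))
    assert manual == ch.encode("utf-8")
    print(ch, len(manual), " ".join(f"{b:08b}" for b in manual))
```

Printing each byte in binary lets you compare the output directly against the rows of the table: the prefix bits identify the row, and the remaining bits carry the code point.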
