Puzzle 5: Explanation
Explore how Rust represents strings internally as UTF-8 encoded byte vectors. Understand the distinction between bytes and Unicode characters, learn how to count and access characters safely using iterators, and discover the memory implications of handling Unicode text in Rust.
Test it out
Hit “Run” to see the code’s output.
Explanation
Counted by hand, “Halló heimur” contains 12 characters (including the space), yet Rust stores it in 13 bytes. To see where the extra byte comes from, let’s step back and look at how Rust’s String type works.
The internal definition of the String struct is quite straightforward.
pub struct String {
    vec: Vec<u8>,
}
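A String is just a vector of bytes (u8) that represents Unicode characters in an encoding called UTF-8. Rust source files are themselves UTF-8, so string literals are stored in this encoding, and a String guarantees that its contents are always valid UTF-8.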
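To make the byte-vector view concrete, here is a minimal sketch that exposes the bytes behind a string and rebuilds a String from them. The helper name `utf8_bytes` is invented for this illustration; `as_bytes` and `String::from_utf8` are standard library methods. It assumes the ó in the literal is the single precomposed code point U+00F3.

```rust
/// Hypothetical helper: copies the UTF-8 bytes that back a string slice.
fn utf8_bytes(s: &str) -> Vec<u8> {
    s.as_bytes().to_vec()
}

fn main() {
    let greeting = String::from("Halló heimur");

    // Look at the raw bytes stored inside the String.
    let bytes = utf8_bytes(&greeting);
    println!("{:?}", bytes);

    // from_utf8 validates the bytes before handing back a String.
    let rebuilt = String::from_utf8(bytes).expect("bytes came from a valid String");
    assert_eq!(rebuilt, greeting);

    // A lone continuation byte is not valid UTF-8, so the conversion fails.
    assert!(String::from_utf8(vec![0xB3]).is_err());
}
```

Because `from_utf8` returns a `Result`, invalid byte sequences are rejected up front instead of producing a String with broken contents.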
The illustration below shows us what the encoding looks like:
Our original string, “Halló heimur”, consists of 11 ASCII characters (including the space) and 1 Latin-1 Supplement character, ó. In UTF-8, each ASCII character takes one byte, and each Latin-1 Supplement character takes two. That makes 11 + 2 = 13 bytes for 12 characters.
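The difference between bytes and characters can be checked directly. The sketch below is one way to do it, using the standard `len`, `chars`, and `len_utf8` methods; the function name `byte_and_char_counts` is made up for this example, and the literal is assumed to contain the precomposed ó (U+00F3).

```rust
/// Hypothetical helper: returns (byte length, character count) for a string slice.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    // len() counts bytes; chars() iterates over Unicode scalar values.
    (s.len(), s.chars().count())
}

fn main() {
    let text = "Halló heimur";

    let (bytes, chars) = byte_and_char_counts(text);
    println!("{} bytes, {} characters", bytes, chars);

    // Show how many UTF-8 bytes each character needs.
    for c in text.chars() {
        println!("{:?} -> {} byte(s)", c, c.len_utf8());
    }

    // Reach a character by position through the iterator, not by byte index.
    assert_eq!(text.chars().nth(4), Some('ó'));
}
```

Going through `chars()` avoids slicing in the middle of a multi-byte character, which is why the iterator is the safer way to count and access characters.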
Rust’s string encoding is smart enough not to store extra zeros for each Unicode character. If it did, the string would be a vector of char types. Rust’s char is exactly four ...