The Relation to String Slices

This next bit isn’t strictly necessary to understand most programming in Rust, but I think it’s helpful.

There’s a data type we haven’t directly talked about yet, called a char. It represents a single character. This could be the letter A, or the @ sign, or the Hebrew letter Alef (א), or many other things. We’ll get back to that part. In Rust (and many other languages), a character literal is a character surrounded by single quotes, e.g., 'A'.

A string, logically, is a sequence of characters. You can think of "Hello" as ['H', 'e', 'l', 'l', 'o']. However, that’s not the way Rust actually represents a string. Instead, it does something totally different. Let’s see why.
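To make the "sequence of characters" view concrete, here is a small sketch that pulls the characters out of a string and collects them into a Vec. The helper name `chars_of` is made up for this example, and it uses a few features (Vec, iterators) that may not be familiar yet.

```rust
// Collect the characters of a string into a Vec<char>.
// `chars_of` is a hypothetical helper name used only in this sketch.
fn chars_of(s: &str) -> Vec<char> {
    s.chars().collect()
}

fn main() {
    let letters = chars_of("Hello");
    // Debug-print the logical view of the string.
    println!("{:?}", letters); // prints ['H', 'e', 'l', 'l', 'o']
}
```

This is only the logical view: the Vec is built on demand from the string, it is not how the string itself is stored.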

There are a lot of characters in the world. I mentioned the Latin alphabet, like the letter A. I mentioned symbols, like the @ sign. I mentioned other languages, like Hebrew. There are many, many thousands of characters we would like to deal with. There’s a standard called Unicode, maintained by a group called the Unicode Consortium, that gives a numeric representation for all of these characters. For example:

A is U+0041
א is U+05D0
😼 is U+1F63C
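You can check these numbers yourself. The sketch below casts a char to a u32 to get its Unicode number; `code_point` is a name invented for this example, and the Hebrew and emoji characters are written as `\u{...}` escapes so the source stays easy to read.

```rust
// Return the Unicode number (code point) of a character.
// `code_point` is a hypothetical helper name for this sketch.
fn code_point(c: char) -> u32 {
    c as u32
}

fn main() {
    // {:04X} prints the number in uppercase hex, padded to at least 4 digits.
    println!("U+{:04X}", code_point('A')); // prints U+0041
    println!("U+{:04X}", code_point('\u{05D0}')); // Hebrew Alef, prints U+05D0
    println!("U+{:04X}", code_point('\u{1F63C}')); // cat emoji, prints U+1F63C
}
```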

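As of Unicode 12.1, there are 137,994 characters in Unicode. The question then is, how big must a char be to hold all of those different potential values? Unicode numbers go as high as U+10FFFF, which needs 21 bits, so 2 bytes isn’t enough. Three bytes would technically fit, but Rust rounds up to a size that computers handle more comfortably: a char is 4 bytes.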

If we used that array-of-chars representation for a string, then the word “Hello” would take up 20 bytes. Considering that a large amount of the work that computers do involves mostly-Latin data, it would be nice to make that smaller.
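The 20-byte figure is easy to confirm. This is a minimal sketch using `std::mem::size_of` and `std::mem::size_of_val` from the standard library; the helper name `char_array_bytes` is made up for the example.

```rust
use std::mem::{size_of, size_of_val};

// Number of bytes a slice of chars occupies in memory.
// `char_array_bytes` is a hypothetical helper name for this sketch.
fn char_array_bytes(chars: &[char]) -> usize {
    size_of_val(chars)
}

fn main() {
    let hello = ['H', 'e', 'l', 'l', 'o'];
    println!("{}", size_of::<char>()); // prints 4
    println!("{}", char_array_bytes(&hello)); // prints 20
}
```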

Thankfully, there’s an encoding called UTF-8 that helps with that. An encoding says how to represent a sequence of characters as binary data. UTF-8 is cool for many reasons, but for our purpose specifically, it takes only 1 byte to store each ASCII character, which covers the unaccented Latin letters, the digits, and common punctuation. For many other common scripts, it uses only 2 or 3 bytes. And the most it ever uses is 4 bytes. So UTF-8 never takes up more memory than an array of chars, and usually takes up much less.
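Here is a rough sketch of that size difference, using the standard `char::len_utf8` and `str::len` methods (both report sizes in bytes). The helper names `utf8_len` and `char_array_size` are invented for this example.

```rust
// How many bytes UTF-8 needs to store a single character.
// `utf8_len` is a hypothetical helper name for this sketch.
fn utf8_len(c: char) -> usize {
    c.len_utf8()
}

// Size the same text would have as an array of 4-byte chars.
// `char_array_size` is also a hypothetical helper name.
fn char_array_size(s: &str) -> usize {
    s.chars().count() * std::mem::size_of::<char>()
}

fn main() {
    println!("{}", utf8_len('A')); // prints 1
    println!("{}", utf8_len('\u{05D0}')); // Hebrew Alef, prints 2
    println!("{}", utf8_len('\u{1F63C}')); // cat emoji, prints 4

    // str::len counts bytes, not characters.
    println!("{} vs {}", "Hello".len(), char_array_size("Hello")); // prints 5 vs 20
}
```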

Here is a demonstration of how the English versus Russian words for Hello are encoded as characters versus bytes:

String literal	"Hello"	"Алло"
As characters	['H', 'e', 'l', 'l', 'o']	['А', 'л', 'л', 'о']
As bytes	[72, 101, 108, 108, 111]	[208, 144, 208, 187, 208, 187, 208, 190]
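The byte rows of the table can be reproduced with the standard `str::bytes` method. In this sketch the Russian word is spelled with `\u{...}` escapes so that every letter is unambiguously Cyrillic (the Cyrillic А and о look identical to the Latin A and o but have different numbers); `bytes_of` is a name made up for the example.

```rust
// Collect the UTF-8 bytes of a string into a Vec<u8>.
// `bytes_of` is a hypothetical helper name for this sketch.
fn bytes_of(s: &str) -> Vec<u8> {
    s.bytes().collect()
}

fn main() {
    println!("{:?}", bytes_of("Hello"));
    // prints [72, 101, 108, 108, 111]

    // Cyrillic А (U+0410), л (U+043B), л (U+043B), о (U+043E)
    let russian = "\u{0410}\u{043B}\u{043B}\u{043E}";
    println!("{:?}", bytes_of(russian));
    // prints [208, 144, 208, 187, 208, 187, 208, 190]
}
```

Each Cyrillic letter takes 2 bytes, so the 4-character word is 8 bytes long, while the 5-character English word is only 5 bytes.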
