Implicit Conversions
Explore the impact of implicit conversions on Unicode and octet sequences in Perl. Understand how different string encodings interact during concatenation and the common pitfalls that arise. Learn the importance of decoding input and encoding output properly to avoid subtle bugs in handling Unicode data.
Unicode problems
Most Unicode problems in Perl arise because a string could be either a sequence of octets or a sequence of characters. Perl allows us to combine these types through the use of implicit conversions. When these conversions are wrong, they’re rarely obviously wrong, but they’re often spectacularly wrong in ways that are difficult to debug.
Concatenation
When Perl concatenates a sequence of octets with a sequence of Unicode characters, it implicitly decodes the octet sequence using the Latin-1 encoding, so each octet becomes the character with the same code point. The resulting string will contain Unicode characters. When we print a string containing characters above code point 255 without an output encoding layer, Perl writes it as UTF-8 and emits a “Wide character” warning, because Latin-1 covers only the first 256 Unicode code points and can’t represent the rest.
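The following minimal sketch illustrates the implicit Latin-1 decoding described above. The variable names and the `code_points` helper are hypothetical and exist only for this illustration. It assumes the octets came from a UTF-8 source, which is why decoding them explicitly with `Encode::decode` before concatenating produces the intended text.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: list the code points of a string in U+XXXX form.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}

# Octets: the UTF-8 encoding of "café" (5 bytes; é becomes 0xC3 0xA9).
my $octets = encode('UTF-8', "caf\x{e9}");

# Characters: a string holding a code point above 255.
my $chars = "\x{263A}";

# Implicit conversion: each octet is treated as one Latin-1 character,
# so 0xC3 and 0xA9 become two separate characters instead of one é.
my $mixed = $octets . $chars;

# Decoding the input explicitly first yields the intended characters.
my $fixed = decode('UTF-8', $octets) . $chars;

print 'mixed: ', code_points($mixed), "\n";
print 'fixed: ', code_points($fixed), "\n";
```

The `mixed` string ends up one character longer than `fixed`, because the two octets of the encoded é were each promoted to a character of their own. Encoding `mixed` as UTF-8 on output would then produce the familiar double-encoded text.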
The asymmetry between encodings and octets can lead to Unicode strings ...