Prelude: e and e

Quick, how many bytes make up the following line? No tricks, I promise.

Hello!

The correct answer is: 6. Or 7, if you want to be pedantic and include the newline, but let’s not.

This is simple enough: this page is encoded as UTF-8, which uses 8 bits, or one byte, for each ASCII character.

Let’s play again. How many bytes make up the following line?

Hello!!

Easy enough. 7. 7 characters, so 7 bytes.

$ echo -n 'Hello!!' | wc -c
7

One more time. What about the next line?

Hеllo!!!

Did you guess 8? Or perhaps you realized I was trying to trick you.

Well, I was trying to trick you.

The correct answer is… 9! 9 bytes.

Don’t believe me? Go ahead and copy it from this page, and run a quick check:

$ echo -n 'Hеllo!!!' | wc -c
9

Huh?? Where’d that extra byte come from? Well, the truth is, there’s an imposter in that line.

That’s right, it was the ol' UTF-8 Cyrillic ‘е’ trick. No one ever expects U+0435. If it’s your first time encountering this, a brief explanation: yes, UTF-8 is one-byte-per-character for the first 128 characters, which correspond exactly to traditional ASCII. But the full Unicode standard, which UTF-8 encodes, includes 143,859 characters as of version 13.0. To represent the full spectrum, UTF-8 must use multiple bytes for some characters (technically, most characters). It just so happens that this vast range of characters includes some look-alikes, like your friendly neighborhood 1-byte ASCII “e” and your less-loved but still friendly 2-byte Cyrillic “е”.
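If you'd rather see the imposter than take my word for it, dumping the raw bytes does the job. Here's a minimal sketch, assuming a POSIX shell with a UTF-8 terminal; `hex_bytes` is a helper name made up for this example, not a standard command.

```shell
# hex_bytes (hypothetical helper): print a string's bytes as space-separated hex.
# od pads its output differently across platforms, so whitespace is normalized.
hex_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr '\n' ' ' | tr -s ' ' | sed 's/^ //; s/ $//'
}

hex_bytes 'Hello!!!'   # all ASCII:     48 65 6c 6c 6f 21 21 21
hex_bytes 'Hеllo!!!'   # Cyrillic 'е':  48 d0 b5 6c 6c 6f 21 21 21
```

The single byte 65 is the ASCII "e". The pair d0 b5 is the two-byte UTF-8 encoding of U+0435, which is where the ninth byte comes from.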

UTF-8 uses one to four bytes to encode each Unicode code point.
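To watch that one-to-four range in action, here's a rough sketch that counts the bytes in one character of each size. It assumes a POSIX shell and that this text is saved or pasted as UTF-8; `byte_count` is a hypothetical helper name, and the four sample characters are my own picks.

```shell
# byte_count (hypothetical helper): number of bytes in a string.
# tr strips the leading padding that some wc implementations print.
byte_count() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

byte_count 'e'    # U+0065,  ASCII letter:    1
byte_count 'е'    # U+0435,  Cyrillic letter: 2
byte_count '€'    # U+20AC,  euro sign:       3
byte_count '😀'   # U+1F600, emoji:           4
```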

Enter: emojis