Quick, how many bytes make up the following line? No tricks, I promise.
Hello!
The correct answer is: 6. Or 7, if you want to be pedantic and include the newline, but let’s not.
This is simple enough: this page is encoded as UTF-8, which uses 8 bits, or one byte, for each ASCII character.
Let’s play again. How many bytes make up the following line?
Hello!!
Easy enough. 7. 7 characters, so 7 bytes.
$ echo -n 'Hello!!' | wc -c
7
One more time. What about the next line?
Hеllo!!!
Did you guess 8? Or perhaps you realized I was trying to trick you.
Well, I was trying to trick you.
The correct answer is… 9! 9 bytes.
Don’t believe me? Go ahead and copy it from this page, and run a quick check:
$ echo -n 'Hеllo!!!' | wc -c
9
Huh?? Where’d that extra byte come from? Well, the truth is, there’s an imposter in that line.
That’s right, it was the ol' UTF-8 Cyrillic ‘е’ trick. No one ever expects U+0435.
If it’s your first time encountering this, a brief explanation: yes, UTF-8 uses one byte per character for the first 128 code points, which correspond exactly to traditional ASCII. But the full Unicode standard, which UTF-8 encodes, includes 143,859 characters as of version 13.0. To represent that full range, UTF-8 must use multiple bytes for some characters (technically, most characters). It just so happens that this vast range includes some look-alikes, like your friendly neighborhood 1-byte ASCII “e” and your less-loved but still friendly 2-byte Cyrillic “е”.
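If you’d rather unmask the imposter than take my word for it, here is a minimal Python sketch. The helper name `find_non_ascii` is made up for this post, not something from a library; it leans only on the standard `unicodedata` module. The Cyrillic letter is written as the escape `\u0435` so it can’t get lost in a copy-paste.

```python
import unicodedata


def find_non_ascii(text):
    """List every character outside the 128-character ASCII range.

    Each hit is (index, character, code point, Unicode name, UTF-8 byte count).
    """
    hits = []
    for index, char in enumerate(text):
        if ord(char) > 127:  # ASCII ends at code point 127
            hits.append((
                index,
                char,
                f"U+{ord(char):04X}",
                unicodedata.name(char, "<unnamed>"),
                len(char.encode("utf-8")),
            ))
    return hits


# The trick line: 'H', then U+0435 instead of an ASCII 'e', then 'llo!!!'.
line = "H\u0435llo!!!"

print(len(line))                  # 8 characters (code points)
print(len(line.encode("utf-8")))  # 9 bytes once encoded as UTF-8
for hit in find_non_ascii(line):
    print(hit)  # (1, 'е', 'U+0435', 'CYRILLIC SMALL LETTER IE', 2)
```

Run against the honest `Hello!!` from earlier, the same helper returns an empty list.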
UTF-8 uses one to four bytes to encode each Unicode code point.
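To see the one-to-four-byte range in action, here is a small Python sketch. The sample characters are my own picks for illustration (one per byte length), and `utf8_length` is a hypothetical helper name.

```python
def utf8_length(char):
    """Return how many bytes UTF-8 needs for a single character."""
    return len(char.encode("utf-8"))


# One sample per byte length: ASCII 'e', Cyrillic 'е', the euro sign, an emoji.
samples = ["e", "\u0435", "\u20ac", "\U0001F600"]

for char in samples:
    encoded = char.encode("utf-8")
    hex_bytes = " ".join(f"{byte:02x}" for byte in encoded)
    print(f"U+{ord(char):04X}  {utf8_length(char)} byte(s)  {hex_bytes}")

# Prints:
# U+0065  1 byte(s)  65
# U+0435  2 byte(s)  d0 b5
# U+20AC  3 byte(s)  e2 82 ac
# U+1F600  4 byte(s)  f0 9f 98 80
```

The byte count grows with the code point: up to U+007F takes one byte, up to U+07FF two, up to U+FFFF three, and everything above that four.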