In 2006, Robert Andersen sent the first tweet that @mentioned another user, and an internet convention was born.

In a world without smartphones, Twitter’s primary interface was SMS. There were no threads, no @username autocomplete. If a user said something and you wanted to reply, your only option was to compose a new text to 40404, manually typing the @mention into your phone’s SMS text box. SMS was also the origin of Twitter’s original 140 character limit:

Twitter began as an SMS text-based service. This limited the original Tweet length to 140 characters (which was partly driven by the 160 character limit of SMS, with 20 characters reserved for commands and usernames).

In another holdover from the SMS days, Twitter does not allow any formatted or rich text. Users today work around this limitation by using the wide array of possibilities that Unicode affords. Text generators convert ASCII characters into 𝔲𝔫𝔲𝔰𝔲𝔞𝔩 𝔘𝔫𝔦𝔠𝔬𝔡𝔢 𝔠𝔥𝔞𝔯𝔞𝔠𝔱𝔢𝔯𝔰. Memes take advantage of repeated spaces to position text, or to draw Unicode houses. (All of that is, of course, completely inaccessible to screen readers.) Emoji allow expressing a wide range of emotions, with modifiers for skin tones and gender.

Critically, this “plain” text is fully copy-pastable into any app with Unicode support. While many Twitter users from 2006 were likely limited to the GSM-7 encoding, the limitations of the T9 keyboard, and their phone’s limited text rendering and shaping, modern users have found new ways to express themselves using the full Unicode character space. Nowadays, tweets, names, URLs, hashtags, basically everything can contain Unicode characters. Of course, everything except the @mention.
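To sketch what is happening under the hood, Python’s standard `unicodedata` module shows that these “fancy” letters are ordinary code points from the Mathematical Alphanumeric Symbols block, entirely distinct from ASCII, and that NFKC compatibility normalization folds them back to plain letters:

```python
import unicodedata

fancy = "𝔲𝔫𝔲𝔰𝔲𝔞𝔩"  # "unusual" in mathematical fraktur

# Each glyph is a real, separate code point, not a styled ASCII letter.
for ch in fancy[:2]:
    print(f"U+{ord(ch):05X} {unicodedata.name(ch)}")
# U+1D532 MATHEMATICAL FRAKTUR SMALL U
# U+1D52B MATHEMATICAL FRAKTUR SMALL N

# NFKC "compatibility" normalization maps them back to plain ASCII.
print(unicodedata.normalize("NFKC", fancy))  # unusual
```

This is also why such text survives copy-paste anywhere: to the receiving app it is just a string of code points, the same as any other.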

Usernames on Twitter have more or less the same requirements as during the SMS era — up to 15 characters drawn from the letters a-z, the digits 0-9, and the underscore (_). This requirement is similar to that of many other websites: GitHub only allows alphanumeric characters and non-repeating hyphens, and Facebook allows just alphanumeric characters and the period.
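A validator for rules like these fits in a one-line regular expression. The sketch below infers the pattern from the rules just described, not from Twitter’s actual implementation; the key point is that the character class is explicitly ASCII-only:

```python
import re

# Inferred from the rules above (not Twitter's actual code):
# 1-15 characters, each an ASCII letter, digit, or underscore.
# An explicit class like [A-Za-z0-9_] is ASCII-only, unlike \w,
# which would also match Unicode letters.
USERNAME_RE = re.compile(r"[A-Za-z0-9_]{1,15}")

def is_valid_username(name: str) -> bool:
    return USERNAME_RE.fullmatch(name) is not None

print(is_valid_username("jack"))    # True
print(is_valid_username("еріс"))    # False: Cyrillic letters
print(is_valid_username("a" * 16))  # False: too long
```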

As we will see, this limitation is one of several problems with usernames.

Unicode Usernames

Imagine if instead of alphanumerics, GitHub allowed usernames to only be made up of the numbers 0-9 and Chinese characters. It would be pretty frustrating, right? If forced to use a system like this, we would likely ignore the Chinese characters, and just use the numbers as usernames. This is often what happens on social media platforms in countries that don’t use Latin script. In China, the QQ instant messaging service (launched in 1999) did away with the usernames used by similar services like AOL. Instead, each user is assigned a unique number, their QQ ID. This number cannot be chosen by the user, and is immutable. WeChat similarly assigns users a random number when they sign up, although it does allow each user exactly one opportunity to change it to an alphanumeric string of their choice. Typing long numbers isn’t particularly ergonomic, which could help explain the widespread use of QR codes in China.

Even in countries that use Latin script, AOL-style usernames had issues. Because logging in required a username, and usernames had to be unique across all users, people were often forced to pick different usernames on different services, in effect creating a second password they had to memorize. Thankfully, most services have fixed at least this small problem by making login email- instead of username-based.

Throughout the world, usernames are alphanumeric only. Why can’t we just allow Unicode, so that usernames could contain Chinese, Cyrillic, or other non-Latin alphabets? Unfortunately, Unicode usernames have their own issues — just look at the largest username system in the world: the domain name system and the extension (internationalized domain names) that allows non-ASCII characters. This was an essential upgrade, and yet it has imperfections, allowing the registration of websites such as this one:

This is the latest Firefox, as of July 2020, displaying what appears to be “epic.com”, the website of a large healthcare company. However, as you can see, the content is actually some random blog reminding people to drink water. How is this possible? If we paste the domain into a Unicode character inspector, we can see what’s going on:

Byte-wise, the real “epic.com” and the fake “еріс.com” are completely different. But visually, they’re indistinguishable from each other in the URL bar, allowing phishing attacks to run amok. Unicode canonicalization and normalization can help with certain cases of this problem, but they do nothing for our epic.com example.
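We can reproduce what the character inspector shows with a few lines of Python, again using the standard `unicodedata` module. The fake domain is built from Cyrillic lookalikes, and normalization does not fold them to their Latin twins:

```python
import unicodedata

real = "epic"  # ASCII
fake = "еріс"  # Cyrillic lookalikes

for ch in fake:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0435 CYRILLIC SMALL LETTER IE
# U+0440 CYRILLIC SMALL LETTER ER
# U+0456 CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
# U+0441 CYRILLIC SMALL LETTER ES

# The strings render identically but compare unequal...
print(real == fake)  # False
# ...and NFKC normalization leaves the Cyrillic letters as-is,
# so it is no defense against this kind of confusable.
print(unicodedata.normalize("NFKC", fake) == real)  # False
```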

This particular example isn’t visible in Chrome, which instead shows https://xn--e1awd7f.com/, the “punycode” representation of the domain name. This is thanks to Chrome’s complex, 13-step process for detecting whether a domain name is likely to be a Unicode phish. “Well, it may be complex,” you tell me, “but at least it solves the phishing problem!” Unfortunately, it does not.
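You can produce the same Punycode form yourself with the `idna` codec in Python’s standard library. (A caveat: this codec implements the older IDNA 2003 standard, while browsers have moved to IDNA 2008; for an all-lowercase Cyrillic label like this one, though, the result is the same as what Chrome displays.)

```python
# Cyrillic е, р, і, с plus the ASCII ".com" label.
fake = "еріс.com"

# ToASCII: each non-ASCII label becomes an "xn--" Punycode label.
print(fake.encode("idna"))
# b'xn--e1awd7f.com'

# ToUnicode round-trips back to the original form.
print(b"xn--e1awd7f.com".decode("idna"))  # еріс.com
```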

Specific instances of IDN homograph attacks have been reported to Chrome, and we continually update our IDN policy to prevent against these attacks.

The Unicode spec is apparently too large for this problem to be solved 100% perfectly, and so the “solution” is to pay $2000 to anybody who finds new edge cases. This also doesn’t actually solve the problem for non-Latin alphabets: if, for example, I own a Chinese domain name, it will never be shown as punycode, and attackers can phish my site using duplicate encodings of those Chinese characters. Chrome only attempts to solve the much smaller problem of the numerous Unicode characters that visually resemble the Latin alphabet.