IDNs in Lagrange v0.13
Those who follow the Gemini mailing list may have noticed a message or two about IDNs and IRIs. This is the first time I'm taking a deeper look at this stuff, so here is what I've learned.
i18n
When it comes to Internationalized Domain Names, I have been blissfully unaware that they basically rely on a kludge: a complicated, special encoding that converts Unicode domain names to a small-ish ASCII representation. Well, RFC 3492 is 17 years old, so surely this is something that happens under the hood, a minor implementation detail in the OS? Alas, internationalization has been left to the application layer to worry about, so it needs to be handled manually.
Since Gemini allows UTF-8 encoded URLs, implementing RFC 3492 (Punycode) is virtually a requirement. Otherwise, one cannot make DNS lookups if the domain name contains non-ASCII characters.
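To give a feel for what the encoding does, here is a minimal Python sketch. It is not Lagrange's code; it leans on Python's built-in "idna" and "punycode" codecs, and the function name to_ascii_host and the host bücher.example are made up for illustration.

```python
# Sketch: turn a Unicode host name into the ASCII form that DNS understands.
# Assumption: Python's built-in "idna" codec (IDNA 2003 on top of RFC 3492)
# stands in for whatever a real client implements by hand.

def to_ascii_host(host: str) -> str:
    """Encode each non-ASCII label as 'xn--' + Punycode; ASCII labels pass through."""
    return host.encode("idna").decode("ascii")

# The raw RFC 3492 encoding of a single label, without the "xn--" prefix.
print("bücher".encode("punycode"))       # b'bcher-kva'

print(to_ascii_host("bücher.example"))   # xn--bcher-kva.example
print(to_ascii_host("example.org"))      # example.org
```

The ASCII form is what goes into the DNS query; the Unicode form is only for humans.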
As to the rest of the URL, the story is a bit simpler: normalization and escaping reserved characters. The former is needed because Unicode has multiple ways to represent the same character. Applications that deal with UTF-8 already need to use some sort of Unicode library to actually conform to the standard. Such a library should have routines for normalization, so that's one problem that's easy to deal with. (Lagrange uses GNU libunistring.) The other issue is handled by percent-encoding reserved characters, which is also straightforward.
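Both steps can be tried out in a few lines of Python. This is only a sketch using the standard library (unicodedata and urllib.parse) rather than libunistring; the helper name escape_path and the example path are invented for the demonstration.

```python
import unicodedata
from urllib.parse import quote

# Two ways to write "café": with a combining accent, or with a precomposed é.
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = "caf\u00e9"      # single codepoint U+00E9

print(decomposed == composed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == composed)   # True

def escape_path(path: str) -> str:
    """NFC-normalize a path, then percent-encode it as UTF-8 (keeping '/')."""
    nfc = unicodedata.normalize("NFC", path)
    return quote(nfc, safe="/")

print(escape_path("/menu/cafe\u0301 au lait"))   # /menu/caf%C3%A9%20au%20lait
```

Without the normalization step, the two spellings of the same word would end up as two different percent-encoded paths.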
All these encodings and translations should happen automatically and transparently.
Have some URLs with ❤️
Lagrange v0.13 embraces Unicode in both domain names and URL paths:
- In the user interface, Unicode characters are shown wherever URLs are displayed: the URL bar, history, bookmark editor, etc.
- You can disable URL decoding with a new setting in Preferences. This will show you all non-ASCII characters as percent-encoded UTF-8 (as was done in prior versions).
- The full URL is NFC normalized before sending it to a server.
- Domain names with non-ASCII characters are encoded to Punycode before doing a DNS lookup. The Punycode version of the domain name is sent to the server in the request URL, and also used for verifying the server certificate.
- Paths are percent-encoded as usual before sending requests to a server.
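The list above boils down to two conversions: a "wire" form that is sent to the server and a "display" form shown in the UI. Here is a rough Python sketch of both directions. The function names to_wire and to_display are hypothetical, the code assumes an absolute URL with a host, and it is a simplification rather than a description of Lagrange's internals.

```python
import unicodedata
from urllib.parse import urlsplit, urlunsplit, quote, unquote

def _netloc(host: str, port) -> str:
    """Re-attach an explicit port, if the URL had one."""
    return host if port is None else f"{host}:{port}"

def to_wire(url: str) -> str:
    """Unicode URL -> ASCII URL suitable for a request (sketch only)."""
    parts = urlsplit(unicodedata.normalize("NFC", url))   # NFC the full URL first
    host = parts.hostname.encode("idna").decode("ascii")  # Punycode the domain
    # '%' is kept safe so already-escaped sequences are not escaped twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, _netloc(host, parts.port), path, query, ""))

def to_display(url: str) -> str:
    """ASCII request URL -> human-readable form for the UI (sketch only).

    Caveat: decoding every escape can change the meaning of a URL
    (e.g. %2F becomes '/'), so a real client has to be more selective.
    """
    parts = urlsplit(url)
    host = parts.hostname.encode("ascii").decode("idna")
    return urlunsplit((parts.scheme, _netloc(host, parts.port),
                       unquote(parts.path), unquote(parts.query), ""))

wire = to_wire("gemini://bücher.example/käse")
print(wire)               # gemini://xn--bcher-kva.example/k%C3%A4se
print(to_display(wire))   # gemini://bücher.example/käse
```

The same ASCII host name from to_wire would also be the one to compare against the server certificate.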
Text rendering
Speaking of Unicode, actually rendering it on screen is not straightforward at all. Lagrange uses custom text rendering routines that currently only support left-to-right text. A small number of special Unicode codepoints are recognized and handled (such as soft hyphens) but many are just ignored, for example variation selectors.
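As a rough idea of what "recognized" versus "ignored" can mean in practice, here is a small Python sketch. It is an assumption about the general approach, not Lagrange's renderer: soft hyphens are remembered as possible line-break positions and variation selectors are simply skipped. The function names are made up.

```python
SOFT_HYPHEN = "\u00ad"

def is_variation_selector(ch: str) -> bool:
    """True for VS1..VS16 and the Variation Selectors Supplement block."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

def prepare_run(text: str):
    """Return (glyph_text, break_points).

    glyph_text   -- codepoints that should actually produce glyphs
    break_points -- indices into glyph_text where a soft hyphen permits a break
    """
    glyphs = []
    break_points = []
    for ch in text:
        if ch == SOFT_HYPHEN:
            break_points.append(len(glyphs))   # handled: optional break here
        elif is_variation_selector(ch):
            continue                           # ignored: no glyph, no effect
        else:
            glyphs.append(ch)
    return "".join(glyphs), break_points

# "hy<soft hyphen>phen" plus a heart followed by VS16 (emoji presentation)
print(prepare_run("hy\u00adphen \u2764\ufe0f"))   # ('hyphen ❤', [2])
```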
Version 0.13 has a bunch of improvements for text rendering:
- There is a new monospace font (Iosevka) that has a more retro/terminal-like design and improved Unicode coverage compared to Fira Mono. It is also a bit more compact, allowing more content to fit horizontally.
- When Emojis are used in monospace text, the spacing is relaxed a bit so wide Emojis don't overlap each other. The original spacing is restored after whitespace so text stays aligned.
- Unavailable Emoji variants (e.g., color) fall back to the available ones. Currently Lagrange uses a monochrome Emoji font.
- I made further tweaks to clean up box-drawing and other full-height characters. Previously, depending on text scaling, consecutive lines may have overlapped by one pixel or had a gap between the lines.
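The relaxed Emoji spacing in monospace text can be pictured with a toy layout function. This Python sketch is a guess at the idea, not the actual algorithm: positions are in units of one character cell, the 1.5-cell Emoji advance and the codepoint range used by is_wide are arbitrary assumptions, and both function names are hypothetical.

```python
def is_wide(ch: str) -> bool:
    """Crude assumption: treat U+1F300..U+1FAFF as wide Emoji."""
    return 0x1F300 <= ord(ch) <= 0x1FAFF

def layout(text: str, wide_advance: float = 1.5):
    """Return the x position (in cells) of each character in a monospace run."""
    xs = []
    pen = 0.0       # where the next glyph will be placed
    ink_end = 0.0   # right edge of the last non-space glyph
    for i, ch in enumerate(text):
        if i > 0 and text[i - 1].isspace():
            # After whitespace, snap back to the strict grid column if possible,
            # but never on top of a glyph that was already drawn.
            pen = max(float(i), ink_end)
        xs.append(pen)
        pen += wide_advance if is_wide(ch) else 1.0
        if not ch.isspace():
            ink_end = pen
    return xs

print(layout("ab c"))      # [0.0, 1.0, 2.0, 3.0]
print(layout("😀😀 ab"))   # [0.0, 1.5, 3.0, 3.0, 4.0]
```

In the second example the two Emoji spread out over three cells, the space absorbs the slack, and "a" lands back in column 3 where a strict grid would have put it.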
📅 2020-12-13
CC-BY-SA 4.0