Unicode shenanigans

How does the gemini protocol handle unicode zero-width emoji manipulation?

😀​‌​​‌​​​​‌‌​‌‌‌‌​‌‌‌​‌‌‌​​‌​​​​​​‌‌​​‌​​​‌‌​‌‌‌‌​‌‌​​‌​‌​‌‌‌​​‌‌​​‌​​​​​​‌‌‌​‌​​​‌‌​‌​​​​‌‌​​‌​‌​​‌​​​​​​‌‌‌​​​​​‌‌‌​​‌​​‌‌​‌‌‌‌​‌‌‌​‌​​​‌‌​‌‌‌‌​‌‌​​​‌‌​‌‌​‌‌‌‌​‌‌​‌‌​​​​‌​​​​​​‌‌​‌​​​​‌‌​​​​‌​‌‌​‌‌‌​​‌‌​​‌​​​‌‌​‌‌​​​‌‌​​‌​‌​​‌​​​​​​‌‌‌​‌​​​‌‌​‌​​​​‌‌​‌​​‌​‌‌‌​​‌‌​​‌​​​​​​‌‌​​‌​‌​‌‌​‌‌​‌​‌‌​‌‌‌‌​‌‌​‌​‌​​‌‌​‌​​‌​​‌‌‌‌‌‌

This emoji above contains information. Can you decode it?

Does gemini preserve it? Does BBS preserve it?
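
(For anyone who wants to try decoding it: a minimal Python sketch, assuming the common scheme where U+200B encodes bit 0 and U+200C encodes bit 1, packed eight bits per byte. The actual scheme used above may differ.)

```
# Minimal zero-width steganography decoder sketch.
# Assumption: U+200B (zero-width space) = bit 0, U+200C (zero-width non-joiner) = bit 1.

ZERO = "\u200b"  # zero-width space      -> assumed bit 0
ONE  = "\u200c"  # zero-width non-joiner -> assumed bit 1

def decode_zero_width(text: str) -> bytes:
    # Keep only the two carrier characters; all visible text is ignored.
    bits = "".join("0" if ch == ZERO else "1" for ch in text if ch in (ZERO, ONE))
    # Regroup the bit string into bytes.
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits) - 7, 8))

# Paste the emoji line into `payload` to try it:
# payload = "😀..."
# print(decode_zero_width(payload).decode("utf-8", errors="replace"))
```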

#gemini #software #tech

🚀 LucasMW

Oct 06 · 2 months ago

9 Comments ↓

🚀 stack · Oct 06 at 13:39:

Unicode was a terrible mistake. Localism makes more sense than globalism in every way. ASCII and one extra code page for local language would suffice. We are minimalists, right?

🚀 LucasMW [OP] · Oct 06 at 13:44:

Depending on the local language, it would not.

Regardless, unicode seems like both a way to have programs interface with all languages and a bug & vulnerability fountain.

🦎 bluesman · Oct 06 at 14:01:

Alhena detects a single emoji ('\uD83D\uDE00') on a 322-byte line. That emoji is displayed as either a color sprite or a monochrome font depending on preferences. The remaining 320 bytes - your shenanigans, I assume - are passed to the rendering component, where they get ignored. So yes, I'd say the bytes are preserved in BBS but not rendered (at least in Alhena).

That said, Alhena will display proper ZWJ emojis if set to use color sprites:

🦹🏻‍♂️
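
For reference, that single glyph is really five codepoints fused together by the renderer; a quick Python peek (the printed names depend on your Python's Unicode data version):

```
# Decompose the ZWJ emoji sequence from above into its codepoints.
import unicodedata

emoji = "\U0001F9B9\U0001F3FB\u200D\u2642\uFE0F"  # 🦹🏻‍♂️
for ch in emoji:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch, '<unnamed>')}")
```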

🚀 LucasMW [OP] · Oct 06 at 14:27:

@bluesman Thanks for the testing. I am experimenting with unicode and learning new things every day!

🕹️ skyjake [...] · Oct 06 at 15:00:

> How does the gemini protocol handle unicode zero-width emoji manipulation?

The Gemini protocol just transports the response contents as-is. If you are using "text/gemini;charset=utf-8" (like virtually everyone is), then it's just regular UTF-8 text and the client will attempt to render it.
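
Concretely, a response is a single header line followed by the body bytes verbatim, so zero-width characters survive the trip untouched. A rough Python sketch of the shape (illustrative only, not taken from any particular client):

```
# Shape of a Gemini response: "<status> <meta>\r\n" then the body as-is.
# The transport never normalizes or filters the body bytes.

raw = (b"20 text/gemini; charset=utf-8\r\n"
       b"\xf0\x9f\x98\x80\xe2\x80\x8b\xe2\x80\x8c")  # emoji + zero-width chars

header, _, body = raw.partition(b"\r\n")
status, _, meta = header.partition(b" ")
print(status.decode(), meta.decode())  # 20 text/gemini; charset=utf-8
print(body)                            # exactly the bytes the server sent
```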

🦂 zzo38 · Oct 06 at 20:13:

I agree that Unicode was (and is) a mistake, although not due to localism and globalism. One character set cannot be suitable for all uses. Sometimes it is useful to have multiple languages and writing systems together, but even in the cases where that is appropriate, Unicode is not a good way to do it.

🚀 stack · Oct 06 at 20:25:

I did not mean globalism in the political sense. It's just too many god damned codepoints with too many meanings, whereas 99.999% of the time you just need some 8-bit text. In the meantime you have a bloated display architecture, no good way to pre-render character sets without weird caching techniques, and a loss of the ability to count characters by counting bytes.

Also, weird ways to scam the users with similar-looking characters that would never be needed in real life.
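
To make those two points concrete, a small Python illustration (the Cyrillic 'а' stands in as one well-known homoglyph):

```
# Byte length vs. character count, and a homoglyph pair.
s = "naïve 😀"
print(len(s), len(s.encode("utf-8")))   # 7 codepoints vs. 11 bytes

latin = "paypal"                        # all Latin letters
mixed = "pаypаl"                        # the two 'а's are Cyrillic U+0430
print(latin == mixed)                   # False, though they look identical
```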

🦂 zzo38 · Oct 06 at 22:04:

I agree with you, those are some of the problems with Unicode, although there are others as well. I write programs (and file formats, protocols, etc) that do not use Unicode (although many programs/etc won't and shouldn't care what character set you are using).

👽 TKurtBond · Oct 09 at 21:17:

I use Unicode regularly in text that I compose (I do it in Emacs, and have my own keyboard shortcuts for the characters I use). It is more convenient for me than ASCII or Latin-1, etc. I agree it is horribly complicated, and I wish there were a better way to do things, but it works for me, whether I'm writing reStructuredText, Markdown, Troff/Groff, LaTeX or ConTeXt. I regularly use characters from outside any 8-bit set of codes, and while I'm not a heavy user of foreign languages, I do use some, mostly names and occasional quotes. But there are people who regularly write documents in multiple languages, and 8-bit codes are too limited for them.