Some Misc Notes

Unicode Terminology

Some Unicode terminology I had to look up. Sources:

This info is all over the internet, but here’s my summary:

Character
This term tends to be overloaded. It usually means one of three things: a grapheme, a code point, or a glyph.
Code Point
The number representing a given Unicode character/symbol. Written like “U+12ab” (that’s in hex). Code points go from U+0000 to U+10ffff. Unicode is split into 17 “planes” of 65,536 code points each; U+0000 to U+ffff is the first plane (plane 0) and is called the “basic multilingual plane” (BMP). Code points assigned to characters also have a long Unicode character name (for example, ψ is “GREEK SMALL LETTER PSI”).
Character Encoding
How you go between code points and bytes. UTF-8 is the most common encoding, but you’ll also see others (including UTF-16).
Code Unit
The unit of storage for a given character encoding. In UTF-8 this is 8 bits. In UTF-16 it’s 16 bits. Storing a string takes up code units on disk or in memory. In UTF-8, a code point maps to one, two, three, or four code units; in UTF-16, it maps to one or two.
Grapheme
The thing that’s displayed as a single graphical character. May consist of one or more code points.
Glyph
A graphical image stored in a font, one or more of which represent a grapheme.
Surrogate Pair
Since a UTF-16 code unit is only 2 bytes, a single code unit can’t represent code points above U+FFFF. Those code points are encoded as two code units, which together make up a “surrogate pair”.
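
To make a few of the terms above concrete, here’s a small Python sketch. The helper names (code_point_label, plane, code_units, surrogate_pair) are my own, made up for illustration, not from any library. I use the "utf-16-le" codec so that Python doesn’t prepend a byte order mark and throw off the code unit counts.

```python
import unicodedata

def code_point_label(ch):
    """Format a single character's code point as U+XXXX (hex)."""
    return f"U+{ord(ch):04X}"

def plane(cp):
    """Each plane holds 0x10000 code points, so the plane is cp // 0x10000."""
    return cp >> 16

def code_units(s, encoding):
    """Count code units: 1 byte each in UTF-8, 2 bytes each in UTF-16."""
    unit_size = {"utf-8": 1, "utf-16-le": 2}[encoding]
    return len(s.encode(encoding)) // unit_size

def surrogate_pair(cp):
    """Split a code point above U+FFFF into its UTF-16 (high, low) code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

psi = "\u03c8"        # GREEK SMALL LETTER PSI, in the BMP
face = "\U0001F600"   # an emoji in plane 1, above U+FFFF

print(code_point_label(psi), unicodedata.name(psi))  # U+03C8 GREEK SMALL LETTER PSI
print(plane(ord(psi)), plane(ord(face)))             # 0 1
print(code_units(psi, "utf-8"), code_units(psi, "utf-16-le"))    # 2 1
print(code_units(face, "utf-8"), code_units(face, "utf-16-le"))  # 4 2
print([hex(u) for u in surrogate_pair(ord(face))])   # ['0xd83d', '0xde00']
```

Note that Python’s len() on a str counts code points, not code units or graphemes, which is why the sketch encodes to bytes before counting.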

Notes:

  • Sounds like the grapheme for something like “ä” can be represented either by the single precomposed code point U+00e4 (the older, legacy style) or as “a” followed by the combining diaeresis U+0308.
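
The two spellings of “ä” in the note above can be checked with Python’s standard unicodedata module. This is only a sketch: same_text is a helper name I made up, and comparing strings after NFC normalization is one reasonable way to treat the two forms as equal, not the only one.

```python
import unicodedata

precomposed = "\u00e4"   # single code point: LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # "a" followed by COMBINING DIAERESIS

def same_text(a, b):
    """Compare two strings after normalizing both to NFC (composed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(precomposed == decomposed)            # False: different code point sequences
print(len(precomposed), len(decomposed))    # 1 2 (len counts code points)
print(same_text(precomposed, decomposed))   # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Both strings display as the same grapheme, but a plain == comparison sees different code points, so normalizing before comparing or hashing avoids surprises.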