C++ friends, is there a standard way to iterate over Unicode code points (not code units) in a string (or I guess a u8string)?
edit: yes, I know how to decode UTF-8 manually; my question is about the STL specifically
@luna@pony.so So basically, iterating over bytes in UTF-8 or 16-bit words in UTF-16/UCS-2?
@mikebabcock Those are code units
@krans Oh okay, I had it reversed, sorry. As a Python programmer I just call those characters, because Python innately differentiates between characters and encodings. My C++ knowledge is 10 years out of date, alas, so I can't be much help, but good luck!
@mikebabcock Quick guide to Unicode terminology:
- code units: the in-memory elements of the text encoding, i.e. bytes for UTF-8, 32-bit integers for UTF-32, etc
- codepoints: the numbers in the range 0–0x10FFFF that are mapped to abstract characters
- graphemes: the smallest functional units of a script, formed from one or more codepoints
- grapheme clusters: the things people usually would describe as “a character” for the purpose of cursor motion, “the number of characters,” etc.
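To make the first two items concrete, here is a minimal C++ sketch that walks a UTF-8 string code point by code point. `decode_utf8` is a hypothetical hand-rolled helper, not a standard library function, and it assumes well-formed UTF-8: it does not reject overlong forms, stray continuation bytes, or encoded surrogates. It makes no attempt at grapheme clusters, which need the Unicode segmentation rules.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of the standard library).
// Assumption: `text` is well-formed UTF-8; no validation is performed.
std::vector<char32_t> decode_utf8(std::string_view text) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < text.size()) {
        // The lead byte tells us how many code units this code point uses.
        unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t len;
        char32_t cp;
        if (lead < 0x80)      { len = 1; cp = lead; }          // 0xxxxxxx
        else if (lead < 0xE0) { len = 2; cp = lead & 0x1F; }   // 110xxxxx
        else if (lead < 0xF0) { len = 3; cp = lead & 0x0F; }   // 1110xxxx
        else                  { len = 4; cp = lead & 0x07; }   // 11110xxx

        // Debug-build check for a sequence cut off at the end of the input.
        assert(i + len <= text.size() && "truncated UTF-8 sequence");

        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; k < len && i + k < text.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(text[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Under these assumptions, `decode_utf8("e\xCC\x81")` yields two code points (U+0065 and U+0301) from three code units, even though a reader would see a single character.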
@krans @mikebabcock Are D800-DFFF "codepoints"? I don't think so, but I usually use the unambiguous term "Unicode scalar values" where they're clearly excluded.
@dalias Yes, if it's not mapped to a character, it's not a codepoint. Sorry, my wording was ambiguous.
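The "Unicode scalar value" notion mentioned above can be sketched as a plain range check: any value in 0–0x10FFFF that is outside the surrogate range D800–DFFF. The function names here are made up for illustration and are not from any library.

```cpp
#include <cstdint>

// Hypothetical helpers, named for illustration only.

// True for values in the UTF-16 surrogate range D800-DFFF.
constexpr bool is_surrogate(std::uint32_t v) {
    return v >= 0xD800 && v <= 0xDFFF;
}

// True for Unicode scalar values: 0-0x10FFFF, excluding surrogates.
constexpr bool is_scalar_value(std::uint32_t v) {
    return v <= 0x10FFFF && !is_surrogate(v);
}
```

A decoder that wants to hand back only scalar values could run this check on each decoded value and treat a failure as malformed input.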
@krans @dalias aside, this is just more proof that the terminology needs revision.
The fact that Unicode is just a numbered list of possible visual language thingies that can sometimes be combined to make other logical language thingies and those numbers can be encoded in a bunch of different ways is already complex enough for most people.
(Never mind 16+ bit encodings having endian issues)
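A small sketch of the endian issue: the same 16-bit code unit serializes to two different byte sequences depending on byte order. The helper names are hypothetical, chosen only for this illustration.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helpers: serialize one UTF-16 code unit to bytes.

// Big-endian: most significant byte first.
constexpr std::array<std::uint8_t, 2> utf16_unit_be(char16_t unit) {
    return {{ static_cast<std::uint8_t>(unit >> 8),
              static_cast<std::uint8_t>(unit & 0xFF) }};
}

// Little-endian: least significant byte first.
constexpr std::array<std::uint8_t, 2> utf16_unit_le(char16_t unit) {
    return {{ static_cast<std::uint8_t>(unit & 0xFF),
              static_cast<std::uint8_t>(unit >> 8) }};
}
```

For U+20AC (the euro sign), the big-endian bytes are 20 AC and the little-endian bytes are AC 20, so a reader has to know or detect the byte order before decoding. UTF-8 avoids this because its code units are single bytes.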
@mikebabcock Human language is complex. As far as I can tell, most of the complexity in Unicode arises from scripts being inherently complex; the remainder is due to providing a migration path from older encoding forms.
I haven't found any complexity in Unicode that I didn't (grudgingly) agree was necessary, apart from emoji…