C++ friends, is there a standard way to iterate over Unicode code points (not code units) in a string (or I guess a u8string)?

edit: yes, I know how to decode UTF-8 manually; my query is about the STL specifically
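
[The standard library has no direct codepoint iterator, and the <codecvt> conversion facets were deprecated in C++17, so the usual fallback is a hand-rolled decoder. A minimal sketch, assuming well-formed UTF-8 input; the function name is illustrative:

    #include <cstddef>
    #include <string_view>
    #include <vector>

    // Sketch only: assumes well-formed UTF-8; no validation of
    // continuation bytes, overlong forms, or surrogate values.
    std::vector<char32_t> codepoints(std::string_view utf8) {
        std::vector<char32_t> out;
        for (std::size_t i = 0; i < utf8.size();) {
            unsigned char b = static_cast<unsigned char>(utf8[i]);
            char32_t cp;
            std::size_t len;
            if      (b < 0x80) { cp = b;        len = 1; }  // 0xxxxxxx
            else if (b < 0xE0) { cp = b & 0x1F; len = 2; }  // 110xxxxx
            else if (b < 0xF0) { cp = b & 0x0F; len = 3; }  // 1110xxxx
            else               { cp = b & 0x07; len = 4; }  // 11110xxx
            for (std::size_t j = 1; j < len && i + j < utf8.size(); ++j)
                cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + j]) & 0x3F);
            out.push_back(cp);
            i += len;
        }
        return out;
    }
]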

@luna@pony.so basically, iterating bytes in UTF-8 or words in UTF/UCS-16?

@krans oh okay, I had it reversed, sorry. As a Python programmer I just call those characters, because Python innately differentiates between characters and encodings. My C++ knowledge is 10 years out of date, alas, so I'm not helpful, but good luck!

@mikebabcock Quick guide to Unicode terminology:

- code units: the in-memory elements of the text encoding, i.e. bytes for UTF-8, 16-bit integers for UTF-16, 32-bit integers for UTF-32, etc.
- codepoints: the numbers in the range 0–0x10FFFF that are mapped to abstract characters
- graphemes: the smallest functional units of a script, formed from one or more codepoints
- grapheme clusters: the things people would usually describe as “a character” for the purposes of cursor motion, “the number of characters,” etc.
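
[To make the distinction concrete, a toy sketch counting the first two levels for “é” spelled as 'e' plus U+0301 COMBINING ACUTE ACCENT. The grapheme-cluster count is asserted by hand, since clustering needs the Unicode segmentation rules, e.g. via ICU; the helper name is illustrative:

    #include <cstddef>
    #include <iostream>
    #include <string_view>

    // Codepoints can be counted in UTF-8 by skipping continuation
    // bytes, which all match the bit pattern 10xxxxxx.
    std::size_t count_codepoints(std::string_view utf8) {
        std::size_t n = 0;
        for (unsigned char b : utf8)
            if ((b & 0xC0) != 0x80) ++n;
        return n;
    }

    int main() {
        std::string_view s = "e\xCC\x81";  // 'e' + U+0301 (combining acute)
        std::cout << s.size() << " code units, "
                  << count_codepoints(s)
                  << " codepoints, 1 grapheme cluster\n";
        // prints: 3 code units, 2 codepoints, 1 grapheme cluster
    }
]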

@krans @mikebabcock Are D800-DFFF "codepoints"? I don't think so, but I usually use the unambiguous term "Unicode scalar values" where they're clearly excluded.

@dalias @mikebabcock Yes: if it's not mapped to a character, it's not a codepoint. Sorry, my wording was ambiguous.
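
[That distinction fits in one line of code, assuming the standard ranges; the function name is illustrative:

    #include <cstdint>

    // A Unicode scalar value is any value in [0, 0x10FFFF] minus
    // the surrogate range [0xD800, 0xDFFF].
    constexpr bool is_unicode_scalar_value(std::uint32_t v) {
        return v <= 0x10FFFF && (v < 0xD800 || v > 0xDFFF);
    }
]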

@krans @dalias As an aside, this is just more proof that the terminology needs revision.
The fact that Unicode is just a numbered list of possible visual language thingies, which can sometimes be combined to make other logical language thingies, and that those numbers can be encoded in a bunch of different ways, is already complex enough for most people.
(Never mind 16-bit-and-wider encodings having endian issues.)
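
[To illustrate the endianness point: the same 16-bit code unit serializes to two different byte sequences, which is why UTF-16 streams carry a BOM or an explicit LE/BE label. A small sketch using U+20AC EURO SIGN:

    #include <cstdio>

    int main() {
        // U+20AC EURO SIGN is the single UTF-16 code unit 0x20AC;
        // its byte order on the wire depends on the variant.
        const unsigned char utf16le[] = {0xAC, 0x20};  // UTF-16LE: low byte first
        const unsigned char utf16be[] = {0x20, 0xAC};  // UTF-16BE: high byte first
        std::printf("LE: %02X %02X\nBE: %02X %02X\n",
                    utf16le[0], utf16le[1], utf16be[0], utf16be[1]);
    }
]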

@mikebabcock @dalias Human language is complex. As far as I can tell, most of the complexity in Unicode arises from scripts being inherently complex; the remainder is due to providing a migration path from older encoding forms.

I haven't found any complexity in Unicode that I didn't (grudgingly) agree was necessary, apart from emoji…