Wait, what?
is an emoji, so it is built up from surrogate pairs, right?
NOPE! Turns out it consists of U+2764 (plain symbol) and U+FE0F (Variation Selector 16)
This is why you should use Intl.Segmenter() and just deal with its abysmal performance
@sir_pepe There's scope for improving extended grapheme cluster segmentation performance by using vector instructions.
Since the vast majority of text has one codepoint per grapheme cluster, some applications use text data structures that break text into runs of trivial and multi-CP graphemes. #Unicode
@krans Yeah, if you are shackled to JavaScript like I am, splitting the sting is the only thing that helps. Provided you know how things like work ._.
@sir_pepe I think that JS implementations should consider implementing Unicode algorithms as intrinsics. Every non-trivial JS program needs to be able to handle text robustly and fast…
@krans Exactly what I'm thinking. On the other hand, things on the web appear mostly work. I don't know how or why.