Notes about Unicode effort

Today, we have Unicode and CombinedCharacter classes in Pharo, and there is different but similar Unicode code in Squeak. These are too simple (even though they might work, partially).
The scope of the original threads is way too wide: a new string type, normalisation, collation, being cross-dialect, mixing all kinds of character and encoding definitions. All interesting, but not much will come out of it. But the point that we cannot leave proper text string handling to an outside library is indeed key.

That is why a couple of people in the Pharo community (myself included) started an experimental, proof-of-concept prototype project that aims to improve Unicode support. We will announce it to a wider public when we feel we have something to show for it. The goal is first of all to understand and implement the fundamental algorithms, starting with the four normalisation forms (NFC, NFD, NFKC and NFKD). But we’re working on collation/sorting too.
This work is of course being done for/in Pharo, using some of the facilities only available there. It probably won’t be difficult to port, but we can’t be bothered with portability right now.
What we started with is loading the UCD data and making it available as nice objects (30,000 of them).
So now you can do things like:
$é unicodeCharacterData.
$é unicodeCharacterData uppercase asCharacter.
 => "$É"

$é unicodeCharacterData decompositionMapping.
 => "#(101 769)"
There is also a cool GT Inspector view for these objects.
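As a cross-check from outside Pharo, the same two UCD properties can be read with Python's standard `unicodedata` module. This is only a sketch for comparison, not the prototype's code; the helper name `decomposition_mapping` is made up for this example.

```python
import unicodedata

def decomposition_mapping(ch: str) -> list:
    """Return the UCD decomposition mapping of ch as a list of code points.

    unicodedata.decomposition returns hex code points separated by spaces,
    optionally preceded by a compatibility tag such as '<compat>'.
    The tag is skipped here, so the canonical/compatibility distinction is lost.
    """
    fields = unicodedata.decomposition(ch).split()
    return [int(field, 16) for field in fields if not field.startswith("<")]

e_acute = "\u00e9"  # é, written as an escape so it cannot be silently recomposed
print(e_acute.upper())                 # uppercase mapping of é
print(decomposition_mapping(e_acute))  # code points of the decomposition
```

For é this yields the same two code points as the Pharo snippet: the base letter e (101) followed by the combining acute accent (769).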
Next we started implementing a normaliser. It was rather easy to get support for simpler languages going. The next code snippets use explicit code point arrays, because copying decomposed diacritics to my mail client does not work (they get composed automatically); in a Pharo Workspace this works nicely with plain strings. The higher numbers are the diacritics.
(normalizer decomposeString: 'les élèves Français') collect: #codePoint as: Array.

 => "#(108 101 115 32 101 769 108 101 768 118 101 115 32 70 114 97 110 99 807 97 105 115)"

(normalizer decomposeString: 'Düsseldorf Königsallee') collect: #codePoint as: Array.

 => "#(68 117 776 115 115 101 108 100 111 114 102 32 75 111 776 110 105 103 115 97 108 108 101 101)"

normalizer composeString: (#(108 101 115 32 101 769 108 101 768 118 101 115 32 70 114 97 110 99 807 97 105 115) collect: #asCharacter as: String).

 => "'les élèves Français'"

normalizer composeString: (#(68 117 776 115 115 101 108 100 111 114 102 32 75 111 776 110 105 103 115 97 108 108 101 101) collect: #asCharacter as: String).
 => "'Düsseldorf Königsallee'"
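The arrays above can be reproduced independently with Python's standard `unicodedata.normalize`, which is a handy reference while the Pharo normaliser is still in progress. This is a comparison sketch, not the prototype's implementation; `code_points` is a hypothetical helper, and the accented letters are written as escapes so that no editor or mail client can recompose them.

```python
import unicodedata

def code_points(text: str) -> list:
    """Return the code points of text as a list of integers."""
    return [ord(ch) for ch in text]

# 'les élèves Français' with precomposed é, è and ç
composed = "les \u00e9l\u00e8ves Fran\u00e7ais"

# NFD: canonical decomposition, base letter followed by combining mark
decomposed = unicodedata.normalize("NFD", composed)
print(code_points(decomposed))

# NFC: canonical decomposition followed by canonical composition
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == composed)
```

The decomposed string is three code points longer than the composed one, one for each of the three diacritics.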

However, the real algorithm following the official specification (and other elements of Unicode that interact with it) is way more complicated (think about all those special languages/scripts out there). We’re focused on understanding/implementing that now.
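One example of what the full specification adds on top of simple decomposition is canonical ordering: within a run of combining marks, the marks must be sorted by their canonical combining class, and the sort must be stable. The sketch below shows only that step, in Python, on text that is assumed to be decomposed already; `canonical_order` is a name chosen for this example and is not taken from the prototype.

```python
import unicodedata

def canonical_order(text: str) -> str:
    """Reorder combining marks by canonical combining class (ccc).

    Characters with ccc == 0 (starters) act as barriers. Each run of
    non-starters between them is sorted by ccc. Python's sorted() is stable,
    so marks with equal ccc keep their original relative order.
    """
    out = []
    run = []  # pending non-starters since the last starter
    for ch in text:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# q + dot above (ccc 230) + dot below (ccc 220): the two marks must swap
text = "q\u0307\u0323"
print([hex(ord(ch)) for ch in canonical_order(text)])
print(canonical_order(text) == unicodedata.normalize("NFD", text))
```

A naive normaliser that only replaces characters by their decomposition mapping would leave the two marks in their original order and fail on inputs like this one.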
Next, unit tests were added (of course), as well as a test that runs about 75,000 individual test cases to check conformance to the official Unicode Normalization specification.
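To give an idea of what such a conformance check looks like, here is a sketch in Python. It assumes the test data is the Unicode Character Database file NormalizationTest.txt, where each data line holds five semicolon-separated columns c1 to c5 of hex code points, and it checks the invariants stated in that file's header. Python's `unicodedata.normalize` stands in for the normaliser under test; `parse_line` and `check_line` are hypothetical names.

```python
import unicodedata

def parse_line(line: str) -> list:
    """Turn 'c1;c2;c3;c4;c5; # comment' into five strings."""
    data = line.split("#", 1)[0]
    columns = data.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in col.split()) for col in columns]

def check_line(line: str) -> dict:
    """Check the four normalisation invariants for one test line."""
    c1, c2, c3, c4, c5 = parse_line(line)
    n = unicodedata.normalize
    return {
        # c2 is the NFC of c1..c3, c4 is the NFC of c4..c5
        "NFC": all(n("NFC", c) == c2 for c in (c1, c2, c3))
               and all(n("NFC", c) == c4 for c in (c4, c5)),
        # c3 is the NFD of c1..c3, c5 is the NFD of c4..c5
        "NFD": all(n("NFD", c) == c3 for c in (c1, c2, c3))
               and all(n("NFD", c) == c5 for c in (c4, c5)),
        # c4 is the NFKC of every column, c5 the NFKD of every column
        "NFKC": all(n("NFKC", c) == c4 for c in (c1, c2, c3, c4, c5)),
        "NFKD": all(n("NFKD", c) == c5 for c in (c1, c2, c3, c4, c5)),
    }

# A line in the file's format: LATIN CAPITAL LETTER D WITH DOT ABOVE
sample = "1E0A;1E0A;0044 0307;1E0A;0044 0307; # LATIN CAPITAL LETTER D WITH DOT ABOVE"
print(check_line(sample))
```

Counting one result per line and per normalisation form gives four test cases per data line.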
Right now (with super cool Hangul/Jamo code by Henrik), we hit the following stats:
#testNFC 16998/18593 (91.42%)
#testNFD 16797/18593 (90.34%)
#testNFKC 13321/18593 (71.65%)
#testNFKD 16564/18593 (89.09%)
Way better than the naive implementations, but not there yet.
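The Hangul part is a good illustration of why table lookups alone are not enough: the Unicode Standard defines the decomposition of the 11,172 precomposed Hangul syllables arithmetically instead of listing them in the UCD. The following Python sketch follows the constants and formulas from the standard; it is not Henrik's Pharo code, and `decompose_hangul` is a name made up for this example.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable decomposition
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT  # 11172 precomposed syllables

def decompose_hangul(ch: str) -> str:
    """Decompose one precomposed Hangul syllable into its jamo.

    Characters outside the Hangul syllable block are returned unchanged.
    """
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch
    leading = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trailing = T_BASE + s_index % T_COUNT
    # trailing == T_BASE means the syllable has no trailing consonant
    jamo = [leading, vowel] if trailing == T_BASE else [leading, vowel, trailing]
    return "".join(chr(cp) for cp in jamo)

syllable = "\ud55c"  # 한
print([hex(ord(ch)) for ch in decompose_hangul(syllable)])
print(decompose_hangul(syllable) == unicodedata.normalize("NFD", syllable))
```

Composition is the inverse computation, which is why a normaliser needs dedicated Hangul code for both directions.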
We are also experimenting and thinking a lot about how best to implement all this, trying out different models/ideas/APIs/representations.
It will move slowly, but you will hear from us again in the coming weeks/months.
PS: Pharo developers with a good understanding of this subject area who want to help, let me know and we’ll put you in the loop. Hacking and specification reading are required 😉
