Someone posted an interesting problem the other day: turn 'áäåãòä' into 'aaaaoa'. It doesn't seem too complicated, but nothing
that I could think of really solved the problem in an elegant way.
Now, I know Unicode pretty well - I've implemented
Stringprep as part of the
SoapBox Platform's XMPP Implementation.
Stringprep does all sorts of interesting things, such as bidirectional checks, IDN,
Punycode, Case Folding, and
Form KC Normalization. Still, nothing came to mind as a solution other than using
a crazy table-based lookup scheme.
Then I saw a post by JR that suggested using
Unicode Normalization Form KD (NFKD) to decompose the string and the light bulbs went off! (He gets
all the credit for this idea - I just implemented it in .Net and wrote it up here).
Normalization Form KD (commonly written as NFKD) will decompose composite characters into their component forms. For example, the
character 'ẛ' is actually Unicode codepoint 1E9B (written as: U+1E9B) and is named "LATIN SMALL LETTER LONG S WITH DOT ABOVE". When
we apply NFKD, this codepoint decomposes into two codepoints: U+0073 and U+0307. You may
recognize U+0073 - it's actually the same as the ASCII code 0x73, which is 's'.
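The decomposition is easy to check for yourself. Here is a minimal sketch, assuming .Net 2.0 or later (where String.Normalize is available). The class and method names (NfkdDemo, ToCodePoints, DecomposeLongSWithDot) are made up for illustration and are not part of any library.

```csharp
using System;
using System.Text;

public static class NfkdDemo
{
    // Formats each UTF-16 code unit of a string as U+XXXX, separated by spaces.
    // (Good enough for this example, since everything here is in the BMP.)
    public static string ToCodePoints(string s)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0)
                sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    // Applies NFKD to U+1E9B and lists what it decomposes into.
    public static string DecomposeLongSWithDot()
    {
        string composed = "\u1E9B";
        string decomposed = composed.Normalize(NormalizationForm.FormKD);
        return ToCodePoints(decomposed);
    }
}
```

Calling NfkdDemo.DecomposeLongSWithDot() should return "U+0073 U+0307": the plain 's' followed by COMBINING DOT ABOVE.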
Now that we know Normalization form KD will decompose our string into something more basic, we
just need to figure out how to do this in .Net. In .Net 1.0 & 1.1, Microsoft didn't include any deep
Unicode support. Fortunately, much of this has been corrected in .Net 2.0, and now normalizing a string is a
breeze:
string s = "áäåãòä";
string normalized = s.Normalize(NormalizationForm.FormKD);
Now we have a Form KD Normalized string. If we display the string, say, in a MessageBox, it looks the
same, but looks are deceiving. Under the hood, it's fully decomposed. A fuller discussion of this would get into
topics such as Combining Characters
and Text Elements. More information for .Net developers can be found
under String Indexing.
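To see the "looks the same, but isn't" effect in numbers, here is a small sketch that compares the count of UTF-16 code units with the count of text elements. It assumes the standard StringInfo class from System.Globalization; the DecomposedLengthDemo class and its method names are hypothetical, made up for this example.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DecomposedLengthDemo
{
    // Decomposes a string with NFKD, as in the snippet above.
    public static string Decompose(string s)
    {
        return s.Normalize(NormalizationForm.FormKD);
    }

    // Number of UTF-16 code units: what string.Length and the indexer see.
    public static int CodeUnitCount(string s)
    {
        return s.Length;
    }

    // Number of text elements: a base character plus its combining marks
    // counts as one, which is roughly what a reader sees as "a character".
    public static int TextElementCount(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }
}
```

For the sample string, the composed form has 6 code units and 6 text elements. After Decompose, the string still has 6 text elements, but its Length is 12: each letter is now a base character followed by one combining mark.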
.Net programmers know that all .Net strings are UTF-16 encoded. This means our nice decomposed string
is sitting in a format that (at this point) isn't really what we want - it's stored as a sequence of
base characters, each followed by its combining characters, all encoded in UTF-16.
At this point, my original blog entry on this topic used the ASCII encoder to rip out the unwanted
marks in what was a pretty clever way. While clever, it was pointed out to me by
Michael Kaplan what a fool I was being for
taking this approach. He was polite about it, and while I hate being called a fool, he was sure right! Pivoting through
the ASCII encoder was just silly. Ah well. Mistakes & public humiliation are how we learn...
The correct way to clean things up from here is shown by Michael Kaplan's blog entry
entitled Stripping diacritics.
As Michael points out, the proper way to remove the marks is to iterate over the characters in the string, and use the
CharUnicodeInfo class
to determine if they're Non-Spacing Marks.
If they are, then they're skipped - if they are not, then they're appended to our new string. The resulting string
has the right results in it - unlike my original solution which only worked for ASCII characters that were marked.
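// Requires: using System.Globalization; using System.Text;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < normalized.Length; i++)
{
    char c = normalized[i];
    UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(c);
    // Keep everything except the non-spacing (combining) marks.
    if (uc != UnicodeCategory.NonSpacingMark)
        sb.Append(c);
}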
The complete source code looks like:
string s = "áäåãòä";
string normalized = s.Normalize(NormalizationForm.FormKD);
StringBuilder sb = new StringBuilder();
for (int i = 0; i < normalized.Length; i++)
{
    char c = normalized[i];
    UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(c);
    if (uc != UnicodeCategory.NonSpacingMark)
        sb.Append(c);
}
MessageBox.Show(sb.ToString());
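If you'd rather not paste the loop everywhere, the same steps can be wrapped in a helper. This is just a sketch of the normalize-then-filter approach above, not a definitive implementation; the DiacriticStripper class and StripMarks method are names I made up for this example.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DiacriticStripper
{
    // Hypothetical helper: decomposes with NFKD, then drops non-spacing marks.
    public static string StripMarks(string input)
    {
        if (input == null)
            throw new ArgumentNullException("input");

        string normalized = input.Normalize(NormalizationForm.FormKD);
        StringBuilder sb = new StringBuilder(normalized.Length);

        foreach (char c in normalized)
        {
            // Base characters are kept; combining marks are skipped.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Because the filter looks at the Unicode category rather than at ASCII, it also works when the base character isn't ASCII: StripMarks("ά") (Greek alpha with tonos, U+03AC) gives a plain "α". Two things to keep in mind: letters that have no decomposition, such as 'ø', pass through unchanged, and since Form KD is a compatibility form, it also rewrites things like the 'ﬁ' ligature into "fi".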
Whenever I drop into doing Unicode-related tasks, I'm always amazed at the sheer breadth
of the Unicode standard. There is so much information in there, and so many powerful features,
that it's easy to quickly become overwhelmed.
It's easy, too, to forget that everything we do these days on a computer is leveraging Unicode. Pretty much everything is
encoded in either UTF-8 or UTF-16 - all web pages, all XML documents, all text files stored on your hard disk. Unicode
is at the heart of Windows, Linux, .Net & Java. Despite this, very few developers have any real understanding of what
Unicode is, or how it works. I've been asking 'What does that UTF-8 or UTF-16 mean that you've typed in a zillion
times?' during interviews for years now, and have yet to get back the right answer (although I've sure had
some creative responses!).
I guess the old adage 'The best way to get the right answer on Usenet is to post the wrong one.' is still alive & true and applies equally well to the blog world...