Unicode and .NET

Difficult Bits

Okay, so those are the basics of Unicode. There are lots of extra bits, some of which have already been hinted at, which people ought to be aware of even if they decide the issues are too unlikely to affect their application to be worth sorting out. I don't offer any general techniques or guiding principles here - I'm just trying to raise some awareness. This is by no means an exhaustive list, either - these are just some of the nasty bits. It's important to recognise that a lot of the difficulty here is in no way the fault of the Unicode Consortium - just as with dates and times and any number of other internationalisation problems, humanity has got itself into a fundamentally tricky situation over the course of its history.

Culture-sensitive searching and casing

These are covered in my article on .NET string handling.

Surrogate pairs

Now that Unicode has more than 65536 characters, not every character can be represented in two bytes. This means that a single .NET char value can't store all possible values. The solution UTF-16 uses is that of surrogate pairs: pairs of 16-bit values where each value is between 0xd800 and 0xdfff. In other words, two "sort of" characters make one "real" character. (UCS-4 and UTF-32 get round this problem entirely by having wider values to start with - when everything's four bytes, you can get all possible characters in.) This is basically a headache - it means that a string of 10 chars can actually represent anywhere between 5 and 10 "real" Unicode characters. Fortunately, most applications which don't involve scientific/mathematical notation or Han characters are unlikely to need to worry too much about surrogate pairs. Whether or not that applies to you is a different matter - and exactly which bits of your code are sensitive to surrogates will also vary between applications.
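As a rough illustration of the difference between chars and "real" characters, here is a minimal sketch that counts code points rather than UTF-16 code units. The class and method names (SurrogateDemo, CountCodePoints) are made up for this example; char.IsSurrogatePair and char.ConvertToUtf32 are the framework calls doing the work.

```csharp
using System;

// Hypothetical helper class, for illustration only.
static class SurrogateDemo
{
    // Counts Unicode code points rather than UTF-16 code units.
    // A well-formed surrogate pair counts once; a lone surrogate counts as one.
    public static int CountCodePoints(string value)
    {
        int count = 0;
        for (int i = 0; i < value.Length; i++)
        {
            // Skip the second half of a well-formed pair.
            if (char.IsSurrogatePair(value, i))
            {
                i++;
            }
            count++;
        }
        return count;
    }

    static void Main()
    {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the range a single char can hold.
        string text = "a\U0001D11Eb";

        Console.WriteLine(text.Length);                            // 4
        Console.WriteLine(CountCodePoints(text));                  // 3
        Console.WriteLine("{0:x}", char.ConvertToUtf32(text, 1));  // 1d11e
    }
}
```

Anything which indexes into a string, truncates it or reverses it is a likely candidate for being sensitive to this difference.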

Combining characters

Not every character in a string should result in a separate character being drawn on the screen. An accented character can be represented as the unaccented character followed by a combining accent character. Some GUI systems will support combining characters, some won't - and the impact on your application will depend on what assumptions you're making.
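To make that concrete, here is a small sketch comparing "e" followed by a combining acute accent with the single precomposed character. It uses System.Globalization.StringInfo to count text elements; the CombiningDemo class and CountTextElements method are names I've invented for the example.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class, for illustration only.
static class CombiningDemo
{
    // Counts text elements (roughly, what a user would call a character)
    // as reported by StringInfo.
    public static int CountTextElements(string value)
    {
        return new StringInfo(value).LengthInTextElements;
    }

    static void Main()
    {
        string combined = "e\u0301";    // 'e' followed by U+0301 COMBINING ACUTE ACCENT
        string precomposed = "\u00e9";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE

        Console.WriteLine(combined.Length);                 // 2
        Console.WriteLine(CountTextElements(combined));     // 1
        Console.WriteLine(precomposed.Length);              // 1
        Console.WriteLine(CountTextElements(precomposed));  // 1
    }
}
```

Code which assumes one char per displayed character (column alignment, cursor movement, length limits) is where this is most likely to bite.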

Normalization

Partly due to things like combining characters, there can be several ways of representing what is in some senses a single character. Character sequences can be normalized to use combining characters wherever possible, or to avoid using combining characters wherever possible. Should your application treat two different sequences representing the same actual character as different or the same? Do any components you need rely on sequences being normalized in one particular way?
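As a sketch of what this looks like in .NET, string.Normalize converts between the two styles: NormalizationForm.FormD decomposes into combining characters, and NormalizationForm.FormC composes where it can. The NormalizationDemo class and EqualAfterNormalization method are hypothetical names, and choosing FormC for the comparison is just an assumption for the example rather than a recommendation.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
static class NormalizationDemo
{
    // Compares two strings ordinally after normalizing both to form C.
    public static bool EqualAfterNormalization(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormC),
            b.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
    }

    static void Main()
    {
        string combined = "e\u0301";    // 'e' + combining acute accent
        string precomposed = "\u00e9";  // single precomposed character

        // The raw sequences differ...
        Console.WriteLine(string.Equals(combined, precomposed, StringComparison.Ordinal)); // False
        // ...but are the same once both are normalized the same way.
        Console.WriteLine(EqualAfterNormalization(combined, precomposed));                 // True

        Console.WriteLine(precomposed.Normalize(NormalizationForm.FormD).Length);          // 2
        Console.WriteLine(combined.Normalize(NormalizationForm.FormC).Length);             // 1
    }
}
```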

Debugging Unicode Problems

This page describes what to do in a very specific situation. Namely, you've got some character data in one place (typically a database) which has to go through various steps and then ends up being shown to the user (often on a web page). Unfortunately, some characters aren't being displayed correctly. Due to the many steps involved, the problem can occur in various places. This page aims to help you find out what's wrong simply and reliably.

Step 1: Understand the basics of Unicode

If you feel comfortable with Unicode, character encodings etc, feel free to skip this step. Basically, you need to know a little bit about what characters are and what conversions are likely to be applied to them before going much further.

Step 2: Try to identify the possible conversions involved

If you can work out where things might be going wrong, it's much easier to then isolate which one it is. Also bear in mind not just how you're retrieving the data, but how the data got there in the first place. (Some problems I've seen have been due to an old application writing to and reading from the database in an incorrect way, but the bugs cancelling each other out. No problems occur when it's just this broken application which accesses the database, but things go wrong when anything else does.) Steps involved may well include fetching the data from the database, reading it from a file, sending it across a web connection, or displaying it on the screen.
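Here is a small sketch of the "bugs cancelling each other out" situation, assuming the mistake is a mix-up between UTF-8 and ISO-8859-1 (one common case, not necessarily yours). The MismatchDemo class and its two methods are invented for the example.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
static class MismatchDemo
{
    static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Simulates a component which receives UTF-8 bytes but decodes them as ISO-8859-1.
    public static string MisreadUtf8AsLatin1(string value)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(value);
        return Latin1.GetString(utf8Bytes);
    }

    // Simulates the opposite mistake: ISO-8859-1 bytes decoded as UTF-8.
    public static string MisreadLatin1AsUtf8(string value)
    {
        byte[] latin1Bytes = Latin1.GetBytes(value);
        return Encoding.UTF8.GetString(latin1Bytes);
    }

    static void Main()
    {
        string original = "caf\u00e9";

        // One broken conversion: U+00E9 turns into two characters.
        string corrupted = MisreadUtf8AsLatin1(original);
        foreach (char c in corrupted)
        {
            Console.Write("{0:x4} ", (int)c);
        }
        Console.WriteLine(); // 0063 0061 0066 00c3 00a9

        // The opposite broken conversion happens to undo the damage.
        string roundTripped = MisreadLatin1AsUtf8(corrupted);
        Console.WriteLine(roundTripped == original); // True
    }
}
```

The broken application sees the right data at both ends, while anything reading the stored value in between sees the corrupted form.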

Step 3: Verify the data at each step

The first lesson here is not to trust anything which tries to log the character data as a sequence of glyphs. Instead, you should log the character data as a sequence of Unicode values (integers). For instance, if I had a string containing the word "hello", I would display it as "0068 0065 006c 006c 006f". (Using hex makes it easier to check values against the Unicode code charts later.) To achieve this, step through each character in the string and display the character however you would display an integer. For instance, here is a method to dump all the characters in a string to the console:

using System;

// Writes each char (UTF-16 code unit) of the string as a four-digit hex value.
static void DumpString (string value)
{
    foreach (char c in value)
    {
        Console.Write ("{0:x4} ", (int)c);
    }
    Console.WriteLine();
}

Depending on your exact environment, your method of logging will vary, but using something like the above should give you what you need.
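If the console isn't available to you (in a web application, say), one possible variation is to build the hex dump as a string and hand it to whatever logging you already have. This is only a sketch; UnicodeLog and FormatCodeUnits are hypothetical names.

```csharp
using System;
using System.Linq;

// Hypothetical helper class, for illustration only.
static class UnicodeLog
{
    // Returns the chars (UTF-16 code units) of the string as
    // space-separated four-digit hex values.
    public static string FormatCodeUnits(string value)
    {
        return string.Join(" ", value.Select(c => ((int)c).ToString("x4")));
    }

    static void Main()
    {
        Console.WriteLine(FormatCodeUnits("hello")); // 0068 0065 006c 006c 006f
    }
}
```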

The reason for doing this is that it gets rid of problems with fonts, other encoding issues, etc. If you can't log even plain ASCII hex digits properly, you're in a world of trouble anyway - but you may well not be able to log Unicode in a reliable way, and as you already know you've got some problems on the Unicode front, it's worth being safe.

Now you need to make sure there's a test case to use. Find some (preferably small) example of where your application is failing, make sure you know exactly what the result should be, and then log the actual result at each of your possible problem points. (Some may be out of your control, but usually if you log as soon as you receive some data and just before you send some data, you'll find the problem.)

Having logged a problematic string, you should verify whether or not it's what it should be. This is where the Unicode code charts page comes in. You can either pick which block you believe the correct character is in, or you can search for your character alphabetically. Check that each character in the string has its proper Unicode value. As soon as you find a point in your application flow where the character data is corrupted, you should investigate that area of the code, find out why it's being corrupted and fix it. When you've got it right throughout the application flow, the application should be working properly.
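Once you know what the values should be, the comparison at each point can be automated. The following is a minimal sketch which reports the first position where the logged string differs from the expected one; UnicodeCheck and FirstMismatch are made-up names, and the sample "actual" value is just an example of corrupted data.

```csharp
using System;

// Hypothetical helper class, for illustration only.
static class UnicodeCheck
{
    // Returns the index of the first differing char (UTF-16 code unit),
    // or -1 if the two strings are identical.
    public static int FirstMismatch(string expected, string actual)
    {
        int common = Math.Min(expected.Length, actual.Length);
        for (int i = 0; i < common; i++)
        {
            if (expected[i] != actual[i])
            {
                return i;
            }
        }
        // Same prefix: either identical, or one string is longer than the other.
        return expected.Length == actual.Length ? -1 : common;
    }

    static void Main()
    {
        string expected = "caf\u00e9";        // what the code charts say it should be
        string actual = "caf\u00c3\u00a9";    // example of what one step produced

        int index = FirstMismatch(expected, actual);
        Console.WriteLine(index); // 3
        Console.WriteLine("expected {0:x4}, got {1:x4}",
            (int)expected[index], (int)actual[index]); // expected 00e9, got 00c3
    }
}
```

Running a check like this at each problem point shows which step first introduces a wrong value.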

About the author

Jon Skeet United Kingdom

C# MVP currently living in Reading and working for Google.
