
Unicode and .NET

21 Jul 2005 | by Jon Skeet

Difficult Bits

Okay, so those are the basics of Unicode. There are lots of extra complications, some of which have already been hinted at, which people ought to be aware of even if they decide the issues are too unlikely to affect their application to be worth handling. I don't offer any general techniques or guiding principles here - I'm just trying to raise some awareness. This is by no means an exhaustive list, either - these are just some of the nasty bits. It's important to recognise that a lot of the difficulty here is in no way the fault of the Unicode Consortium - just as with dates and times and any number of other internationalisation problems, humanity has got itself into a fundamentally tricky situation over the course of its history.

Culture-sensitive searching and casing

These are covered in my article on .NET string handling.

Surrogate pairs

Now that Unicode has more than 65536 characters, they can't all be represented in two bytes. This means that a single .NET char value can't store every possible character. The solution UTF-16 uses is that of surrogate pairs: pairs of 16-bit values where each value is between 0xd800 and 0xdfff (a high surrogate in the range 0xd800-0xdbff followed by a low surrogate in the range 0xdc00-0xdfff). In other words, two "sort of" characters make one "real" character. (UCS-4 and UTF-32 get round this problem entirely by having wider values to start with - when everything's four bytes, you can get all possible characters in.) This is basically a headache - it means that a string of 10 chars can actually represent anywhere between 5 and 10 "real" Unicode characters. Fortunately, most applications which don't involve scientific/mathematical notation and Han characters are unlikely to need to worry too much about them. Whether or not that applies to you is a different matter - and exactly which bits of your code are sensitive to surrogates will also vary between applications.
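As a rough illustration of the difference between chars and "real" characters, here is a minimal sketch that counts code points instead of UTF-16 code units. The class and method names (SurrogateDemo, CountCodePoints) are made up for this example, and the sample character U+1D11E (MUSICAL SYMBOL G CLEF) is just one convenient character outside the 16-bit range.

```csharp
using System;

static class SurrogateDemo
{
    // Hypothetical helper: counts Unicode code points rather than UTF-16 code units.
    public static int CountCodePoints(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            // A valid high/low pair occupies two chars but is only one code point.
            if (char.IsHighSurrogate(text[i])
                && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                i++;
            }
            count++;
        }
        return count;
    }

    static void Main()
    {
        // U+1D11E does not fit in a single char, so it becomes a surrogate pair.
        string clef = char.ConvertFromUtf32(0x1D11E);
        Console.WriteLine(clef.Length);                                  // 2
        Console.WriteLine("{0:x4} {1:x4}", (int)clef[0], (int)clef[1]);  // d834 dd1e
        Console.WriteLine(CountCodePoints("a" + clef + "b"));            // 3
    }
}
```

An unpaired surrogate is counted as one code point here; whether that is the right behaviour depends on how strict your application needs to be about malformed data.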

Combining characters

Not every character should result in a separate glyph being drawn on the screen. An accented character can be represented as the unaccented character followed by a combining accent character (for instance, "e" followed by U+0301, the combining acute accent). Some GUI systems will support combining characters, some won't - and the impact on your application will depend on what assumptions you're making.
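To see how a combining character affects the usual assumption that one char is one displayed character, the sketch below compares the string length with the number of text elements reported by System.Globalization.StringInfo. The class and method names (CombiningDemo, CountTextElements) are invented for this example.

```csharp
using System;
using System.Globalization;

static class CombiningDemo
{
    // Hypothetical helper: counts text elements (a base character plus any
    // combining marks attached to it) instead of individual chars.
    public static int CountTextElements(string text)
    {
        return new StringInfo(text).LengthInTextElements;
    }

    static void Main()
    {
        string decomposed = "e\u0301"; // "e" followed by COMBINING ACUTE ACCENT
        Console.WriteLine(decomposed.Length);             // 2
        Console.WriteLine(CountTextElements(decomposed)); // 1
    }
}
```

Code which truncates, reverses or indexes strings by char is the kind of place where this difference tends to matter.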

Normalization

Partly due to things like combining characters, there can be several ways of representing what is in some senses a single character. Character sequences can be normalised to use combining characters wherever possible (a decomposed form), or to avoid using combining characters wherever possible (a composed form). Should your application treat two different sequences representing the same actual character as different or the same? Do any components you need rely on sequences being normalised in one particular way?
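One way of treating equivalent sequences as the same is to normalise both sides before comparing them. The following is only a sketch of that idea using string.Normalize; the helper name EqualWhenNormalized is made up, and the choice of composed form (FormC) is an assumption rather than a recommendation for every application.

```csharp
using System;
using System.Text;

static class NormalizationDemo
{
    // Hypothetical helper: compares two strings after normalising both to
    // composed form, then compares the resulting chars ordinally.
    public static bool EqualWhenNormalized(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormC),
            b.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
    }

    static void Main()
    {
        string composed = "\u00e9";    // single precomposed character
        string decomposed = "e\u0301"; // "e" plus combining acute accent

        Console.WriteLine(string.Equals(composed, decomposed, StringComparison.Ordinal)); // False
        Console.WriteLine(EqualWhenNormalized(composed, decomposed));                     // True
        Console.WriteLine(composed.Normalize(NormalizationForm.FormD).Length);            // 2
    }
}
```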

Debugging Unicode Problems

This page describes what to do in a very specific situation. Namely, you've got some character data in one place (typically a database) which has to go through various steps and then ends up being shown to the user (often on a web page). Unfortunately, some characters aren't being displayed correctly. Due to the many steps involved, the problem can occur in various places. This page aims to help you find out what's wrong simply and reliably.

Step 1: Understand the basics of Unicode

If you feel comfortable with Unicode, character encodings etc, feel free to skip this step. Basically, you need to know a little bit about what characters are and what conversions are likely to be applied to them before going much further.

Step 2: Try to identify the possible conversions involved

If you can work out where things might be going wrong, it's much easier to then isolate which one it is. Also bear in mind not just how you're retrieving the data, but how the data got there in the first place. (Some problems I've seen have been due to an old application writing to and reading from the database in an incorrect way, but the bugs cancelling each other out. No problems occur when it's just this broken application which accesses the database, but things go wrong when anything else does.) Steps involved may well include fetching the data from the database, reading it from a file, sending it across a web connection, or displaying it on the screen.
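The "bugs cancelling each other out" situation can be reproduced in a few lines. This is a minimal sketch, assuming the two faulty steps are a UTF-8/ISO-8859-1 mix-up in each direction; the class and method names are invented for the example, and your own application may involve entirely different encodings.

```csharp
using System;
using System.Text;

static class ConversionDemo
{
    // Simulates one faulty step: text written as UTF-8 but read as ISO-8859-1.
    public static string MisreadUtf8AsLatin1(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }

    // The mirror-image fault: text written as ISO-8859-1 but read as UTF-8.
    public static string MisreadLatin1AsUtf8(string text)
    {
        byte[] bytes = Encoding.GetEncoding("iso-8859-1").GetBytes(text);
        return Encoding.UTF8.GetString(bytes);
    }

    static void Main()
    {
        string original = "\u00e9";                     // e with acute accent
        string broken = MisreadUtf8AsLatin1(original);  // now U+00C3 U+00A9
        string restored = MisreadLatin1AsUtf8(broken);  // the two faults cancel out

        Console.WriteLine(broken.Length);               // 2
        Console.WriteLine(restored == original);        // True
    }
}
```

Anything reading the "broken" value without applying the second fault sees two wrong characters, which is why the problem only shows up when another application touches the data.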

Step 3: Verify the data at each step

The first lesson here is not to trust anything which tries to log the character data as a sequence of glyphs. Instead, you should log the character data as a sequence of Unicode values (integers). For instance, if I had a string containing the word "hello", I would display it as "0068 0065 006c 006c 006f". (Using hex makes it easier to check values against the Unicode code charts later.) To achieve this, step through each character in the string and display the character however you would display an integer. For instance, here is a method to dump all the characters in a string to the console:

  static void DumpString (string value)
  {
      foreach (char c in value)
      {
          Console.Write ("{0:x4} ", (int)c);
      }
      Console.WriteLine();
  }

Depending on your exact environment, your method of logging will vary, but using something like the above should give you what you need.
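If you are logging to something other than the console, it may be more convenient to build the hex dump as a string and pass it to whatever logging mechanism you have. The following is one possible variation; the names UnicodeLog and ToHexDump are made up for this sketch.

```csharp
using System;
using System.Text;

static class UnicodeLog
{
    // Hypothetical helper: returns the chars of a string as space-separated
    // four-digit hex values, suitable for writing to any plain ASCII log.
    public static string ToHexDump(string value)
    {
        StringBuilder builder = new StringBuilder();
        foreach (char c in value)
        {
            if (builder.Length > 0)
            {
                builder.Append(' ');
            }
            builder.Append(((int)c).ToString("x4"));
        }
        return builder.ToString();
    }

    static void Main()
    {
        Console.WriteLine(ToHexDump("hello")); // 0068 0065 006c 006c 006f
    }
}
```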

The reason for doing this is that it gets rid of problems with fonts, other encoding issues, etc. If you can't log even plain ASCII hex digits properly, you're in a world of trouble anyway - but you may well not be able to log Unicode in a reliable way, and as you already know you've got some problems on the Unicode front, it's worth being safe.

Now you need to make sure there's a test case to use. Find some (preferably small) example of where your application is failing, make sure you know exactly what the result should be, and then log the actual result at each of your possible problem points. (Some may be out of your control, but usually if you log as soon as you receive some data and just before you send some data, you'll find the problem.)

Having logged a problematic string, you should verify whether or not it's what it should be. This is where the Unicode code charts page comes in. You can either pick which block you believe the correct character is in, or you can search for your character alphabetically. Check that each character in the string has its proper Unicode value. As soon as you find a point in your application flow where the character data is corrupted, you should investigate that area of the code, find out why it's being corrupted and fix it. When you've got it right throughout the application flow, the application should be working properly.
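Once you know the expected values, the comparison at each logging point can be automated. Here is a small sketch which reports where an actual string first departs from the expected one; the names UnicodeCheck and FirstDifference are invented, and the sample data ("café" turning into U+00C3 U+00A9) is purely illustrative.

```csharp
using System;

static class UnicodeCheck
{
    // Hypothetical helper: returns the index of the first char which differs
    // between the two strings, or -1 if they are identical.
    public static int FirstDifference(string expected, string actual)
    {
        int shared = Math.Min(expected.Length, actual.Length);
        for (int i = 0; i < shared; i++)
        {
            if (expected[i] != actual[i])
            {
                return i;
            }
        }
        // Same prefix: either identical, or one string is longer than the other.
        return expected.Length == actual.Length ? -1 : shared;
    }

    static void Main()
    {
        string expected = "caf\u00e9";
        string actual = "caf\u00c3\u00a9";

        int index = FirstDifference(expected, actual);
        Console.WriteLine(index); // 3
        Console.WriteLine("expected {0:x4}, got {1:x4}",
            (int)expected[index], (int)actual[index]); // expected 00e9, got 00c3
    }
}
```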

About the author

Jon Skeet

C# MVP currently living in Reading and working for Google.

www.pobox.com
