Unicode and .NET

What does .NET provide?

If all of this sounds rather confusing, don't worry. It's worth being aware of the distinctions above, but they don't often actually come to the fore. Most of the time you just want to convert some bytes into some characters, and vice versa. This is where the System.Text.Encoding class comes in, along with the System.Char structure (aka char in C#) and the System.String class (aka string in C#).

The char is the most basic character type. Each char is a single Unicode character. It takes 2 bytes in memory, and can take a value in the range 0-65535. Note, however, that not every value in that range is a valid Unicode character in its own right.
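To illustrate the points above (a minimal sketch; the class name is just for the demo):

```csharp
using System;

class CharDemo
{
    static void Main()
    {
        char c = 'A';
        // Each char is a 16-bit UTF-16 code unit; for characters in the
        // basic multilingual plane, its numeric value is the code point
        Console.WriteLine((int)c);       // 65
        Console.WriteLine(sizeof(char)); // 2 (bytes)

        // Not every value is a character in its own right: the range
        // 0xd800-0xdfff is reserved for surrogate pairs
        Console.WriteLine(char.IsSurrogate('\ud800')); // True
    }
}
```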

A string is just a sequence of chars, fundamentally. It's immutable, which means that once you've created a string instance (however you've done it) you can't change it - the various methods in the string class which suggest that they're changing the string in fact just return a new string which is the original character sequence with the changes applied.
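A quick demonstration of that immutability - calling a method such as ToUpper leaves the original string untouched and hands back a new one (the class name is just for the demo):

```csharp
using System;

class ImmutableDemo
{
    static void Main()
    {
        string original = "hello";

        // ToUpper doesn't change the original - it returns a new string
        // with the changes applied
        string upper = original.ToUpper();

        Console.WriteLine(original); // still "hello"
        Console.WriteLine(upper);    // "HELLO"
        Console.WriteLine(ReferenceEquals(original, upper)); // False
    }
}
```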

The System.Text.Encoding class provides facilities for converting arrays of bytes to arrays of characters, or strings, and vice versa. The class itself is abstract; various implementations are provided by .NET and can easily be instantiated, and users can write their own derived classes if they wish. (This is quite a rare requirement, however - most of the time you'll be fine with the built-in implementations.) An encoding can also provide separate encoders and decoders, which maintain state between calls. This is necessary for multi-byte character encoding schemes, where you may not be able to decode all the bytes you have so far received from a stream. For instance, if a UTF-8 decoder receives 0x41 0xc2, it can return the first character (a capital A) but must wait for the third byte to determine what the second character is.
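The 0x41 0xc2 example above can be sketched directly with a stateful Decoder, which buffers the incomplete sequence between calls (the class name and buffer size are just for the demo):

```csharp
using System;
using System.Text;

class DecoderDemo
{
    static void Main()
    {
        // Unlike Encoding.GetChars, a Decoder maintains state between calls
        Decoder decoder = Encoding.UTF8.GetDecoder();
        char[] buffer = new char[10];

        // 0x41 is 'A'; 0xc2 starts a two-byte sequence, so only one
        // character can be produced so far
        int count = decoder.GetChars(new byte[] { 0x41, 0xc2 }, 0, 2, buffer, 0);
        Console.WriteLine(count);     // 1
        Console.WriteLine(buffer[0]); // A

        // Supplying the final byte completes the second character
        count = decoder.GetChars(new byte[] { 0xb5 }, 0, 1, buffer, 0);
        Console.WriteLine(count);     // 1
        Console.WriteLine(buffer[0]); // µ (U+00B5)
    }
}
```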

Built-in encoding schemes

.NET provides various encoding schemes "out of the box". What follows below is a description (as far as I can find) of the various different encoding schemes, and how they can be retrieved.

ASCII

ASCII is one of the most commonly known and frequently misunderstood character encodings. Contrary to popular belief, it is only 7 bit - there are no ASCII characters above 127. If anyone says that they wish to encode (for example) "ASCII 154" they may well not know exactly which encoding they actually mean. If pressed, they're likely to say it's "extended ASCII". There is no encoding scheme called "extended ASCII". There are many 8-bit encodings which are supersets of ASCII, and usually it is one of these which is meant - commonly whatever Windows Code Page is the default for their computer. Every ASCII character has the same value in the ASCII encoding as in the Unicode coded character set - in other words, ASCII x is the same character as Unicode x for all characters within ASCII. The .NET ASCIIEncoding class (an instance of which can be easily retrieved using the Encoding.ASCII property) is slightly odd, in my view, as it appears to encode by merely stripping away all bits above the bottom 7. This means that, for instance, Unicode character 0xb5 ("micro sign") after encoding and decoding would become Unicode 0x35 ("digit five"), rather than some character showing that it was the result of encoding a character not contained within ASCII.
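The lossiness can be seen in action. Exactly which substitute character comes back depends on the framework version - the bit-stripping described above, or a replacement "?" in later versions - so this sketch only demonstrates that the original character is lost (the class name is mine):

```csharp
using System;
using System.Text;

class AsciiDemo
{
    static void Main()
    {
        // U+00B5 (micro sign) is outside ASCII
        byte[] bytes = Encoding.ASCII.GetBytes("\u00b5");
        string roundTripped = Encoding.ASCII.GetString(bytes);

        // Depending on framework version this prints "5" (bits stripped)
        // or "?" (replacement fallback) - either way, not the micro sign
        Console.WriteLine(roundTripped);
    }
}
```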

UTF-8

UTF-8 is a good general-purpose way of representing Unicode characters. Each character is encoded as a sequence of 1-4 bytes. (All the characters < 65536 are encoded in 1-3 bytes; I haven't checked whether .NET encodes surrogates as two sequences of 1-3 bytes, or as one sequence of 4 bytes.) It can represent all characters, and it is "ASCII-compatible" in that any sequence of characters in the ASCII set is encoded in UTF-8 to exactly the same sequence of bytes as it would be in ASCII. In addition, the first byte is sufficient to say how many additional bytes (if any) are required for the whole character to be decoded. UTF-8 itself needs no byte-ordering mark (BOM), although one could be used as a way of giving evidence that the file is indeed in UTF-8 format. The UTF-8 encoded BOM is always 0xef 0xbb 0xbf. Obtaining a UTF-8 encoding in .NET is simple - use the Encoding.UTF8 property. In fact, a lot of the time you don't even need to do that - many classes (such as StreamWriter) use UTF-8 by default when no encoding is specified. (Don't be misled by Encoding.Default - that's something else entirely!) I suggest always specifying the encoding, however, just for the sake of readability.
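A small sketch of the ASCII compatibility, the multi-byte sequences and the encoded BOM described above (the class name is mine):

```csharp
using System;
using System.Text;

class Utf8Demo
{
    static void Main()
    {
        // "A" encodes to the same single byte as in ASCII; the micro
        // sign (U+00B5) needs two bytes
        byte[] bytes = Encoding.UTF8.GetBytes("A\u00b5");
        Console.WriteLine(BitConverter.ToString(bytes)); // 41-C2-B5

        // GetPreamble returns the UTF-8 encoded BOM
        byte[] preamble = Encoding.UTF8.GetPreamble();
        Console.WriteLine(BitConverter.ToString(preamble)); // EF-BB-BF
    }
}
```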

UTF-16 and UCS-2

UTF-16 is effectively how characters are maintained internally in .NET. Each character is encoded as a sequence of 2 bytes, other than surrogates which take 4 bytes. The opportunity of using surrogates is the only difference between UTF-16 and UCS-2 (also known as just "Unicode"), the latter of which can only represent characters 0-0xffff. UTF-16 can be big-endian, little-endian, or machine-dependent with optional BOM (0xff 0xfe for little-endianness, and 0xfe 0xff for big-endianness). In .NET itself, I believe the surrogate issues are effectively forgotten, and each value in the surrogate pair is treated as an individual character, making UCS-2 and UTF-16 "the same" in a fuzzy sort of way. (The exact differences between UCS-2 and UTF-16 rely on deeper understanding of surrogates than I have, I'm afraid - if you need to know details of the differences, chances are you'll know more than I do anyway.) A big-endian encoding may be retrieved using Encoding.BigEndianUnicode, and a little-endian encoding may be retrieved using Encoding.Unicode. Both are instances of System.Text.UnicodeEncoding, which can also be constructed directly with appropriate parameters for whether or not to emit the BOM and which endianness to use when encoding. I believe (although I haven't tested) that when decoding binary content, a BOM in the content overrides the endianness of the encoder, so the programmer doesn't need to do any extra work to decode appropriately if they either know the endianness or the content contains a BOM.
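The two endiannesses and their BOMs can be demonstrated like this (the class name is just for the demo):

```csharp
using System;
using System.Text;

class Utf16Demo
{
    static void Main()
    {
        // 'A' is U+0041; the byte order differs between the two encodings
        Console.WriteLine(BitConverter.ToString(
            Encoding.Unicode.GetBytes("A")));          // 41-00 (little-endian)
        Console.WriteLine(BitConverter.ToString(
            Encoding.BigEndianUnicode.GetBytes("A"))); // 00-41 (big-endian)

        // The preambles are the BOMs described above
        Console.WriteLine(BitConverter.ToString(
            Encoding.Unicode.GetPreamble()));          // FF-FE
        Console.WriteLine(BitConverter.ToString(
            Encoding.BigEndianUnicode.GetPreamble())); // FE-FF
    }
}
```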

UTF-7

UTF-7 is rarely used, in my experience, but encodes Unicode (possibly only the first 65535 characters) entirely into ASCII characters (not bytes!). This can be useful for mail where the mail gateway may only support ASCII characters, or some subset of ASCII (in, for example, the EBCDIC encoding). This description sounds fairly woolly for a reason: I haven't looked into it in any detail, and don't intend to. If you need to use it, you'll probably understand it reasonably well anyway, and if you don't absolutely have to use it, I'd suggest steering clear. An encoding instance in .NET can be retrieved using Encoding.UTF7.

Windows/ANSI Code Pages

Windows Code Pages are usually either single or double byte character sets, encoding up to 256 or 65536 characters respectively. Each is numbered, and an encoding for a known code page number can be retrieved using Encoding.GetEncoding(int). Code pages are mostly useful for legacy data which is often stored in the "default code page". An encoding for the default code page can be retrieved using Encoding.Default. Again, I try to avoid using code pages where possible. More information is available on MSDN.

ISO-8859-1 (Latin-1)

Like ASCII, every character in Latin-1 has the same code there as in Unicode. I haven't been able to ascertain for certain whether or not Latin-1 has a "hole" of undefined characters from 128 to 159, or whether it contains the same control characters there that Unicode does. Latin-1 is also code page 28591, so obtaining an encoding for it is simple: Encoding.GetEncoding (28591).
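A quick sketch of that one-to-one correspondence - every byte maps straight to the Unicode character with the same value, and back again (the class name is mine):

```csharp
using System;
using System.Text;

class Latin1Demo
{
    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding(28591);

        // Byte 0xe9 decodes to U+00E9 ("é") - same value in Latin-1
        // as in Unicode
        string decoded = latin1.GetString(new byte[] { 0xe9 });
        Console.WriteLine((int)decoded[0]); // 233 (0xe9)

        // And encoding goes straight back to the single byte
        byte[] encoded = latin1.GetBytes("\u00e9");
        Console.WriteLine(encoded.Length);  // 1
    }
}
```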

Streams, readers and writers

Streams are by their nature binary - they read and write bytes, fundamentally. Anything which takes a string is going to do some kind of conversion to bytes, which may or may not be what you want. The equivalents of streams for reading and writing text are System.IO.TextReader and System.IO.TextWriter respectively. If you have a stream already, you can use System.IO.StreamReader (which derives from TextReader) and System.IO.StreamWriter (which derives from TextWriter) respectively, constructing them with the stream and the encoding you wish to use. If you don't specify the encoding, UTF-8 is assumed. Here is some example code to convert a file from UTF-8 to UCS-2:


using System;
using System.IO;
using System.Text;
public class FileConverter
{
    const int BufferSize = 8096;
   
    public static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.WriteLine
                ("Usage: FileConverter <input file> <output file>");
            return;
        }
       
        // Open a TextReader for the appropriate file
        using (TextReader input = new StreamReader
              (new FileStream (args[0], FileMode.Open),
                Encoding.UTF8))
        {
            // Open a TextWriter for the appropriate file
            using (TextWriter output = new StreamWriter
                  (new FileStream (args[1], FileMode.Create),
                    Encoding.Unicode))
            {
                // Create the buffer
                char[] buffer = new char[BufferSize];
                int len;
               
                // Repeatedly copy data until we've finished
                while ( (len = input.Read (buffer, 0, BufferSize)) > 0)
                {
                    output.Write (buffer, 0, len);
                }
            }
        }
    }
}

Note that this demonstrates using the constructors for StreamReader and StreamWriter which take streams. There are also constructors which take filenames as parameters, so that you don't have to manually open a FileStream in your code. Other parameters, such as the buffer size and whether or not to detect a BOM if present, are available - see the documentation for more details.
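As a sketch of those extra parameters: here a file is written as little-endian UTF-16 (so the writer emits a BOM), then read back by a reader that was told UTF-8 but allowed to detect a BOM - it spots the UTF-16 BOM and switches encoding accordingly (the class name and temporary file are just for the demo):

```csharp
using System;
using System.IO;
using System.Text;

class BomDetectionDemo
{
    static void Main()
    {
        string path = Path.GetTempFileName();

        // Write as little-endian UTF-16; StreamWriter emits the BOM (0xff 0xfe)
        using (TextWriter writer = new StreamWriter(path, false, Encoding.Unicode))
        {
            writer.Write("caf\u00e9");
        }

        // Ask for UTF-8, but let the reader detect the BOM
        string text;
        using (TextReader reader = new StreamReader(path, Encoding.UTF8, true))
        {
            text = reader.ReadToEnd();
        }
        Console.WriteLine(text); // café

        File.Delete(path);
    }
}
```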

About the author

Jon Skeet United Kingdom

C# MVP currently living in Reading and working for Google.
