Unicode and .NET

Binary and text - a big distinction

Most modern computer languages (and some older ones) make a big distinction between "binary" content and "character" (or "text") content. The difference is largely the same as the instinctive one, but for the purposes of clarity, I'll define it here as:

  • Binary content is a sequence of octets (bytes in common parlance) with no intrinsic meaning attached. Even though there may be external means of understanding a piece of binary content to be, say, a picture, or an executable file, the content itself is just a sequence of bytes. (Note for pedantic readers: from now on, I won't use the word "octet". I'll use "byte" instead, even though strictly speaking a byte needn't be an octet. There have been architectures with 9-bit bytes, for instance. I don't believe that's a particularly relevant or useful distinction to make in this day and age, and readers are likely to be more comfortable with the word "byte".)
  • Character content is a sequence of characters.
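In .NET terms, this distinction maps directly onto types: binary content lives in byte arrays and streams, character content in strings, chars and text readers/writers. A minimal sketch (the values here are arbitrary):

    using System;

    class BinaryVersusText
    {
        static void Main()
        {
            // Binary content: a sequence of bytes with no intrinsic meaning attached.
            byte[] binary = { 0x48, 0x65, 0x6c, 0x6c, 0x6f };

            // Character content: a sequence of characters, each with meaning attached.
            string text = "Hello";

            Console.WriteLine(binary.Length); // 5 bytes
            Console.WriteLine(text.Length);   // 5 characters
        }
    }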

The Unicode Glossary defines a character as:

  1. The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader's understanding.
  2. Synonym for abstract character. (See Definition D3 in Section 3.3, Characters and Coded Representations.)
  3. The basic unit of encoding for the Unicode character encoding.
  4. The English name for the ideographic written elements of Chinese origin. (See ideograph (2).)

That may or may not be a terribly useful definition to you, but for the most part you can again use your instinctive understanding - a character is something like "the capital letter A", "the digit 1", etc. There are other, less obvious characters: combining characters such as "an acute accent", control characters such as "newline", and formatting characters (invisible, but affecting surrounding characters). The important thing is that these are all fundamentally "text" in some form or other; they have some meaning attached to them.
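As a quick sketch of one of the less obvious cases, a combining character in .NET is just another char in a string - here "e" followed by a combining acute accent produces one visible glyph but two characters:

    using System;

    class CombiningCharacterExample
    {
        static void Main()
        {
            // LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT (U+0301)
            string accented = "e\u0301";

            Console.WriteLine(accented);        // displays as "é" on a console that can render it
            Console.WriteLine(accented.Length); // 2 (two characters, one visible glyph)
        }
    }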

Now, unfortunately, this distinction has been very blurred in the past - C programmers are often used to thinking of "byte" and "char" as being interchangeable, to the extent that they will talk about reading a certain number of characters even when the content is entirely binary. In modern environments such as .NET and Java, where the distinction is clear and present in the IO libraries, this can lead to people attempting to copy binary files by reading and writing characters, resulting in corrupt output.
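To illustrate (just a sketch - the file names are hypothetical), here is the byte-based way to copy a file in .NET, followed by the character-based approach that corrupts it:

    using System.IO;

    class CopyExample
    {
        static void Main()
        {
            // Correct: treat the file as a sequence of bytes.
            using (FileStream input = File.OpenRead("source.bin"))
            using (FileStream output = File.Create("copy.bin"))
            {
                input.CopyTo(output);
            }

            // Broken: treating the file as text decodes and re-encodes it,
            // mangling any bytes which aren't valid in the assumed encoding.
            using (StreamReader reader = new StreamReader("source.bin"))
            using (StreamWriter writer = new StreamWriter("corrupt.bin"))
            {
                writer.Write(reader.ReadToEnd());
            }
        }
    }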

Where does Unicode come in?

The Unicode Consortium is a body trying to standardise the handling of character data, including its transformation to and from binary form (otherwise known as encoding and decoding). There is also a set of ISO standards (10646 in various versions) which do similar things; Unicode and ISO 10646 can largely be regarded as "the same thing", in that they are compatible in almost all respects. (In theory ISO 10646 defines a larger potential set of characters, but this is never likely to become an issue.) Most modern computer languages and environments, such as .NET and Java, use Unicode for character representations. Unicode defines, amongst other things:

  • an abstract character repertoire - the set of characters it covers;
  • a coded character set - a mapping from each character in the repertoire to a non-negative integer;
  • some character encoding forms - mappings from the non-negative integers in the coded character set to sequences of "code units" (eg bytes);
  • some character encoding schemes - mappings from sequences of code units to serialized byte sequences.

The difference between a character encoding form and a character encoding scheme is slightly subtle, but takes account of things like endianness. (For instance, the UCS-2 code unit 0xc2a9 may be serialized as the byte sequence 0xc2 0xa9 or 0xa9 0xc2, and it's the character encoding scheme that decides that.)
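As a rough illustration of the scheme distinction in .NET (just a sketch using the standard System.Text.Encoding classes), the copyright sign U+00A9 ends up as different byte sequences depending on the encoding chosen:

    using System;
    using System.Text;

    class EncodingSchemeExample
    {
        static void Main()
        {
            string copyright = "\u00a9"; // COPYRIGHT SIGN

            // UTF-16, little-endian scheme: A9 00
            Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(copyright)));

            // UTF-16, big-endian scheme: 00 A9
            Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(copyright)));

            // UTF-8, a different encoding form altogether: C2 A9
            Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(copyright)));
        }
    }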

The Unicode abstract character repertoire can, in theory, hold up to 1,114,112 characters, although many code points are reserved as invalid and the rest aren't all likely to ever be assigned. Each character is coded as an integer between 0 and 1,114,111 (0x10ffff). For instance, capital A is coded as 65. Until a few years ago, it was hoped that only characters in the range 0 to 2^16-1 would be required, which would have meant that each character would only have required 2 bytes to be represented. Unfortunately, more characters were needed, so surrogate pairs were introduced. They confuse things significantly (at least, they confuse me significantly) and most of the rest of this page will ignore their existence - I'll cover them briefly in the "nasty bits" section.
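A quick sketch of what "coded as an integer" looks like in .NET - capital A really is 65, and a character above 2^16-1 takes two UTF-16 code units (a surrogate pair):

    using System;

    class CodePointExample
    {
        static void Main()
        {
            Console.WriteLine((int) 'A'); // 65

            // MUSICAL SYMBOL G CLEF (U+1D11E) lies beyond 2^16-1...
            string clef = char.ConvertFromUtf32(0x1D11E);

            // ...so in UTF-16 it takes two code units: a surrogate pair.
            Console.WriteLine(clef.Length); // 2
        }
    }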
