Extensible Markup Language (XML) Tutorial

Character Sets

A character set determines which characters are allowed within a document. A restrictive character set allows only certain types of characters; for example, it may allow only uppercase characters. A broad character set, as its name suggests, allows many characters; for example, it may include Arabic characters.

ASCII

ASCII is a widely used character set. Each character in the ASCII character set is represented by a numeric character code. The ASCII character code for an uppercase "A" is 65, and the code for a lowercase "a" is 97. Pure ASCII is a 7-bit encoding scheme, allowing 128 different values. The extended character sets commonly referred to as ANSI widen ASCII to 8 bits to use the full range of 256 values available in a byte.
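The character codes and value ranges described above can be checked with a few lines of Python. This is only an illustrative sketch; the helper name is_pure_ascii is an assumption made for this example and is not part of any XML tooling.

```python
# Look up the numeric character code for a character.
upper_a = ord("A")
lower_a = ord("a")
print(upper_a, lower_a)

# chr() goes the other way, from a code back to its character.
assert chr(upper_a) == "A"

# 7 bits give 128 distinct values; 8 bits give 256.
ascii_values = 2 ** 7
byte_values = 2 ** 8


def is_pure_ascii(text):
    """Hypothetical helper: True if every character fits in 7-bit ASCII."""
    return all(ord(ch) < ascii_values for ch in text)


print(is_pure_ascii("XML"))
print(is_pure_ascii("caf\u00e9"))  # contains e-acute, code 233
```

Any character with a code of 128 or above, such as an accented letter, falls outside pure ASCII and needs a wider character set.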

Unicode

The designated character set for XML documents is Unicode, which includes characters from around the world. The Universal Character Set (UCS) is an ISO standard (ISO/IEC 10646) that encompasses most of the world's writing systems. UCS uses multi-octet characters, which are not compatible with many current applications and protocols. The UCS Transformation Formats (UTF) were developed to overcome this compatibility issue. The two most widely used encoding schemes for Unicode are UTF-8 and UTF-16. UTF-8 uses 8-bit code units and is compatible with 7-bit ASCII: each ASCII character is encoded as the same single byte, and every other character is represented by a sequence of two to four bytes. UTF-16 uses 16-bit code units; a single unit can represent 65,536 possible values, and characters outside that range are encoded as a pair of units (a surrogate pair).
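To make the difference between the two encodings concrete, the following Python sketch counts how many bytes a few sample characters occupy in each. The function name encoded_lengths and the choice of sample characters are assumptions made for illustration.

```python
def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-be" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


# Sample characters: Latin capital A, e-acute, euro sign, and an emoji
# that lies outside the range a single 16-bit unit can represent.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} byte(s), UTF-16 {utf16_len} byte(s)")

# ASCII text produces identical bytes whether encoded as ASCII or UTF-8.
same_bytes = "XML".encode("ascii") == "XML".encode("utf-8")
print(same_bytes)
```

The ASCII letter needs one byte in UTF-8 but two in UTF-16, while the emoji needs four bytes in both, which in UTF-16 means two 16-bit units.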

Specifying a Character Set

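The markup and the character data for the actual text of the document are both written in Unicode. Unless another encoding is declared, an XML parser assumes UTF-8 (or UTF-16 when the file begins with a byte order mark). Because plain ASCII text is also valid UTF-8, XML documents can be created in plain text editors.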

The XML declaration may optionally include the character encoding to be used. This allows you to specify an encoding other than UTF-8. Notepad for Windows in the UK uses windows-1252 encoding by default. As not all XML parsers understand windows-1252, it is better to declare the standard encoding ISO-8859-1, which is similar to the encoding used by Notepad (the two differ only in the byte range 0x80 to 0x9F, where windows-1252 places characters such as curly quotes and the euro sign). Notepad for Windows 2000 and XP can save documents in Unicode, allowing the encoding attribute to be omitted from the declaration. The following example specifies an encoding of ISO-8859-1.

<?xml version="1.0" encoding="ISO-8859-1"?>
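As a rough illustration of why the declaration matters, the Python sketch below parses the same ISO-8859-1 bytes with and without it, using the standard library's xml.etree.ElementTree module. The element name word and its content are invented for this example.

```python
import xml.etree.ElementTree as ET

# A tiny document with a hypothetical <word> element containing an e-acute.
declaration = '<?xml version="1.0" encoding="ISO-8859-1"?>'
body = "<word>caf\u00e9</word>"

# Encode to bytes the way an ISO-8859-1 text editor would save the file.
raw = (declaration + body).encode("iso-8859-1")

# The parser reads the declaration and decodes the bytes accordingly.
root = ET.fromstring(raw)
print(root.text)

# Without the declaration the parser assumes UTF-8, and the single byte
# 0xE9 is not a valid UTF-8 sequence, so parsing is rejected.
undeclared = body.encode("iso-8859-1")
try:
    ET.fromstring(undeclared)
    rejected = False
except ET.ParseError:
    rejected = True
print(rejected)
```

Declaring the encoding that the file was actually saved in is what lets the parser recover the accented character correctly.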

About the author

Gez Lemon United Kingdom
