Strings in .NET and C#

Memory Usage, Encoding and Internationalization Oddities

Memory usage

In the current implementation at least, strings take up 20+(n/2)*4 bytes (rounding the value of n/2 down), where n is the number of characters in the string. The string type is unusual in that the size of the object itself varies. The only other classes which do this (as far as I know) are arrays. Essentially, a string is a character array in memory, plus the length of the array and the length of the string (in characters).

The length of the array isn't always the same as the length in characters, as strings can be "over-allocated" within mscorlib.dll to make building them up easier. (StringBuilder does this, for instance.) While strings are immutable to the outside world, code within mscorlib can change the contents, so StringBuilder creates a string with a larger internal character array than the current contents require, then appends to that string until the character array is no longer big enough to cope, at which point it creates a new string with a larger array.

The string length member also contains a flag in its top bit to say whether or not the string contains any non-ASCII characters. This allows for extra optimisation in some cases.
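As a rough illustration, the size estimate above can be written as a small helper, and the over-allocation idea can be seen from the outside through StringBuilder's Capacity and Length properties. This is only a sketch: EstimatedStringBytes is a hypothetical name, and the figures it returns apply only to the implementation the formula describes.

```csharp
using System;
using System.Text;

public static class StringMemorySketch
{
    // The estimate quoted in the text: 20 + (n/2)*4 bytes, with n/2 rounded down.
    // C# integer division already rounds down for non-negative values.
    // Hypothetical helper; not part of the framework.
    public static int EstimatedStringBytes(int charCount)
    {
        return 20 + (charCount / 2) * 4;
    }

    public static void Main()
    {
        Console.WriteLine(EstimatedStringBytes(0));   // 20
        Console.WriteLine(EstimatedStringBytes(5));   // 28
        Console.WriteLine(EstimatedStringBytes(10));  // 40

        // Capacity is the room reserved; Length is the number of characters in use.
        StringBuilder builder = new StringBuilder(64);
        builder.Append("hello");
        Console.WriteLine(builder.Length);                      // 5
        Console.WriteLine(builder.Capacity >= builder.Length);  // True
    }
}
```

The exact capacity a StringBuilder reserves is an implementation detail, so the sketch only checks that it is at least as large as the length.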

Although strings aren't null-terminated as far as the API is concerned, the character array is null-terminated. This means it can be passed directly to unmanaged functions without any copying being involved, assuming the interop declaration specifies that the string should be marshalled as Unicode.
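The API side of this can be checked directly: because the length is stored separately, a null character is just another character as far as the string methods are concerned. A minimal sketch (the class and method names are made up for illustration):

```csharp
using System;

public static class EmbeddedNullSketch
{
    // A string with a null character in the middle. Hypothetical sample data.
    public static string WithEmbeddedNull()
    {
        return "ab\0cd";
    }

    public static void Main()
    {
        string text = WithEmbeddedNull();
        Console.WriteLine(text.Length);         // 5: the null counts as a character
        Console.WriteLine(text.IndexOf('\0'));  // 2
        Console.WriteLine(text.Substring(3));   // cd
    }
}
```

Unmanaged code which expects a null-terminated string would, of course, stop at the first null it sees, so data like this is best kept away from interop calls.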

Encoding

(If you don't know about character encodings and Unicode, please read my article on the subject first.)

As stated at the start of the article, strings are always in Unicode encoding. The idea of "a Big-5 string" or "a string in UTF-8 encoding" is a mistake (as far as .NET is concerned) and usually indicates a lack of understanding of either encodings or the way .NET handles strings. It's very important to understand this - treating a string as if it represented some valid text in a non-Unicode encoding is almost always a mistake.
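One way to keep this straight is to remember that an encoding only comes into play when converting between a string and bytes. The sketch below uses System.Text.Encoding.UTF8 for that conversion; the helper names ToUtf8 and FromUtf8 are hypothetical.

```csharp
using System;
using System.Text;

public static class EncodingSketch
{
    // Encode text as UTF-8 bytes. The result is binary data, not "a UTF-8 string".
    public static byte[] ToUtf8(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }

    // Decode UTF-8 bytes back into a string.
    public static string FromUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        string text = "caf\u00e9";  // four characters, the last being e-acute
        byte[] utf8 = ToUtf8(text);

        Console.WriteLine(text.Length);             // 4
        Console.WriteLine(utf8.Length);             // 5: e-acute takes two bytes in UTF-8
        Console.WriteLine(FromUtf8(utf8) == text);  // True
    }
}
```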

Now, the Unicode coded character set (one of the flaws of Unicode is that the one term is used for various things, including a coded character set and a character encoding scheme) contains more than 65536 characters. This means that a single char (System.Char) cannot cover every character. This leads to the use of surrogates where characters above U+FFFF are represented in strings as two characters. Essentially, string uses the UTF-16 character encoding form. Most developers may well not need to know much about this, but it's worth at least being aware of it.
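To see a surrogate pair in practice, here is a small sketch using U+1D11E (a musical clef symbol, chosen purely as an example of a character above U+FFFF). It relies on char.ConvertFromUtf32 and char.ConvertToUtf32; the FromCodePoint wrapper is a hypothetical name.

```csharp
using System;

public static class SurrogateSketch
{
    // Build a string from a single Unicode code point.
    public static string FromCodePoint(int codePoint)
    {
        return char.ConvertFromUtf32(codePoint);
    }

    public static void Main()
    {
        string clef = FromCodePoint(0x1D11E);

        Console.WriteLine(clef.Length);                              // 2: one character, two chars
        Console.WriteLine(char.IsHighSurrogate(clef[0]));            // True
        Console.WriteLine(char.IsLowSurrogate(clef[1]));             // True
        Console.WriteLine(char.ConvertToUtf32(clef, 0) == 0x1D11E);  // True
    }
}
```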

Culture and internationalization oddities

Some of the oddities of Unicode lead to oddities in string and character handling. Many of the string methods are culture-sensitive - in other words, what they do depends on the culture of the current thread. For example, what would you expect "i".ToUpper() to return? Most people would say "I", but in Turkish the correct answer is "İ" (Unicode U+0130, "Latin capital I with dot above"). To perform a culture-insensitive case change, you can use CultureInfo.InvariantCulture, and pass that to the overload of String.ToUpper which takes a CultureInfo.
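The difference can be sketched by passing an explicit culture rather than relying on the current thread. The helper names below are hypothetical, and the Turkish case assumes that culture data for "tr-TR" is available to the runtime, which is why no fixed output is claimed for it.

```csharp
using System;
using System.Globalization;

public static class CaseSketch
{
    // Culture-insensitive upper-casing.
    public static string UpperInvariant(string text)
    {
        return text.ToUpper(CultureInfo.InvariantCulture);
    }

    // Upper-casing with Turkish rules.
    // Assumption: the "tr-TR" culture is installed and usable on this machine.
    public static string UpperTurkish(string text)
    {
        return text.ToUpper(new CultureInfo("tr-TR"));
    }

    public static void Main()
    {
        Console.WriteLine(UpperInvariant("i") == "I");  // True

        // With Turkish casing rules the result should be U+0130 rather than "I".
        Console.WriteLine(UpperTurkish("i") == "\u0130");
    }
}
```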

There are further oddities when it comes to comparing, sorting, and finding the index of a substring. Some of these are culture-specific, and some aren't. For instance, in all cultures (as far as I can see), "lassen" and "la\u00dfen" (a "sharp S" or eszett being the Unicode-escaped character in there) are considered equal when CompareTo or Compare is used, but not when Equals is used. IndexOf will treat the eszett as the same as "ss", unless you use CompareInfo.IndexOf and specify CompareOptions.Ordinal as the options to use.
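Here is a sketch which puts the ordinal and culture-sensitive calls side by side. The class and helper names are hypothetical. Only the ordinal results are annotated, since the culture-sensitive ones depend on the current culture and on the runtime's globalization data.

```csharp
using System;
using System.Globalization;

public static class EszettSketch
{
    public const string Plain = "lassen";
    public const string Sharp = "la\u00dfen";  // contains the eszett

    // Ordinal search via CompareInfo.IndexOf, as suggested in the text.
    public static int OrdinalIndexOf(string source, string value)
    {
        CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;
        return compareInfo.IndexOf(source, value, CompareOptions.Ordinal);
    }

    public static void Main()
    {
        // Ordinal operations compare UTF-16 code units, so these are stable.
        Console.WriteLine(Plain.Equals(Sharp));          // False
        Console.WriteLine(OrdinalIndexOf(Plain, "ss"));  // 2
        Console.WriteLine(OrdinalIndexOf(Sharp, "ss"));  // -1

        // Culture-sensitive operations: results vary, so none are stated here.
        Console.WriteLine(string.Compare(Plain, Sharp));
        Console.WriteLine(Sharp.IndexOf("ss"));
    }
}
```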

Some other Unicode characters appear to be completely invisible to the normal IndexOf. Someone asked in the C# newsgroup why a search/replace method was going into an infinite loop. It was repeatedly using Replace to replace all double spaces with a single space, and checking whether or not it had finished by using IndexOf, so that multiple spaces would collapse to a single space. Unfortunately, this was failing due to a "strange" character in the original string between two spaces. IndexOf matched the double space, ignoring the extra character, but Replace didn't. I don't know which exact character was in the real data, but it can be easily reproduced using U+200C, which is a zero-width non-joiner character (whatever that means, exactly!). Put one of those in the middle of the text you're searching in, and IndexOf will ignore it, but Replace won't. Again, to make the two methods behave the same, you can use CompareInfo.IndexOf and pass in CompareOptions.Ordinal. My guess is that there's a lot of code which would fail on "awkward" data like this. (I wouldn't for a moment claim that all my code is immune, either.)
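A version of that loop which cannot disagree with itself might look something like the sketch below. CollapseSpaces is a hypothetical name, not the newsgroup poster's code. It uses the String.IndexOf overload which takes StringComparison.Ordinal instead of CompareInfo.IndexOf; either should do, as the point is simply that the check and the replacement both work ordinally.

```csharp
using System;

public static class CollapseSpacesSketch
{
    // Repeatedly replace double spaces with single spaces.
    // Both the check and the replacement are ordinal, so the loop ends
    // as soon as Replace has nothing left to change.
    public static string CollapseSpaces(string text)
    {
        while (text.IndexOf("  ", StringComparison.Ordinal) >= 0)
        {
            text = text.Replace("  ", " ");
        }
        return text;
    }

    public static void Main()
    {
        Console.WriteLine(CollapseSpaces("a    b") == "a b");   // True

        // A zero-width non-joiner between two spaces: there is no ordinal
        // double space here, so the text is returned unchanged.
        string awkward = "a \u200c b";
        Console.WriteLine(CollapseSpaces(awkward) == awkward);  // True
    }
}
```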

About the author

Jon Skeet United Kingdom

C# MVP currently living in Reading and working for Google.
