Strings in .NET and C#

21 Jul 2005 | by Jon Skeet

Memory Usage, Encoding and Internationalization Oddities

Memory usage

In the current implementation at least, strings take up 20+(n/2)*4 bytes (rounding the value of n/2 down), where n is the number of characters in the string. The string type is unusual in that the size of the object itself varies. The only other types which do this (as far as I know) are arrays. Essentially, a string is a character array in memory, plus the length of the array and the length of the string (in characters).

The length of the array isn't always the same as the length in characters, as strings can be "over-allocated" within mscorlib.dll to make building them up easier. (StringBuilder does this, for instance.) While strings are immutable to the outside world, code within mscorlib can change the contents. StringBuilder creates a string with a larger internal character array than the current contents require, then appends to that string until the character array is no longer big enough to cope, at which point it creates a new string with a larger array.

The string length member also contains a flag in its top bit to say whether or not the string contains any non-ASCII characters. This allows for extra optimisation in some cases.
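To make the arithmetic concrete, here is a minimal sketch of the size estimate quoted above, alongside a look at how StringBuilder reserves more room than it is currently using. The class and method names are made up for illustration, and the formula is only the article's estimate for the runtime of the time; other runtimes and 64-bit processes lay strings out differently, so treat the numbers as an illustration rather than a measurement.

```csharp
using System;
using System.Text;

public static class StringSizeSketch
{
    // The estimate quoted in the text: 20 + (n/2)*4 bytes, with n/2 rounded
    // down (integer division). This is not a measurement of any current runtime.
    public static int EstimatedBytes(int charCount)
    {
        return 20 + (charCount / 2) * 4;
    }

    public static void Main()
    {
        foreach (int n in new[] { 0, 1, 2, 3, 10 })
        {
            Console.WriteLine("{0} chars -> about {1} bytes", n, EstimatedBytes(n));
        }

        // A StringBuilder keeps spare capacity so that appends don't have to
        // reallocate every time. Capacity is never smaller than Length.
        var builder = new StringBuilder(16);
        builder.Append("abc");
        Console.WriteLine("Length={0}, Capacity={1}", builder.Length, builder.Capacity);
    }
}
```

Because n/2 is rounded down, strings of 2 and 3 characters come out at the same estimated size.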

Although strings aren't null-terminated as far as the API is concerned, the character array is null-terminated. This means it can be passed directly to unmanaged functions without any copying being involved, assuming the interop declaration specifies that the string should be marshalled as Unicode.

Encoding

(If you don't know about character encodings and Unicode, please read my article on the subject first.)

As stated at the start of the article, strings are always in Unicode encoding. The idea of "a Big-5 string" or "a string in UTF-8 encoding" is a mistake (as far as .NET is concerned) and usually indicates a lack of understanding of either encodings or the way .NET handles strings. It's very important to understand this - treating a string as if it represented some valid text in a non-Unicode encoding is almost always a mistake.
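One way to see the distinction is that an encoding only comes into play when text is converted to or from bytes. The sketch below (class and helper names are hypothetical) encodes the same string as UTF-8 and as UTF-16 using System.Text.Encoding: the byte arrays differ, but the string itself has no encoding of its own to change.

```csharp
using System;
using System.Text;

public static class EncodingSketch
{
    // Encodings describe bytes, not strings: encode on the way out...
    public static byte[] ToUtf8(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }

    // ...and decode on the way in.
    public static string FromUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        string text = "caf\u00e9";                      // "café", 4 chars
        byte[] utf8 = ToUtf8(text);                     // é needs 2 bytes in UTF-8
        byte[] utf16 = Encoding.Unicode.GetBytes(text); // 2 bytes per char here

        Console.WriteLine(text.Length);            // 4
        Console.WriteLine(utf8.Length);            // 5
        Console.WriteLine(utf16.Length);           // 8
        Console.WriteLine(FromUtf8(utf8) == text); // True
    }
}
```

If you find yourself wanting "a string in UTF-8", what you almost certainly want is the byte array.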

Now, the Unicode coded character set (one of the flaws of Unicode is that the one term is used for various things, including a coded character set and a character encoding scheme) contains more than 65536 characters. This means that a single char (System.Char) cannot cover every character. This leads to the use of surrogates where characters above U+FFFF are represented in strings as two characters. Essentially, string uses the UTF-16 character encoding form. Most developers may well not need to know much about this, but it's worth at least being aware of it.
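As a small illustration of surrogates, the sketch below uses U+1D11E (the musical G clef symbol) as an example of a character above U+FFFF. The class name and the CountCodePoints helper are hypothetical; the char methods are the standard System.Char ones.

```csharp
using System;

public static class SurrogateSketch
{
    // Counts Unicode code points rather than UTF-16 code units by treating
    // each well-formed surrogate pair as a single character.
    public static int CountCodePoints(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsHighSurrogate(text[i])
                && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                i++; // skip the low half of the pair
            }
            count++;
        }
        return count;
    }

    public static void Main()
    {
        string clef = "\U0001D11E"; // one character, two chars

        Console.WriteLine(clef.Length);                                // 2
        Console.WriteLine(char.IsSurrogatePair(clef, 0));              // True
        Console.WriteLine(char.ConvertToUtf32(clef, 0).ToString("X")); // 1D11E
        Console.WriteLine(CountCodePoints("a" + clef + "b"));          // 3
    }
}
```

Anything which indexes or truncates a string by char position can split such a pair, which is the main practical reason to be aware of them.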

Culture and internationalization oddities

Some of the oddities of Unicode lead to oddities in string and character handling. Many of the string methods are culture-sensitive - in other words, what they do depends on the culture of the current thread. For example, what would you expect "i".ToUpper() to return? Most people would say "I", but in Turkish the correct answer is "İ" (Unicode U+0130, "Latin capital I with dot above"). To perform a culture-insensitive case change, you can use CultureInfo.InvariantCulture, and pass that to the overload of String.ToUpper which takes a CultureInfo.
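Here is a minimal sketch of the two kinds of case change. The class and helper names are hypothetical. The Turkish part assumes the "tr-TR" culture is available on the machine; where it is, I would expect it to report U+0130, but some deployments strip culture data, so the code guards against that rather than promising a result.

```csharp
using System;
using System.Globalization;

public static class UpperCaseSketch
{
    // Culture-insensitive upper-casing, as described in the text.
    public static string UpperInvariant(string text)
    {
        return text.ToUpper(CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        Console.WriteLine(UpperInvariant("i")); // I

        try
        {
            // Assumption: Turkish culture data is installed.
            CultureInfo turkish = new CultureInfo("tr-TR");
            string upper = "i".ToUpper(turkish);
            Console.WriteLine("Turkish upper-case of i: U+{0:X4}", (int)upper[0]);
        }
        catch (CultureNotFoundException)
        {
            Console.WriteLine("tr-TR culture data is not available here.");
        }
    }
}
```

Later framework versions also provide String.ToUpperInvariant(), which does the same job as the helper above.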

There are further oddities when it comes to comparing, sorting, and finding the index of a substring. Some of these are culture-specific, and some aren't. For instance, in all cultures (as far as I can see), "lassen" and "la\u00dfen" (a "sharp S" or eszett being the Unicode-escaped character in there) are considered equal when CompareTo or Compare is used, but not when Equals is used. IndexOf will treat the eszett as the same as "ss", unless you use CompareInfo.IndexOf and specify CompareOptions.Ordinal as the options to use.
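The following sketch puts the culture-sensitive and ordinal calls side by side for the eszett example. The class and helper names are hypothetical. The culture-sensitive results are printed without a stated expected value, because they depend on the framework version and on the collation data the platform supplies; only the ordinal results are predictable everywhere.

```csharp
using System;
using System.Globalization;

public static class EszettSketch
{
    public const string Plain = "lassen";
    public const string Sharp = "la\u00dfen";

    // Ordinal search via CompareInfo.IndexOf, as suggested in the text.
    public static int OrdinalIndexOf(string source, string value)
    {
        return CultureInfo.InvariantCulture.CompareInfo
            .IndexOf(source, value, CompareOptions.Ordinal);
    }

    public static void Main()
    {
        // Culture-sensitive: results vary with platform and culture data.
        Console.WriteLine("Compare: {0}", string.Compare(Plain, Sharp));
        Console.WriteLine("IndexOf: {0}", Sharp.IndexOf("ss"));

        // Ordinal: compares the UTF-16 code units one by one.
        Console.WriteLine(Plain.Equals(Sharp));         // False
        Console.WriteLine(OrdinalIndexOf(Sharp, "ss")); // -1
    }
}
```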

Some other Unicode characters appear to be completely invisible to the normal IndexOf. Someone asked in the C# newsgroup why a search/replace method was going into an infinite loop. It was repeatedly using Replace to replace all double spaces with a single space, so that multiple spaces would collapse to a single space, and checking whether or not it had finished by using IndexOf. Unfortunately, this was failing due to a "strange" character in the original string between two spaces. IndexOf matched the double space, ignoring the extra character, but Replace didn't. I don't know which exact character was in the real data, but it can be easily reproduced using U+200C, which is a zero-width non-joiner character (whatever that means, exactly!). Put one of those in the middle of the text you're searching in, and IndexOf will ignore it, but Replace won't. Again, to make the two methods behave the same, you can use CompareInfo.IndexOf and pass in CompareOptions.Ordinal. My guess is that there's a lot of code which would fail on "awkward" data like this. (I wouldn't for a moment claim that all my code is immune, either.)
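As a sketch of the fix, here is one way the space-collapsing loop might be written so that the check and the replacement agree. The class and method names are hypothetical, and this is my reconstruction rather than the original poster's code. It uses the IndexOf overload taking StringComparison.Ordinal, available in later framework versions, in place of CompareInfo.IndexOf with CompareOptions.Ordinal.

```csharp
using System;

public static class CollapseSpacesSketch
{
    // Collapses runs of spaces to a single space. The termination check is
    // ordinal, matching the ordinal behaviour of Replace(string, string), so
    // the loop cannot keep "finding" a match which Replace will never change.
    public static string CollapseSpaces(string text)
    {
        while (text.IndexOf("  ", StringComparison.Ordinal) >= 0)
        {
            text = text.Replace("  ", " ");
        }
        return text;
    }

    public static void Main()
    {
        Console.WriteLine(CollapseSpaces("a    b")); // a b

        // A zero-width non-joiner between two spaces: ordinally there is no
        // double space here, so the text is returned unchanged.
        string awkward = "a \u200c b";
        Console.WriteLine(CollapseSpaces(awkward).Length); // 5
    }
}
```

With a culture-sensitive IndexOf in the loop condition instead, the second input is the kind of data which produced the infinite loop described above.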

About the author

Jon Skeet

C# MVP currently living in Reading and working for Google.

www.pobox.com
