Library tutorials & articles
Strings in .NET and C#
- Introduction
- Interning, Literals and the Debugger
- Memory Usage, Encoding and Internationalization od
Memory Usage, Encoding and Internationalization od
Memory usage
In the current implementation at least, strings take up 20+(n/2)*4 bytes (rounding the value of n/2 down), where n is the number of characters in the string. The string type is unusual in that the size of the object itself varies. The only other classes which do this (as far as I know) are arrays. Essentially, a string is a character array in memory, plus the length of the array and the length of the string (in characters). The length of the array isn't always the same as the length in characters, as strings can be "over-allocated" within mscorlib.dll, to make building them up easier. (StringBuilder does this, for instance.) While strings are immutable to the outside world, code within mscorlib can change the contents, so StringBuilder creates a string with a larger internal character array than the current contents requires, then appends to that string until the character array is no longer big enough to cope, at which point it creates a new string with a larger array. The string length member also contains a flag in its top bit to say whether or not the string contains any non-ASCII characters. This allows for extra optimisation in some cases.
Although strings aren't null-terminated as far as the API is concerned, the character array is null-terminated, as this means it can be passed directly to unmanaged functions without any copying being involved, assuming the inter-op specifies that the string should be marshalled as Unicode.
Encoding
(If you don't know about character encodings and Unicode, please read my article on the subject first.)
As stated at the start of the article, strings are always in Unicode encoding. The idea of "a Big-5 string" or "a string in UTF-8 encoding" is a mistake (as far as .NET is concerned) and usually indicates a lack of understanding of either encodings or the way .NET handles strings. It's very important to understand this - treating a string as if it represented some valid text in a non-Unicode encoding is almost always a mistake.
Now, the Unicode coded character set (one of the flaws of Unicode is that the one term is used for various things, including a coded character set and a character encoding scheme) contains more than 65536 characters. This means that a single char (System.Char) cannot cover every character. This leads to the use of surrogates where characters above U+FFFF are represented in strings as two characters. Essentially, string uses the UTF-16 character encoding form. Most developers may well not need to know much about this, but it's worth at least being aware of it.
Culture and internationalization oddities
Some of the oddities of Unicode lead to oddities in string and character handling. Many of the string methods are culture-sensitive - in other words, what they do depends on the culture of the current thread. For example, what would you expect "i".toUpper() to return? Most people would say "I", but in Turkish the correct answer is "I" (Unicode U+0130, "Latin capital I with dot above"). To perform a culture-insensitive case change, you can use CultureInfo.InvariantCulture, and pass that to the overload of String.ToUpper which takes a CultureInfo.
There are further oddities when it comes to comparing, sorting, and finding the index of a substring. Some of these are culture-specific, and some aren't. For instance, in all cultures (as far as I can see), "lassen" and "la\u00dfen" (a "sharp S" or eszett being the Unicode-escaped character in there) are considered equal when CompareTo or Compare are used, but not when Equals is used. IndexOf will treat the eszett as the same as "ss", unless you use a CompareInfo.IndexOf and specify CompareOptions.Ordinal as the options to use.
Some other unicode character appear to be completely invisible to the normal IndexOf. Someone asked in the C# newsgroup why a search/replace method was going into an infinite loop. It was repeatedly using Replace to replace all double spaces with a single space, and checking whether or not it had finished by using IndexOf, so that multiple spaces would collapse to a single space. Unfortunately, this was failing due to a "strange" character in the original string between two spaces. IndexOf matched the double space, ignoring the extra character, but Replace didn't. I don't know which exact character was in the real data, but it can be easily reproduced using U+200C which is a zero-width non-joiner character (whatever that means, exactly!). Put one of those in the middle of the text you're searching in, and IndexOf will ignore it, but Replace won't. Again, to make the two methods behave the same, you can use CompareInfo.IndexOf and pass in CompareOptions.Ordinal. My guess is that there's a lot of code which would fail on "awkward" data like this. (I wouldn't for a moment claim that all my code is immune, either.)
Related articles
Related discussion
-
Binary Studio | software development outsourcing Ukraine
by shane124 (4 replies)
-
Chart insertation in a windows form...
by pdhanik (1 replies)
-
Point of Sale Developers: Hardware & C# SDK
by ManiGovindan (7 replies)
-
help with the remote frame buffer protocol from real VNC
by poison (0 replies)
-
Need help making a complete program editable, C# or .net I think
by davelee (1 replies)
Related podcasts
-
A Practical Look at Silverlight 2 Part 1
Now that Silverlight 2 is at the Olympics and making a big splash, we wanted to explore this fascinating technology more. Microsoft Silverlight 2 is a cross-browser, cross-platform, and cross-device plug-in for delivering the next generation of .NET based media experiences and rich interactive ap...
Events coming up
-
Dec
9
GL.net Group Meeting - December 2009
Gloucester, United Kingdom
The beginning of this year holiday season will belong to mocks. Ronnie and Stephen will take us for a tour around exciting world of unit testing.
The C# decimal keyword denotes a 128-bit data type. Compared to floating-point types, the decimal type has a greater precision and a smaller range, which makes it suitable for financial and monetary calculations. http://www.deutsches-keno.de
!--removed tag-->I have someString1, but it is read from a file. I want it to appear as someString2 after calling some method. how to win roulette
!--removed tag-->Hi, i read 2 of your articles they are interesting. However, I can not figure out what or where the contents come from for the variable
readonly string[] LowNames and how it is used. I would appreciate a short description.
Do you mean you don't know how the array is populated, or you don't know why I populated it with the names I did?
To answer the first question - if you look at the code, you'll see there's a static field initializer:
static readonly string[] LowNames =
{
"NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
"BS", "HT", "LF", "VT", "FF", "CR", "SO", "SI",
"DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
"CAN", "EM", "SUB", "ESC", "FS", "GS", "RS", "US"
};
The names come from the man page for ASCII on a unix box
Jon
I'm working on project which converts differnt forms of temperature like Celcius, Farenheit and kelvin etc..
I also have to put up access keys to it for each conversion. Please help me with conversion function and access keys. I also have to put up access keys for reset as well as exit buttons on the form. What is is double type variable?
I'm not at all sure what this has to do with Strings, but the C# type for double precision binary floating point values is "double".
See my article on floating point arithmetic for more information. It's at http://www.pobox.com/~skeet/csharp/floatingpoint.html
(Sorry, the "insert link" button doesn't seem to work in Firefox.)
Jon
Hi, i read 2 of your articles they are interesting. However, I can not figure out what or where the contents come from for the variable
readonly string[] LowNames and how it is used. I would appreciate a short description.
I'm working on project which converts differnt forms of temperature like Celcius, Farenheit and kelvin etc..
I also have to put up access keys to it for each conversion. Please help me with conversion function and access keys. I also have to put up access keys for reset as well as exit buttons on the form. What is is double type variable?
This thread is for discussions of Strings in .NET and C#.