Serializing DataSets

This article was originally published in VSJ, which is now part of Developer Fusion.
The ADO.NET DataSet object is an in-memory container, and is often used to store the whole data set of an application. The DataSet object can be populated with any number of relational data tables, and optionally contains relationships between parent and child tables. Views, constraints, and primary keys are other database-like features that the object fully supports.

The DataSet is an excellent and powerful in-memory data store for many .NET applications – in particular, for disconnected applications. The DataSet is not connected to any data source, supports various forms of serialization, and can be easily passed across the tiers of a distributed application. The DataSet is one of the few ADO.NET objects that support .NET runtime object serialization (the other is the DataTable). In addition, it can be serialized to XML in two different ways. The embedded XML API (WriteXml and ReadXml methods) persists the DataSet to the ADO.NET normal form – a stateless format that only contains the current data – or to the DiffGram format, a stateful schema that also includes pending changes and errors. You can also serialize the contents of a DataSet using the XML serializer class.

In summary, when it comes to designing distributed applications, the DataSet seems to be a perfect fit. It is good at so many things that not using it sounds like a masochistic form of under-optimization. Is it really true? Read on.

Make your serialization choice

As detailed in the table below, the DataSet class supports a variety of serialization methods.

The DataSet serialization options

Option: Description

Embedded API: Based on the built-in WriteXml and ReadXml methods, this technique allows developers to persist the current contents of the DataSet object to a file or stream. The output format is XML, but you can choose between two different schemas: normal form and DiffGram. The DiffGram schema is ideal for desktop applications that can work unplugged and let users operate offline.

Runtime Object Serialization: This is the standard serialization engine shared by all serializable objects in the .NET Framework. Depending on the class architecture, the serialization engine, or the class itself, writes data to the serialization buffer, which is then flushed to a disk or memory stream. .NET Remoting uses runtime serialization to move the DataSet across AppDomains. ASP.NET also takes advantage of this technique to copy the DataSet to a remote Session data store (i.e., SQL Server or state server).

XML Serializer: The XmlSerializer class is the tool that the .NET Web service infrastructure uses to serialize the return values of a Web service method. The XmlSerializer has built-in support for simple types (e.g., strings, numbers, dates, arrays) plus all the types (e.g., the DataSet) that implement the IXmlSerializable interface. The DataSet is serialized in a special XML format that groups the schema of the DataSet and its DiffGram representation under a common node.

Each method is designed to target a particular scenario. For example, the XML serializer class is implicitly used when the DataSet is the return value, or the input parameter, of a Web service public method. The embedded XML API works well when you want to back up the contents locally for further reuse. Finally, the runtime serialization comes in handy when the object is transferred across the network or moved to another AppDomain. Let’s review each usage in a bit more detail to identify possible limitations.

The embedded XML API

The WriteXml method (and its reading counterpart ReadXml) renders the current content of the DataSet to an XML schema. The schema can be the DataSet’s XML normal form or the DiffGram. The normal form is obtained as follows:
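// No schema is generated
dataSet.WriteXml(stream1, XmlWriteMode.IgnoreSchema);

// Schema information is included
dataSet.WriteXml(stream2, XmlWriteMode.WriteSchema);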
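In normal form, a DataSet named MyDataSet that contains two tables – Users and Roles – with two records each is rendered as a MyDataSet root element with one child element per record, each named after its table.
The listing below shows the structure of a DataSet DiffGram. The DiffGram also stores changes to the DataSet and unresolved errors:
<diffgr:diffgram
	xmlns:msdata="urn:schemas-microsoft-com:xml-msdata"
	xmlns:diffgr="urn:schemas-microsoft-com:xml-diffgram-v1">
	<MyDataSet>
		[normal form goes here]
	</MyDataSet>
	<diffgr:before>
		<Users diffgr:id="Users1" msdata:rowOrder="0">
			...
		</Users>
	</diffgr:before>
	<diffgr:errors>
		<Roles diffgr:id="Roles2" diffgr:Error="Invalid !!!" />
	</diffgr:errors>
</diffgr:diffgram>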
You obtain the structure programmatically through the following code:
// Create a DiffGram
dataSet.WriteXml(stream3, XmlWriteMode.DiffGram);
The DiffGram is more verbose but contains more information. It consists of three main blocks: current snapshot, pending changes, and unresolved errors. The current snapshot is nearly identical to the normal form. The minor difference is a set of extra attributes that give each record a unique ID. This ID is used to track errors and updates to the record.
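The three write modes can be compared side by side with a small in-memory DataSet. The following is only a sketch: the Users table, its columns, and the helper names (WriteXmlModesDemo, BuildSample, Write) are made up for illustration and are not part of the original sample code.

```csharp
using System;
using System.Data;
using System.IO;

public static class WriteXmlModesDemo
{
    // Builds a tiny DataSet; table and column names are hypothetical.
    public static DataSet BuildSample()
    {
        DataSet ds = new DataSet("MyDataSet");
        DataTable users = ds.Tables.Add("Users");
        users.Columns.Add("ID", typeof(int));
        users.Columns.Add("Name", typeof(string));
        users.Rows.Add(1, "Joe");
        users.Rows.Add(2, "Mary");
        ds.AcceptChanges();   // mark every row as unchanged
        return ds;
    }

    // Renders the DataSet to a string using the requested write mode.
    public static string Write(DataSet ds, XmlWriteMode mode)
    {
        StringWriter writer = new StringWriter();
        ds.WriteXml(writer, mode);
        return writer.ToString();
    }

    public static void Main()
    {
        DataSet ds = BuildSample();
        Console.WriteLine(Write(ds, XmlWriteMode.IgnoreSchema)); // normal form, data only
        Console.WriteLine(Write(ds, XmlWriteMode.WriteSchema));  // normal form plus inline schema
        Console.WriteLine(Write(ds, XmlWriteMode.DiffGram));     // DiffGram
    }
}
```

Printing the three strings makes the extra diffgr:id and msdata:rowOrder attributes of the DiffGram easy to spot next to the plain normal form.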

When the DataSet is first populated, all of its records are marked as unchanged. As users work on the data and enter changes, the state of affected rows evolves accordingly. For example, at a certain point you can have rows marked for deletion, and rows marked as updated or added. These pending changes can be accepted or rejected at any time using the AcceptChanges and RejectChanges methods. If changes are accepted, the row state turns to unchanged, meaning that the current value is now considered to be the original value. If changes are rejected, the original values are restored and the pending changes are cancelled.
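The row state transitions can be watched directly on a DataTable. This is a minimal sketch with made-up data; RowStateDemo, BuildTable, ApplyEdits, and States are hypothetical helper names.

```csharp
using System;
using System.Data;

public static class RowStateDemo
{
    // Three unchanged rows in a hypothetical Users table.
    public static DataTable BuildTable()
    {
        DataTable users = new DataTable("Users");
        users.Columns.Add("ID", typeof(int));
        users.Columns.Add("Name", typeof(string));
        users.Rows.Add(1, "Joe");
        users.Rows.Add(2, "Mary");
        users.Rows.Add(3, "Sam");
        users.AcceptChanges();
        return users;
    }

    // One update, one deletion, one insertion.
    public static void ApplyEdits(DataTable users)
    {
        users.Rows[0]["Name"] = "Joseph";
        users.Rows[1].Delete();
        users.Rows.Add(4, "Ann");
    }

    // Comma-separated RowState of every row still in the collection.
    public static string States(DataTable t)
    {
        string[] states = new string[t.Rows.Count];
        for (int i = 0; i < t.Rows.Count; i++)
            states[i] = t.Rows[i].RowState.ToString();
        return string.Join(",", states);
    }

    public static void Main()
    {
        DataTable users = BuildTable();
        ApplyEdits(users);
        Console.WriteLine(States(users));   // pending changes

        users.RejectChanges();              // roll back to the original values
        Console.WriteLine(States(users));

        ApplyEdits(users);
        users.AcceptChanges();              // commit: current values become original
        Console.WriteLine(States(users));
    }
}
```

Note that a deleted row stays in the Rows collection, marked as Deleted, until the changes are accepted.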

Each row can also be associated with an error. Note that this is only a logical condition that does not affect the functionality of the DataSet. Basically, to put a row in error all that you have to do is write something to the DataRow’s RowError property.

The DiffGram is a stateful format and contains a faithful representation of the object. In contrast, the normal form just takes a snapshot of the current data.
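The stateful nature of the DiffGram can be checked with a round trip. The sketch below assumes a hypothetical Roles table; it writes a DataSet with one pending change and one row error to a DiffGram, then reads it back into an empty clone with the same schema (a DiffGram carries no schema of its own). The helper names are invented for this example.

```csharp
using System;
using System.Data;
using System.IO;

public static class DiffGramStateDemo
{
    // A DataSet with one modified row and one row flagged with an error.
    public static DataSet BuildEdited()
    {
        DataSet ds = new DataSet("MyDataSet");
        DataTable roles = ds.Tables.Add("Roles");
        roles.Columns.Add("ID", typeof(int));
        roles.Columns.Add("Role", typeof(string));
        roles.Rows.Add(1, "Admin");
        roles.Rows.Add(2, "Guest");
        ds.AcceptChanges();

        roles.Rows[0]["Role"] = "Administrator";   // pending change
        roles.Rows[1].RowError = "Invalid !!!";    // logical error only
        return ds;
    }

    public static string ToDiffGram(DataSet ds)
    {
        StringWriter writer = new StringWriter();
        ds.WriteXml(writer, XmlWriteMode.DiffGram);
        return writer.ToString();
    }

    // Loads the DiffGram into an empty DataSet that shares the schema.
    public static DataSet FromDiffGram(DataSet schemaSource, string xml)
    {
        DataSet copy = schemaSource.Clone();       // schema only, no rows
        copy.ReadXml(new StringReader(xml), XmlReadMode.DiffGram);
        return copy;
    }

    public static void Main()
    {
        DataSet edited = BuildEdited();
        string xml = ToDiffGram(edited);
        Console.WriteLine(xml);

        DataSet copy = FromDiffGram(edited, xml);
        foreach (DataRow row in copy.Tables["Roles"].Rows)
            Console.WriteLine("{0}: {1} {2}", row["ID"], row.RowState, row.RowError);
    }
}
```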

Runtime object serialization

The runtime object serialization is the .NET Framework technology behind class serialization. To be serializable, a class must be marked with the [Serializable] attribute. This attribute enables .NET formatter objects – the BinaryFormatter and SoapFormatter classes – to operate on the class and extract any fields that are worth serializing. Reflection is used to accomplish this task. Information about the assembly, the structure of the class, and current values are stored in the binary or SOAP stream of choice.
// Save the object to a binary stream
BinaryFormatter bin =
	new BinaryFormatter();
bin.Serialize(stream, obj);
By default, the class has no control over this whole process. To gain full control over the serialization, the class must implement the ISerializable interface. In this case, the .NET formatter yields to the class by calling the GetObjectData method on the interface and passing a memory buffer. The class fills the buffer with any data it wants to persist. At this point, the class is completely responsible for its own serialization data and format. When GetObjectData returns, the formatter just flushes the contents of the buffer to the output stream.

The DataSet class implements the ISerializable interface and makes itself personally responsible for what’s sent during cross domain and network transfers. Runtime object serialization, in fact, is used by .NET Remoting to move the DataSet across AppDomains and by ASP.NET to store a copy of the DataSet to Session, when the Session object is implemented as a remote service. I’ll return to this point in a moment.

The XML serializer

The XML serialization generates an output similar to that of the embedded XML API. The mechanism of XML serialization, though, is nearly identical to runtime object serialization. Basically, an external class – the XmlSerializer class – is instantiated, and set to work on the DataSet type. The XmlSerializer class is not as rich and powerful as a .NET formatter. In particular, the XmlSerializer doesn’t support private and protected members and can’t handle circular references among class member types.

A circular reference is when the parent object contains a reference to a child object, and the child object in turn contains a reference to the parent. For example, the DataSet contains references to child DataTable objects; in turn, a DataTable supports a property that points to the parent DataSet. According to this definition, the DataSet wouldn’t be XML serializable.

The XmlSerializer class, though, also supports all .NET classes that implement the IXmlSerializable interface. The DataSet class is just one of these classes. The XML serializer uses the WriteXml method on the interface to return a DataSet over a Web method. You can serialize the DataSet through the XmlSerializer programmatically as shown below.

XmlSerializer ser;
ser = new XmlSerializer(typeof(DataSet));
ser.Serialize(stream, dataSet);
What you get through the XmlSerializer is slightly different from what you can get using the XML embedded API. In particular, the output of the XmlSerializer class for a DataSet is a DiffGram extended with schema information. This combination (diffgram plus schema) is impossible to obtain using the DataSet’s embedded XML API.
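A quick way to see the "DiffGram plus schema" combination is to serialize a small DataSet to a string and look for both blocks. This is a sketch with an invented Users table; XmlSerializerDemo and its methods are hypothetical names.

```csharp
using System;
using System.Data;
using System.IO;
using System.Xml.Serialization;

public static class XmlSerializerDemo
{
    // A tiny DataSet with made-up content.
    public static DataSet BuildSample()
    {
        DataSet ds = new DataSet("MyDataSet");
        DataTable users = ds.Tables.Add("Users");
        users.Columns.Add("ID", typeof(int));
        users.Columns.Add("Name", typeof(string));
        users.Rows.Add(1, "Joe");
        ds.AcceptChanges();
        return ds;
    }

    // Serializes the DataSet through the XmlSerializer class.
    public static string Serialize(DataSet ds)
    {
        XmlSerializer ser = new XmlSerializer(typeof(DataSet));
        StringWriter writer = new StringWriter();
        ser.Serialize(writer, ds);
        return writer.ToString();
    }

    public static void Main()
    {
        string xml = Serialize(BuildSample());
        Console.WriteLine(xml);
        Console.WriteLine("Has schema:   " + xml.Contains("xs:schema"));
        Console.WriteLine("Has DiffGram: " + xml.Contains("diffgr:diffgram"));
    }
}
```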

Compare the size of the output

It is important to note that all the aforementioned ways to serialize the contents of a DataSet reduce to different usages of the embedded XML API. The runtime object serialization makes calls to the ISerializable::GetObjectData method; the XML serializer makes calls to the IXmlSerializable::WriteXml method. Both methods end up calling the DataSet’s WriteXml method, although in different orders and with different parameters. This statement should give you pause for thought, especially if you’re extensively using .NET Remoting to move DataSet objects around the system.

The crucial point is that, despite what is apparently happening on surface, the DataSet always serializes itself to XML.

XML has so many qualities that it would take a book to list them all. Unfortunately, though, XML doesn’t include compactness among its good qualities. To appreciate this point, run the code shown below to fill a DataSet and save it to a local file using the three serialization APIs:

using System.Data;
using System.Data.SqlClient;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;
using System.Xml.Serialization;

// Fill the DataSet with the result set of the query
DataSet ds = new DataSet();
SqlDataAdapter da = new SqlDataAdapter(
	"SELECT * FROM [order details]", "SERVER=…;DATABASE=…;UID=…;");
da.Fill(ds);

// Save using the Runtime Object Serialization
FileStream fs1 = new FileStream("data_ser.dat", FileMode.Create);
BinaryFormatter bin = new BinaryFormatter();
bin.Serialize(fs1, ds);
fs1.Close();

// Save using the XML Serializer
FileStream fs2 = new FileStream("data_ser.xml", FileMode.Create);
XmlSerializer ser = new XmlSerializer(typeof(DataSet));
ser.Serialize(fs2, ds);
fs2.Close();

// Save using the Embedded XML API
FileStream fs3 = new FileStream("data_api.xml", FileMode.Create);
ds.WriteXml(fs3);
fs3.Close();
After filling a DataSet object with the result set of a database query, the code saves the DataSet’s contents to disk using the various APIs discussed so far: Embedded XML API, binary serialization, and XML serialization. It’s worth noting that the query command selects more than 2,000 records from the Northwind database. Sizewise, the most compact output belongs to the embedded XML API. The binary formatter generates an output only a bit (about 4%) larger. The XML serializer returns a much larger output, approximately 50% larger. That the XML serializer is the least efficient is not surprising. It generates a DiffGram, a significantly more verbose schema than normal form, and adds schema information too. The results of the binary formatter, however, deserve more attention.
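If no Northwind database is at hand, the gap between the embedded XML API and the XML serializer can still be observed with synthetic data. The sketch below fills a hypothetical OrderDetails table with generated rows and measures both outputs in memory. The table layout, the values, and the helper names are assumptions, so the numbers it prints will not match the figures quoted above.

```csharp
using System;
using System.Data;
using System.IO;
using System.Xml.Serialization;

public static class SizeComparisonDemo
{
    // Generates rowCount synthetic rows; not real Northwind data.
    public static DataSet BuildOrders(int rowCount)
    {
        DataSet ds = new DataSet("Orders");
        DataTable t = ds.Tables.Add("OrderDetails");
        t.Columns.Add("OrderID", typeof(int));
        t.Columns.Add("ProductID", typeof(int));
        t.Columns.Add("UnitPrice", typeof(decimal));
        t.Columns.Add("Quantity", typeof(short));
        for (int i = 0; i < rowCount; i++)
            t.Rows.Add(10248 + i, 1 + i % 77, 14.5m, (short)(1 + i % 20));
        ds.AcceptChanges();
        return ds;
    }

    // Size in bytes of the normal form written by the embedded XML API.
    public static long NormalFormSize(DataSet ds)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            ds.WriteXml(ms);
            return ms.Length;
        }
    }

    // Size in bytes of the XmlSerializer output (schema plus DiffGram).
    public static long XmlSerializerSize(DataSet ds)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            XmlSerializer ser = new XmlSerializer(typeof(DataSet));
            ser.Serialize(ms, ds);
            return ms.Length;
        }
    }

    public static void Main()
    {
        DataSet ds = BuildOrders(2000);
        Console.WriteLine("Embedded XML API: {0} bytes", NormalFormSize(ds));
        Console.WriteLine("XML serializer:   {0} bytes", XmlSerializerSize(ds));
    }
}
```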

Why is the output of a binary tool so large – even larger than a plain XML? The answer to this question is hinted at in a previous statement – the one that should have set you thinking.

However you implement it, the DataSet serialization always passes through one or more calls to the WriteXml embedded method. This means that whatever serialization mechanism you use, and whatever type of output stream you employ, a serialized DataSet is always made of plenty of XML data. This may have serious implications to .NET Remoting and ASP.NET applications. Let’s first understand why this is so, and then devise some possible workarounds.

Serialize the DataSet to a binary stream

When you serialize an object to a binary formatter, the Serialize method uses a binary writer object to send the object graph to the stream you provide. If the object being serialized doesn’t implement the ISerializable interface, the formatter issues a call to the FormatterServices.GetSerializableMembers method and obtains the list of serializable members for the specified object type. The static method returns an array of MemberInfo objects. The formatter scrolls down the array and writes data to the underlying stream. Any data is written using a binary writer. In addition, a header with assembly and type information is inserted at the beginning of the stream. The .NET formatter controls the serialization process and guarantees that data is copied faithfully – i.e. numbers are copied as numbers, strings as strings, binary objects as binary objects. The size of the resulting data is kept to a reasonable size, and represents the best you can obtain without adding an extra (and explicit) compression step.

Let’s see what happens when a DataSet object is serialized. As mentioned earlier, the DataSet class implements the ISerializable interface, a factor that changes the way in which the formatter works. In this case, the binary formatter passes control of the process to the class itself by invoking the GetObjectData method on the ISerializable interface. The listing below shows some pseudo-code that illustrates the behaviour of the ISerializable::GetObjectData method:

void GetObjectData(SerializationInfo info, StreamingContext context)
{
	// Add schema information
	string schema = GetXmlSchema();
	info.AddValue("XmlSchema", schema);

	// Add data information (diffgram)
	StringWriter writer = new StringWriter();
	XmlTextWriter textWriter = new XmlTextWriter(writer);
	WriteXml(textWriter, XmlWriteMode.DiffGram);
	string diffgram = writer.ToString();
	info.AddValue("XmlDiffGram", diffgram);
}

The signature of the method is shown below.

void GetObjectData(
	SerializationInfo info,
	StreamingContext context)
The GetObjectData method receives two arguments – a SerializationInfo object and the streaming context. The streaming context indicates the source or destination of the bits that the formatter is using. In other words, it provides information as to whether the data being copied are going to be transferred to a file, or moved across processes or machines.

The SerializationInfo object holds all the data needed to serialize or deserialize an object. It is a sort of super-array that gets filled with property names, types, and values. The interaction between a serializable class and the SerializationInfo object is pretty simple. As you can see in the pseudo-code listing above, the class invokes the AddValue method on the SerializationInfo object to pass a name/value pair. The name may refer to a property defined on the class but, more generally, is just the descriptive name of a value. Likewise, the value can be the value of a property or the result of an expression calculated on the existing properties. The class is completely free to store data using any logical schema that the author reckons optimal.
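The name/value mechanics can be observed without a formatter by creating a SerializationInfo by hand and calling GetObjectData directly. The Temperature class below is purely hypothetical and has nothing to do with the DataSet; it stores one real field and one computed value to show that the entry names are free-form. This is a .NET Framework-style sketch, and newer .NET releases may flag some of these legacy serialization types as obsolete.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.Serialization;

// Hypothetical class that takes control of its own serialization.
[Serializable]
public class Temperature : ISerializable
{
    public double Celsius;

    public Temperature(double celsius) { Celsius = celsius; }

    // Called by a formatter on deserialization.
    protected Temperature(SerializationInfo info, StreamingContext context)
    {
        Celsius = info.GetDouble("Celsius");
    }

    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        info.AddValue("Celsius", Celsius);                    // a field value
        info.AddValue("Fahrenheit", Celsius * 9 / 5 + 32);    // a computed value
    }
}

public static class SerializationInfoDemo
{
    // Plays the formatter's role: hands the object an empty buffer to fill.
    public static SerializationInfo Capture(ISerializable obj)
    {
        SerializationInfo info = new SerializationInfo(
            obj.GetType(), new FormatterConverter());
        obj.GetObjectData(info, new StreamingContext(StreamingContextStates.File));
        return info;
    }

    // Lists the entry names the object chose to write.
    public static List<string> EntryNames(SerializationInfo info)
    {
        List<string> names = new List<string>();
        foreach (SerializationEntry entry in info)
            names.Add(entry.Name);
        return names;
    }

    public static void Main()
    {
        SerializationInfo info = Capture(new Temperature(100));
        foreach (SerializationEntry entry in info)
            Console.WriteLine("{0} = {1}", entry.Name, entry.Value);
    }
}
```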

info.AddValue("XmlSchema", schema);
The DataSet describes its contents using two abstract elements – schema and diffgram – and doesn’t adjust its serialization algorithm according to streaming context. In other words, it always renders itself as a diffgram with schema. It goes without saying that any text data (e.g. XML data) is written to a binary stream as-is, with no form of implicit compression like the compression that may result for numbers.

The impact of this little-known feature on the scalability and performance of distributed systems is obvious. When a DataSet is persisted to disk, remoted, or transferred over a network, it travels as a large XML chunk of data. Depending on the volume of data that your system handles, these could easily be tens of megabytes. As a matter of fact, in the .NET Framework 1.x, using the DataSet in a multi-tier application is a double-edged sword. It is easy to code and effective from a functional viewpoint; it is hardly optimal performance-wise because you move more bytes than the actual size of the data set itself. With an apparently simpler DataTable object, things are even worse due to a bug in the .NET Framework 1.x. The size of a binary serialized DataTable is even larger than the size of a DataSet that contains only that table!

It’s not only applications using .NET Remoting that should be reviewed. The blacklist of applications that might be affected expands to include ASP.NET applications as well. To be honest, only a specific subset of ASP.NET applications are at risk. Using DataSet and DataTable objects in ASP.NET is effective and risk-free, as long as you don’t store these objects in the session state. Storing DataSet and DataTable objects in Session is a dangerous behaviour only for Web farms and for applications that implement a distributed session state management. If the session state is held in memory, no extra serialization cost is ever paid for data objects. When the session is stored in SQL Server or in the ASP.NET state server – a Windows NT Service living in another AppDomain – placing a DataSet or a DataTable in Session results in the binary serialization of the object with the drawbacks we’ve discussed so far.

Find a workaround

The serialization issue of the DataSet object is likely to be resolved in the next version of the .NET Framework, codenamed Whidbey (although the problem wasn’t addressed in the PDC build of Whidbey). In the meantime, you can use a custom class derived from the DataSet (or DataTable) that serializes in a truly binary way. The following code snippet shows the header of the class.
[Serializable]
class BinDataSet : DataSet, ISerializable
{
	public BinDataSet() : base() {}
	public BinDataSet(string name) :
		base(name) {}
	protected BinDataSet(
		SerializationInfo si,
		StreamingContext context)
	{
		// Rebuild tables and relations from si here
	}
}
The class must be serializable and implement the ISerializable interface. It needs to have a protected constructor with the typical serialization signature. That constructor will be called when the formatter deserializes the object. The class will serialize in the GetObjectData method and deserialize in the protected constructor. The schema of the serialized data is up to you, and admittedly coming up with a general schema suitable for all uses and scenarios is not an easy task. The idea is that the whole content of a DataSet must be written to the SerializationInfo buffer. This includes tables, relations, extended properties, and views, plus a bunch of properties like DataSetName, Prefix, and HasErrors, just to name a few.

Serializing a DataTable means serializing column information and rows. Serializing a DataRelation means writing out parent and child column names plus a few more properties. The code shown below (and to a broader extent, the full source code of this article) demonstrates how to proceed. The sample code only saves a small amount of information about tables and relations. This is sufficient to restore a good enough DataSet but lacks pieces of information that might be crucial in other cases.

void System.Runtime.Serialization.ISerializable.GetObjectData(
	SerializationInfo si, StreamingContext context)
{
	foreach(DataTable t in Tables)
		SerializeTable(si, t);

	foreach(DataRelation r in Relations)
		SerializeRelation(si, r);

	// TO DO :: Serialize extended properties

	// TO DO :: Serialize individual properties
}

void SerializeTable(SerializationInfo si, DataTable t)
{
	ArrayList colNames = new ArrayList();
	ArrayList colTypes = new ArrayList();
	ArrayList dataRows = new ArrayList();

	// Insert column information into worker arrays
	foreach(DataColumn col in t.Columns)
	{
		colNames.Add(col.ColumnName);
		colTypes.Add(col.DataType.FullName);
	}

	// Insert rows information into a worker array
	foreach(DataRow row in t.Rows)
		dataRows.Add(row.ItemArray);

	// Pack for serialization
	object[] tableInfo = new object[3];
	tableInfo[0] = colNames;
	tableInfo[1] = colTypes;
	tableInfo[2] = dataRows;

	// Add to the serialization buffer
	si.AddValue("Tbl_" + t.TableName, tableInfo);
}
The code above creates a serialization entry for each table and relation and associates the entry with an array of data – typically, an array of arrays. The same data are reloaded in the protected constructor invoked by the .NET formatter’s Deserialize method.
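The reading side is not listed here, so the following is only a sketch of how the packed array might be turned back into a DataTable. It assumes the three-slot layout described above (column names, column type names, row values), resolves types with Type.GetType, and ignores constraints, keys, and row states. Pack and Unpack are hypothetical helper names, not the sample code's.

```csharp
using System;
using System.Collections;
using System.Data;

public static class TablePackingDemo
{
    // Same layout as the article's tableInfo: names, type names, row values.
    // Assumes the table holds no rows in the Deleted state.
    public static object[] Pack(DataTable t)
    {
        ArrayList colNames = new ArrayList();
        ArrayList colTypes = new ArrayList();
        ArrayList dataRows = new ArrayList();

        foreach (DataColumn col in t.Columns)
        {
            colNames.Add(col.ColumnName);
            colTypes.Add(col.DataType.FullName);
        }
        foreach (DataRow row in t.Rows)
            dataRows.Add(row.ItemArray);

        return new object[] { colNames, colTypes, dataRows };
    }

    // Rebuilds columns first, then rows, from the packed arrays.
    public static DataTable Unpack(string tableName, object[] tableInfo)
    {
        ArrayList colNames = (ArrayList)tableInfo[0];
        ArrayList colTypes = (ArrayList)tableInfo[1];
        ArrayList dataRows = (ArrayList)tableInfo[2];

        DataTable t = new DataTable(tableName);
        for (int i = 0; i < colNames.Count; i++)
            t.Columns.Add((string)colNames[i], Type.GetType((string)colTypes[i]));

        foreach (object[] values in dataRows)
            t.Rows.Add(values);

        t.AcceptChanges();   // loaded rows start out as unchanged
        return t;
    }

    public static void Main()
    {
        DataTable users = new DataTable("Users");
        users.Columns.Add("ID", typeof(int));
        users.Columns.Add("Name", typeof(string));
        users.Rows.Add(1, "Joe");
        users.Rows.Add(2, "Mary");

        DataTable copy = Unpack(users.TableName, Pack(users));
        foreach (DataRow row in copy.Rows)
            Console.WriteLine("{0} {1}", row["ID"], row["Name"]);
    }
}
```

Storing type names as strings keeps the packed data simple, at the price of a Type.GetType lookup per column when the table is rebuilt.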

The sample code includes a console application that writes disk files with the serialized DataSet. As you can see, using this simple (but incomplete) algorithm the amount of data transferred is about one-third that transferred with the Embedded XML API, the otherwise most efficient technique. Keep in mind, though, that the effective number of bytes (and therefore the percentage) doesn’t include the whole information around a DataSet. To improve the solution further, you can also consider using a compression algorithm. A compression algorithm alone, though, will alleviate the problem but can’t fix it.
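As an illustration of the compression step, the sketch below runs the XML output of a DataSet through GZipStream (a class that ships with later versions of the .NET Framework, not with 1.x). The table, the data, and the helper names are invented for the example, and the ratio obtained depends entirely on the data.

```csharp
using System;
using System.Data;
using System.IO;
using System.IO.Compression;

public static class CompressedXmlDemo
{
    // Synthetic, highly repetitive data; real data compresses differently.
    public static DataSet BuildSample(int rowCount)
    {
        DataSet ds = new DataSet("MyDataSet");
        DataTable users = ds.Tables.Add("Users");
        users.Columns.Add("ID", typeof(int));
        users.Columns.Add("Name", typeof(string));
        for (int i = 0; i < rowCount; i++)
            users.Rows.Add(i, "User" + i);
        ds.AcceptChanges();
        return ds;
    }

    // Normal form as raw bytes.
    public static byte[] ToXmlBytes(DataSet ds)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            ds.WriteXml(ms);
            return ms.ToArray();
        }
    }

    public static byte[] Compress(byte[] raw)
    {
        using (MemoryStream output = new MemoryStream())
        {
            // Disposing the GZipStream flushes the remaining compressed bytes.
            using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return output.ToArray();
        }
    }

    public static byte[] Decompress(byte[] packed)
    {
        using (MemoryStream input = new MemoryStream(packed))
        using (GZipStream gzip = new GZipStream(input, CompressionMode.Decompress))
        using (MemoryStream output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return output.ToArray();
        }
    }

    public static void Main()
    {
        byte[] raw = ToXmlBytes(BuildSample(2000));
        byte[] packed = Compress(raw);
        Console.WriteLine("Raw XML:    {0} bytes", raw.Length);
        Console.WriteLine("Compressed: {0} bytes", packed.Length);
    }
}
```

Compression shrinks what travels over the wire, but the DataSet still pays the cost of producing and parsing the XML on both ends.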

Dino Esposito is a trainer and consultant for Wintellect where he manages the ADO.NET class. His books include Programming ASP.NET, and he is a frequent speaker at industry events.
