Deep LINQ

This article was originally published in VSJ, which is now part of Developer Fusion.
LINQ may look like SQL but this just a convenience so that LINQ to SQL can be used by people too lazy to learn new syntax. You need to think of LINQ as an attempt to bring “querying” type operations into the mainstream of the .NET language of your choice. The point is that LINQ isn’t really about database, this just happens to be a really good use of it. It’s a general purpose and extensible approach to working with structured data. In a more general setting you can think of it as yet another aspect of functional programming that is making its way into .NET. In this article I’m going to demonstrate how LINQ works at the deepest level. When you understand how LINQ is implemented you will be better able to use it, how to extend it and you can’t help but admire its simplicity.

A Query is basically an operation where you specify a subset of the data that you actually want to work with. In many ways you can think of “querying” as the point where software creation becomes complicated and the real world enters the design problem. Put simply – data is messy, usually not organised in a way that suits and the task is always more difficult and fragile than you could possibly imagine. LINQ can’t do anything about the inherent complexity of querying data, but it does deliver it in a uniform and integrated format.

The key idea is that LINQ is a set of classes and methods that will work with any class as a data source as long as it implements the IEnumerable<T> interface. As, in a sense, this is the foundation on which the rest of LINQ is built, this is a good place to start. In fact it is probably a good idea to take one step further back and look at the whole idea of enumerators.

Implementing an enumerator

The basic idea of an enumerator is to supply each item in a collection of data items one-by-one – and usually in no specified order. In .NET an enumerator is a class that provides a number of methods in addition to the basic one-by-one enumeration of the items. To be specific an enumerator has to supply:
  • Reset – initialises the enumeration so that it starts over again
  • Current – returns the current item
  • MoveNext – updates the index to the next item
Of course this all implies that there is a numeric index to the current item which starts off set to –1 to indicate that it “points” before the start of the collection. Calling Current repeatedly is allowed, but if the index is invalid then you are supposed to throw an exception. MoveNext returns true if the result is a valid index and false if the resulting index isn’t pointing to a valid item. Put together these three methods make up the IEnumerator interface and any class that supports enumeration does so by implementing this interface.

You don’t have to use a separate class to implement IEnumerator, you can do the job in the same class that implements the inner workings of the data collection if that’s convenient. It is more usual to create the enumerator as a separate class and write a constructor that creates an instance of the enumerator ready to be use. To see the simplest possible example of an enumerator, let’s create everything in a single class. Our example data collection is also going to be a little strange in that no collection of data ever exists. Instead when the collection is instantiated the constructor is supplied with the size of the collection and from then on a random number generator is used each time a data item is required. This is clearly not very useful for anything other than testing hence the name of the class:

class TestCollection:IEnumerator
Generating random data also has the advantage that the example doesn’t use any of the existing collection data types which all supply enumerators and hence tend to confuse the issue. TestCollection doesn’t make use of anything prebuilt in the .NET framework to implement its enumerator. To make it work you need to add:
using System.Collections;
To get us started we need some private variables, something to hold the instance of the random number generator, something to store the size of the collection and, of course an index to the current position in the collection:
{
	private Random rnd;
	private int Size;
	private int loc = -1;
We need the constructor to set everything up ready for the collection to be enumerated:
public testcol(int s)
{
	rnd = new Random();
	Size = s;
}
Now we have to implement the methods of the IEnumerator interface. The reset method is simply:
void IEnumerator.Reset()
{
	this.loc = -1;
}
You don’t really need the “this” but it helps to emphasise the fact that the enumerator works with the instance.

The Current method takes the form of a read only property:

object IEnumerator.Current
{
	get
	{
		if (this.loc > -1)
			return this.rnd.Next(500);
		else
			return -1;
		}
}
Notice that Current returns an object rather than an int. This is how it has to be as the interface defined Current in this way? More about this problem later. In principle we should test to make sure loc is sensible, i.e. actually indexes an element of the collection and throws an exception if it doesn’t. In this case we simply return –1. In most cases whatever is using the enumerator usually stops enumeration when the MoveNext method returns false to indicate that there is no next item:
bool IEnumerator.MoveNext()
{
	loc++;
	if (loc < this.Size) return true;
	else return false;
}
Now we have the complete enumerator and it’s very easy to see how it works, but we can’t as yet make use of it. The reason is that any class offering an enumerator has to implement the IEnumerable interface. This has just a single method GetEnumerator which returns the instance of the class that provides the methods of the IEnumerator. If you think about it for a moment it is obvious that IEnumerable has to be implemented by the data collection class that holds the items to be enumerated so we have to add it to the class definition:
class TestCollection:IEnumerator,IEnumerable
{
The single method that we have to implement is trivial as the instance of the TestCollection class concerned provides its own enumerators, i.e. it is the enumerator to be returned:
	IEnumerator IEnumerable.GetEnumerator()
	{
		return this;
	}
}
Now we can write some code that makes use of our data collection with enumeration. The simplest thing to try out is a foreach loop:
TestCollection col = new TestCollection(5);
foreach (object o in col)
{
	MessageBox.Show(o.ToString());
}
Notice that the iterator o has to be defined as an object. We can use an int in the foreach loop as it supports implicit casting. For example:
foreach (int o in col)
{
	MessageBox.Show(o.ToString());
}
…works perfectly unless you happen to return something that can’t be cast to an int at runtime with the result that an exception is raised. If you want to do better then you need to change the enumeration interfaces to their generic forms, which is what we have to do anyway to use LINQ.

Separating the enumerator

Before moving on it is worth commenting on why we usually implement the enumerator as a separate class. Consider what happens if we try to use the current enumerator in a nested foreach loop. The same enumerator would be returned by GetEnumerator and hence the nested loops would interfere with each other. If you want to write nested loops or allow multiple enumerations to happen concurrently you need to implement the enumerator as a separate class and you need to create an instance of that class each time an enumerator is called for by GetEnumerator. As nesting of queries is very common we need to do the job properly before moving on to consider LINQ.

All we need to do is separate out the enumeration methods into a new class. This class has to keep track of where it is in the enumeration and it needs to keep track of which instance of the collection it is enumerating:

class TestCollectionEnnumerator :
	IEnumerator
{
	private TestCollection m_Instance;
	private int loc = -1;
If you create the enumeration class as an inner class of the data collection, i.e. TestCollection, it will have access to all of the data collection’s variables and methods and this makes it easier for the two classes to work together. The only new method we need is a constructor that initialises the enumerator with the current instance of the data collection:
public TestCollectionEnnumerator(
	TestCollection Instance)
{
	m_Instance = Instance;
}
The other enumeration methods have to now use m_Instance instead of “this” to make sure they work with the correct instance of the data:
void IEnumerator.Reset()
{
	loc = -1;
}

object IEnumerator.Current
{
	get
	{
		if (loc > -1)
			return m_Instance.rnd.Next(500);
		else return -1;
	}
}

bool IEnumerator.MoveNext()
{
	loc++;
	if (loc < m_Instance.Size)
		return true;
	else return false;
}
We also need to change the data collection class so that its GetEnumerator actually creates a new instance of the enumerator:
class TestCollection:IEnumerable
{
	private Random rnd;
	private int Size;
	IEnumerator
		IEnumerable.GetEnumerator()
	{
		return new
			TestCollectionEnnumerator(this);
}
Notice how the use of “this” makes the final connection between the instance and any number of enumerators. Now our example code is slightly more complicated but you can use it within nested for each loops and, as we shall see, nested LINQ queries.

Generic enumeration

It is generally better to use a generic form of the interfaces so add to the start of the program:
using System.Collections.Generic;
The generic form of the IEnumerator interface inherits from the non-generic IEnumerator interface. Note that when an interface inherits from another interface, for example IA:IB then when you add IA to a class it’s exactly the same as writing :IA,IB and you have to implement the methods of IA and IB. The generic form of the IEnumerator interface defines a single generic version of Current – after all this is the only method that needs to use the data type. To complicate things a little it also inherits the Dispose method from IDisposable, but we can ignore this at the moment by adding a null implementation. The generic form of IEnumerable interface simply adds a generic form of GetEnumerator.

So to make our class use the generic Interfaces all we have to do is change its definition to:

class TestCollection:IEnumerable<int>
…and add to it a second generic GetEnumerator method:
IEnumerator<int>
	IEnumerable<int>.GetEnumerator()
{
	return new
		TestCollectionEnnumerator(this);
}
The enumerator class also has to implement the generic interface and has to have two new methods added to it:
class TestCollectionEnnumerator :
	IEnumerator<int>
{
// ...rest of class definition
New methods:
int IEnumerator<int>.Current
{
	get
	{
		if (loc > -1)
			return m_Instance.rnd.Next(500);
		else
			return -1;
	}
}
void IDisposable.Dispose()
{
}
Notice that you have to keep the existing non-generic implementations and that the new generic methods are almost identical apart from the use of the int data type. Now you can write:
foreach (int o in col)
{
	MessageBox.Show(o.ToString());
}
…and if you try to cast to something that is unreasonable, e.g. string, then you will see a compile time rather than runtime error.

LINQ and extension methods

Now we have a simple generic IEnumerable class we can start to use LINQ with it – yes it really is this simple. What the LINQ system does it to add extension methods to the IEnumerable generic interface – I’ll come back to what an extension method is after we have seen them in action. For the moment just accept the fact that there are a large number of additional methods available to any class that implemented the IEnumerable generic interface, and most of these return an IEnumerable object – which is a more important observation than you might think.

To get started let’s look at the Where extension method. If you look at its definition you will discover that it’s:

public static IEnumerable<T> Where<T>(
	this IEnumerable<T> source,
	Func<T, bool> predicate
)
Ignore the “this” for the moment because it’s part of every extension method. The first parameter, source, is an IEnumerable that will be “scanned” for all of the entities that the predicate returns true for. The predicate is just a generic delegate that “wraps” a function that accepts a single input parameter of the first specified type and returns a result of the second specified type. In this case predicate can be seen to accept a single parameter of type T and return a Boolean result. If we are going to use Where the first thing we need is a suitable predicate. This can be created in many ways, but to keep the explanation simple let’s do it the old-fashioned but rather long-winded way of first defining a suitable function and then wrapping it in a delegate. First define the function:
bool MyTest(int i)
{
	return i > 250;
}
This simply tests to see if the value is greater than 25 and returns true if so and false otherwise. To make use of this we have to first wrap it in a delegate:
Func<int, bool> MyDelegate = new Func<int, bool>(MyTest);
Next we create the data collection as before:
TestCollection col = new TestCollection(5);
And finally use the Where method with the delegate we have defined:
IEnumerable<int> q = col.Where<int>(MyDelegate);
If you try this out nothing happens. No really nothing happens… the operation of enumerating and extracting the entities smaller than 250 is only performed when it is required. In practice it is triggered by calling the generic GetEnumerator method, usually within a foreach loop. For example to see the results of the query:
foreach (int o in q)
{
	MessageBox.Show(o.ToString());
}
To emphasise – the results of the query are only stored in q when we start the foreach loop.

Of course there are many facilities in C# that can make using Where much simpler, but this is LINQ in the raw, and it illustrates exactly how it all works. I think it also makes it clear how clever the idea is and demonstrates that LINQ isn’t just SQL added to C#.

There are lots of other extension methods and as they all extend IEnumerator and as they all return an object of type IEnumerator you can simply chain them together. For example, to apply a second Where condition to the same query you would write something like:

IEnumerable<int> q = col.
	Where<int>(MyDelegate1).
	Where<int>(MyDelegate2);
If you lookup the range of extension methods provided you will see that it’s very easy to build up complicated queries, and it’s also easy to add your own extension methods to make LINQ capable of whatever query you want it to.

Syntactic sugar

So far the LINQ query example doesn’t look a great deal like any other example you will find and the reason is that it doesn’t use any of the syntactic sugar and other facilities introduced to make LINQ look more like SQL and to make it generally easier to use. So the time has come to package it all up.

The first simplification is that we can use a lambda expression to create the predicate method. This makes it possible to do away with the separate function and write something that looks like a condition as a parameter to most of the extension methods. If you recall a lambda expression is a function, without a name, that returns a single value and can be assigned directly to a delegate. For example:

Func<int, bool> MyDelegate2 = i => i > 250;
…creates a delegate that can be used in the call to Where just as in the previous example. However, as the Where method has a parameter of type Func(int,bool), we can do the job directly without the need to create an explicit delegate:
IEnumerable<int> q = col.Where<int>(i => i > 250);
With this simple change the query is beginning to look a lot more like SQL. However we can go one better. It’s up to each .NET language to choose to provide a linguistic wrapper for the LINQ system and they can do it however they like. In practice there is a lot of sense in using something close to a dialect of SQL. C# and VB both wrap LINQ with SQL-like instructions. For example, our simple query can be written:
IEnumerable<int> q = from i in col
	where i > 250
	select i;
This is compiled to the same set of method calls given earlier and is completely equivalent. You can even make it slightly simpler by allowing the compiler to work out the type of q from the return result of the method calls, i.e. use an anonymous type:
var q = from i in col where i > 250
	select i;
Now you can see why these new features were introduced to C#.

In general a LINQ query expression always starts with:

from variable in IEnumerable object
This specifies the object to be used in the query and the parameter to be passed to all of the methods, the “range” variable. Then there’s a wide choice of possible clauses which correspond to each of the extension methods and when used cause them to be called with the parameters and predicates specified. Finally every LINQ query expression has to end with a Select. The Select is easy to use but often not so easy to understand how it is implemented. The idea is that Select performs a projection or transformation on the basic data item used in the query – that is it takes a data type and projects it to a “smaller” subtype.

Custom select

The nature of the Select depends on the nature of the data item. For example, if the item takes the form of a row of a table then a select can pick out individual columns. If the data item is an array then it can select individual elements. You can see that selecting part of the data item could be a complex operation. It helps greatly to know how LINQ works in terms of extension methods in understanding what it does and how to make use of it. For example the definition of one overlay of the Select extension method is:
public static IEnumerable<TResult>
	Select<TSource, TResult>(
		this IEnumerable<TSource> source,
		Func<TSource, TResult> selector
)
You can now clearly see that you have to specify a source type and a result type and a function that does the conversion between the two types. When you write a C# query expression using Select all you do is specify the transformation function and let the compiler work out the types of the source and result data. For example,
var q = from i in col where i > 250
	select i*2;
The final select now creates a function:
Func<int, int> MySelect=i->i*2;
…and passes, as a lambda expression, in the call to the Select method which is:
var q = col.
	Where<int>(i => i > 250).
	Select<int,int>(i=>i*2) ;
A little thought reveals that this mechanism is more sophisticated than it looks. For example, you can call methods defined in the source data type to perform more complicated data transformation. For example,
var q2 = col.
	Where<int>(i => i > 250).
	Select<int,string>(i=>i.ToString());
foreach (string o in q2)
{
	MessageBox.Show(o);
}

More Selection

Understanding how the query expression is converted into calls to the query extension methods greatly clarifies how LINQ works and what you can do with it. If you want to stay with a clear design you should always use Select as a “projection operator”, that is it should reduce the data item to something that is a subset of itself. For example, if the data item is a struct then the select should specify which fields are to be extracted and returned in a “smaller” struct. This is a common and interesting task so let’s look at another simple example of select.

First we need a class or struct to hold the data:

public struct Contact
{
	public string name { get; set; }
	public string address1 { get; set; }
	public int phone { get; set; }
}
The default property implementations allow us to initialise a List of such structs directly:
List<Contact> AddressBook=new
	List<Contact>(){
	new Contact(){
		name=”mike”,
		address1=”Anywhere1”,
		phone=123},
	new Contact(){
		name=”nick”,
		address1=”Anywhere2”,
		phone=124},
	new Contact(){
		name=”john”,
		address1=”Anywhere3”,
		phone=125}
};
This is another useful addition to C# 3.0. As List supports IEnumerator we can now move immediately to using LINQ to query it. To extract a single field, the name say, you would use a query something like:
var q = from N in AddressBook
	select N.name;
foreach ( var n in q)
{
	MessageBox.Show(n);
}
Notice the use of var to avoid having to state the data type. If you don’t want to use an anonymous type then you can replace the var with string. It is easy to see how this translates to the call to the select extension method which has to return just the single field as a string but how do you return multiple fields? The problem is that you don’t have a data type, i.e. a struct, that holds a subset of fields. You could define a suitable struct, but the standard solution is to create a new anonymous type on the fly:
var q = from N in AddressBook
	select new { N.name, N.phone };
foreach ( var n in q)
{
	MessageBox.Show(n.name +
		n.phone.ToString());
}
You can see that the new facilities in C# 3.0 really are required to make LINQ look simple.

Other LINQs

Basically we have been looking at LINQ to objects, but the same principles apply to all the other implementations – LINQ to SQL, XML and ADO for example. You should now be in a position to see how these work even if you still need to dig out the details. What is even more impressive is that if you have a data source that LINQ doesn’t support you should be able to add it by implementing IEnnumerable.


Dr. Mike James’ programming career has spanned many languages, starting with Fortran. The author of Foundations of Programming, he has always been interested in the latest developments and the synergy between different languages.

You might also like...

Comments

Contribute

Why not write for us? Or you could submit an event or a user group in your area. Alternatively just tell us what you think!

Our tools

We've got automatic conversion tools to convert C# to VB.NET, VB.NET to C#. Also you can compress javascript and compress css and generate sql connection strings.

“Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.” - Antoine de Saint Exupéry