LINQ to log files

This article was originally published in VSJ, which is now part of Developer Fusion.

Many people hear “LINQ” and think “databases”. This is not surprising given the emphasis Microsoft has placed on LINQ to SQL and Entity Framework, but I believe that LINQ to Objects has had an incredibly powerful impact too. It’s less immediately impressive – there’s less to go “ooh” and “aah” over – but data manipulation and filtering occurs in all layers of an application, rather than just at the application/database boundary.

One of the misconceptions about LINQ to Objects is that it means “working with in-memory collections”. While it’s true that the processing occurs in memory, that certainly doesn’t mean that all of your data has to be in memory at the same time. One of the key concepts in LINQ is that data is streamed wherever possible: it deals with sequences of data, filtering and transforming as it goes along. In this article I’ll explore a sample situation, processing log files, and show how LINQ can make your code more readable and more powerful than ever before.

Friday afternoon crunch: log file panic!

You may have noticed that examples in articles and on web sites tend to be a little contrived, and I’m not going to claim that this one is any different. After all, real-world architectures and problems tend to be much too complicated to describe and solve within a few pages. However, the principles applied here are relevant to many other situations, so bear in mind your current tasks (and any you’ve recently completed) as you read on, to see how LINQ to Objects could help you rather than a fictitious coder reading log files. That said, let’s step into the world of 5-o’clock-on-a-Friday-and-panic-is-setting-in.

Your boss has come to you on Friday afternoon with a big problem. Apparently a lot of your customers have been having issues with your servers during the week, but due to various miscommunications, the scale of the problem has only just been truly understood. The CEO is demanding that all the customers who have experienced difficulties – whether they’ve reported them or not – should be phoned immediately. Those who have had the most problems should be called first, if possible.

Fortunately, your servers all keep log files of any exceptions or other issues, and a quick look at the start of one of the logs shows that the errors have been recorded. Now all you’ve got to do is parse the files, work out which customers have had the most problems, and find all their phone numbers. The log files contain the customer IDs, but of course the contact details are in your database.

Normally, this would sound like a job for a script of some kind – but can C# 3 do any better?

Iterating over log entries

Let’s start off by working out how we’ll look through the logs at all. Each line is in the following format, tab-separated, and encapsulated in a LogEntry class which provides parsing and formatting facilities:

  • Date/time (UTC, in the format yyyy-MM-dd HH:mm:ss.fff, e.g. 2007-11-25 22:47:23.934)
  • Entry type (Error, Warning, Information, Trace, Audit, or Performance)
  • Customer ID (or blank when unknown)
  • Message (all on one line)

Some of the log files are really big – we certainly don’t want to read them straight into memory, not even one file at a time. Instead, we should be able to read a single line at a time. If we were doing this in C# 2, we might use a while loop and StreamReader.ReadLine(), but if we’re going to use LINQ, we’ll need something that implements IEnumerable<string>. Irritatingly enough, .NET 3.5 doesn’t actually provide that functionality out of the box, but thanks to the iterator blocks introduced in C# 2, it’s very easy to implement ourselves:

public sealed class LineReader : IEnumerable<string>
{
    private readonly string file;

    public LineReader(string file)
    {
        this.file = file;
    }

    public IEnumerator<string> GetEnumerator()
    {
        using (StreamReader reader = new StreamReader(file))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

At this point it’s very easy to iterate through all the lines in a single file, with or without LINQ. With LINQ, however, we can do better than that – we can easily construct a query which iterates through all the lines in a whole set of files, and converts them on the fly into a strongly typed log entry:

var query = from file in Directory.GetFiles(logDirectory, "*.log")
            from line in new LineReader(file)
            select new LogEntry(line);

If you haven’t seen the var syntax before, you might worry that C# 3 has become weakly typed – but no, it’s just the compiler being smart. We could have declared the query variable as being of type IEnumerable<LogEntry> and the compiled code would have been exactly the same.

Another thing which might catch you out if you haven’t used LINQ before is that the above statement does almost nothing when it’s executed. The only bit of actual processing it will do is make a call to Directory.GetFiles() – it doesn’t actually open any of the files, or read any lines, or parse them. This is called deferred execution – it’s really neat, but it can take a little while to get your head round. You can think of the query expression as being a bit like assembling lots of water pipes and hoses – once you’ve joined everything together, it’s all ready to go, but nothing’s going to flow through it until you turn the tap on.
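A tiny self-contained sketch (illustrative names only, not from the article’s codebase) makes that tap-turning visible. The source records each element as it is pulled, and nothing is recorded until the query is iterated:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DeferredDemo
{
    static List<string> pulled = new List<string>();

    // Iterator that records each element as it is yielded
    static IEnumerable<string> Lines()
    {
        foreach (string line in new[] { "a", "b", "c" })
        {
            pulled.Add(line);
            yield return line;
        }
    }

    // Returns how many source elements had been pulled before and
    // after iterating the query
    public static int[] Run()
    {
        pulled.Clear();
        var query = from line in Lines()
                    select line.ToUpper();

        int before = pulled.Count;   // still 0: the query has only been built

        foreach (string s in query) { }

        int after = pulled.Count;    // 3: the source was enumerated during foreach
        return new[] { before, after };
    }

    static void Main()
    {
        int[] counts = DeferredDemo.Run();
        Console.WriteLine("before iteration: {0}, after: {1}",
                          counts[0], counts[1]);
    }
}
```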

So, when does the data actually get loaded? It’s only when you start to iterate through query. For instance, to just show every log line, we could use:

foreach (LogEntry entry in query)
{
    Console.WriteLine(entry);
}

Even while we iterate through the entries, only a single line is loaded at a time, and only a single file is ever open at a time. Not all the LINQ Standard Query Operators are able to do this, however. For instance, if we had included an orderby clause in the query, all of our log files would have to be parsed and kept in memory for sorting before any results could be returned – that’s just inherent in the nature of ordering a sequence. When you’re dealing with a lot of data, it’s important to know what will happen in your data pipeline: MSDN is pretty good at explaining what each operator will do.
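The contrast between streaming and buffering operators is easy to demonstrate with a counting source (a hypothetical sketch, not part of the log-file code): a where clause followed by First() pulls only as many items as it needs, while an ordering operator must consume the entire source before it can yield anything.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class StreamingDemo
{
    static int itemsRead;

    // Source that counts how many items have been pulled from it
    static IEnumerable<int> Numbers()
    {
        for (int i = 1; i <= 1000; i++)
        {
            itemsRead++;
            yield return i;
        }
    }

    // Returns how many items each approach pulled from the source
    public static int[] Run()
    {
        itemsRead = 0;
        // Where streams: First() stops pulling at the first match
        Numbers().Where(n => n % 2 == 0).First();
        int streamed = itemsRead;

        itemsRead = 0;
        // Ordering buffers: the whole source is read before any result
        Numbers().OrderByDescending(n => n).First();
        int buffered = itemsRead;

        return new[] { streamed, buffered };
    }

    static void Main()
    {
        int[] counts = StreamingDemo.Run();
        Console.WriteLine("where+First read {0} items; orderby+First read {1}",
                          counts[0], counts[1]);
    }
}
```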

Errors and customers

Now, showing every log line wasn’t what we originally wanted. We only want to find the errors – and we want to find out more information about the customers who experienced them. The first part is very easy. Filtering in LINQ is done with the where clause, but we want to filter on a property of the current result – the LogEntry object. We need to do the parsing before we do the filtering. There are various ways we could do it, but the easiest is probably with a let clause to introduce an extra range variable. Here’s the query expression with the filtering in place:

var query = from file in Directory.GetFiles(logDirectory, "*.log")
            from line in new LineReader(file)
            let entry = new LogEntry(line)
            where entry.Type == LogEntryType.Error
            select entry;

It’s slightly harder to get the customer information. That’s stored in the database, and we really don’t want to do a query for each log line. There are a few options here. The simplest is to just load all of the customer information from the database once, and look it up in memory for each log line. This will obviously not work terribly well if you’ve got more customers than you can fit into memory, or if you believe there are only a few customers affected but fetching the data for all customers will take a long time. However, in the common case where you’ve only got a few thousand customers – or even a few tens of thousands – it’s really not too bad just to load the lot into memory.

For the sake of keeping this article simple, let’s assume we’re in the pleasant situation of being able to load all the customer details. In the sample code, I’ve used the trusty Northwind database as the data source. You can use whatever data access method you like to access the database, so long as the result is enumerable and you can get at the customer ID easily – I’ve used LINQ to SQL, partly because doing so takes about two minutes with the designer.

Working out which customer is associated with each log entry is then just a matter of joining our two data sources together – but you must do it in the right way, to avoid reading all the log data into memory. Here’s the correct query, where db is a reference to a NorthwindDataContext which can give us the customer data. After we’ve seen how to do the right thing, we’ll think about the consequences of getting it wrong.

var query = from file in Directory.GetFiles(logDirectory, "*.log")
            from line in new LineReader(file)
            let entry = new LogEntry(line)
            where entry.Type == LogEntryType.Error
            join customer in db.Customers
                on entry.CustomerID equals customer.CustomerID
            select new { Entry = entry, Customer = customer };

The important choice to make whenever you do a join is to decide which data source should be the primary one, and which should be the secondary. The primary source is the one which leads into the join – the log entries, in this case. The secondary data source is the one introduced by the join clause – the customer records here. On the part of the join clause which defines how records are matched (the x equals y part) you can only use the primary data source to the left of equals, and only use the secondary data source to the right of it. If you get it the wrong way round you’ll get errors saying that the relevant variables aren’t in scope.

Even when you’ve got a query which compiles and gives the right data, it may not behave as you want it to. The two data sources in the join are treated in very different ways. The secondary data source is enumerated in its entirety to build a lookup table, and then the primary data source is streamed, with the join returning results as it goes. That’s why I’ve used the log entries as the primary source: I’ve assumed that there could be millions of log entries, but only thousands of customers. We don’t want to keep all the log records in memory – we just want to process them one at a time. You may have spotted the limitation here: you can’t easily join two data sources which are both too large to fit into memory. We’ve already assumed that we can load all the customers, of course.
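That asymmetry can be seen concretely with a self-contained sketch (all names illustrative), using two counting iterators: pulling just the first result of a join reads the secondary source in full, but the primary source only as far as the first match.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class JoinDemo
{
    static int outerRead, innerRead;

    // Primary (outer) source: counts how far it has been read
    static IEnumerable<int> Outer()
    {
        for (int i = 0; i < 100; i++) { outerRead++; yield return i; }
    }

    // Secondary (inner) source, introduced by the join clause
    static IEnumerable<int> Inner()
    {
        for (int i = 0; i < 10; i++) { innerRead++; yield return i; }
    }

    // Returns how many elements each source yielded after pulling
    // a single join result
    public static int[] Run()
    {
        outerRead = 0;
        innerRead = 0;

        var query = from o in Outer()
                    join i in Inner() on o equals i
                    select o + i;

        query.First();  // pull just the first result

        // Inner was read in full to build the lookup table;
        // outer only as far as the first match
        return new[] { outerRead, innerRead };
    }

    static void Main()
    {
        int[] counts = JoinDemo.Run();
        Console.WriteLine("outer read: {0}, inner read: {1}",
                          counts[0], counts[1]);
    }
}
```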

Iterating through the entries now shows you only the errors, and gives associated customer information. Incidentally, due to the nature of the join we’ve also discarded any errors which didn’t have a customer ID – whether that is a problem or not would depend on the real life situation, but it’s beyond the scope of this article.

In order to complete our task, we have to group the errors by customer and then order the groups, so that we can contact the customers who have had the most problems first.

Ordering by group size: LINQ lets us down

You may be surprised to hear that LINQ to Objects doesn’t come with anything that can meet our needs immediately. The following query expression will work, but see if you can spot the problem.

var query = from file in Directory.GetFiles(logDirectory, "*.log")
            from line in new LineReader(file)
            let entry = new LogEntry(line)
            where entry.Type == LogEntryType.Error
            join customer in db.Customers
                on entry.CustomerID equals customer.CustomerID
            group entry by customer into entriesByCustomer
            let count = entriesByCustomer.Count()
            orderby count descending
            select new { Customer = entriesByCustomer.Key, Count = count };

The first problem is that it’s a bit of a monster of a query! If you’re not used to LINQ, you could easily be frightened off – but take it one step at a time, and it’s not too bad. We’ve already seen as far as the join clause, so the next step is the group clause. Here we’re grouping the entries by customer. The result is a sequence of groups: each group has a key (the customer) and a subsequence (the log entries relating to that customer).

For each of those groups, we want to count the number of entries, which is what the let clause does. We can then order by that count, and make the final result a sequence of entries containing just the count and the customer. Phew! With this query, here’s some code to print out the results our sales reps need in order to contact the customers appropriately:

foreach (var result in query)
{
    Console.WriteLine("{0}: {1} {2}",
                      result.Count,
                      result.Customer.CompanyName,
                      result.Customer.Phone);
}

Okay, so now we know how it works, what’s the next problem? Well, it’s far from efficient. In particular, having streamed the data for quite a long time, we’re now building it all up in the group clause. We could make things slightly better by using a different group clause to discard the actual log entry data – something like this:

group 1 by customer into entriesByCustomer

Each group is then just a sequence of 1s – but you’d still end up with some memory taken up for every log entry, which is what we’ve been trying to avoid. We really want to do a “group and count” in one go, because all we need is the count. Feeling adventurous? Let’s build the functionality we need, in such a way that we’ll be able to use it in the future too.

Extending LINQ

LINQ isn’t magic – it’s clever, but the brilliance is really in the design more than the implementation, at least as far as LINQ to Objects is concerned. That’s good, because it means it’s really easy to extend when it doesn’t quite meet our needs. We want to be able to group whatever sequence we’ve currently got based on some key, and return a sequence of entries each of which has a key and a count. If we write that as an extension method on IEnumerable<T>, we can use it almost seamlessly. (We obviously won’t have built-in language support, but it’ll be as much a first-class LINQ citizen as query operators like DefaultIfEmpty, Reverse and Except.)

The first thing to do is work out what the method should look like. We know we’ll need to return key/value pairs, where the value is the count – so we might as well use the KeyValuePair<TKey,TValue> type already defined in the framework. We know that we’ll need to operate on a sequence of data – an IEnumerable<T> – and also have some way of transforming an element into a key, for which we’ll use the Func<T, TResult> delegate. The method will involve two type parameters – one for the type of the input element, and one for the type of the key. Putting all this together and making it an extension method by putting “this” on the first parameter gives the slightly fearsome signature below:

public static IEnumerable<KeyValuePair<TKey, int>>
    GroupAndCount<TElement, TKey>(
        this IEnumerable<TElement> source,
        Func<TElement, TKey> mapping)

I used to really hate method signatures like this, and it still takes me a little while to understand them. It’s a skill which definitely grows with practice, however – and if you take each part of it one bit at a time, it’s not as hard to understand as it first appears.

The implementation of the method is actually fairly easy once we know the signature, particularly due to our use of KeyValuePair<TKey,TValue>. After all, Dictionary<TKey,TValue> implements IEnumerable<KeyValuePair<TKey,TValue>>, so if we can just make a dictionary with the right entries in, we’re done! All we need to do is start with an empty dictionary, and then look at each of the elements of the input data one at a time. We transform each element into a key, and see whether we’ve already got that key in our dictionary. If we have, we increase its value – if not, we just start with a value of 1. Then we can return the dictionary and let it handle enumerating through the entries. Here’s the full code, ready to be placed in a static class:

public static IEnumerable<KeyValuePair<TKey, int>>
    GroupAndCount<TElement, TKey>(
        this IEnumerable<TElement> source,
        Func<TElement, TKey> mapping)
{
    Dictionary<TKey, int> dictionary = new Dictionary<TKey, int>();
    foreach (TElement element in source)
    {
        TKey key = mapping(element);
        if (dictionary.ContainsKey(key))
        {
            dictionary[key]++;
        }
        else
        {
            dictionary[key] = 1;
        }
    }
    return dictionary;
}

You can imagine this could be useful in a number of situations. Let’s apply it to our current problem and see what the result looks like:

var entries = from file in Directory.GetFiles(logDirectory, "*.log")
              from line in new LineReader(file)
              let entry = new LogEntry(line)
              where entry.Type == LogEntryType.Error
              join customer in db.Customers
                  on entry.CustomerID equals customer.CustomerID
              select new { Entry = entry, Customer = customer };

var customerCounts = entries.GroupAndCount(pair => pair.Customer)
                            .OrderByDescending(pair => pair.Value);

foreach (var result in customerCounts)
{
    Console.WriteLine("{0}: {1} {2}",
                      result.Value,
                      result.Key.CompanyName,
                      result.Key.Phone);
}

We could have done the whole query in one statement, but I personally prefer to have query expressions on their own, and then manipulate the results in a separate statement. Due to the deferred execution and streaming nature of LINQ, it makes no difference in execution terms. Likewise we could have used a query expression to perform the ordering of the results – but there seems little point in this case.

One slight point of ugliness is that KeyValuePair<TKey,TValue> has properties of Key and Value where ideally we’d like Key and Count to make it more readable. Obviously that’s achievable – it just means a bit more work in the extension method.
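That extra work might look something like the sketch below. KeyCount and GroupingExtensions are illustrative names, not framework types: a small result type exposing Key and Count, and a variant of GroupAndCount that yields it from the same dictionary-based counting.

```csharp
using System;
using System.Collections.Generic;

// Small result type with a Count property, instead of the
// general-purpose Value of KeyValuePair (illustrative name)
public class KeyCount<TKey>
{
    public TKey Key { get; private set; }
    public int Count { get; private set; }

    public KeyCount(TKey key, int count)
    {
        Key = key;
        Count = count;
    }
}

public static class GroupingExtensions
{
    public static IEnumerable<KeyCount<TKey>>
        GroupAndCount<TElement, TKey>(
            this IEnumerable<TElement> source,
            Func<TElement, TKey> mapping)
    {
        // Same dictionary-based counting as before
        Dictionary<TKey, int> counts = new Dictionary<TKey, int>();
        foreach (TElement element in source)
        {
            TKey key = mapping(element);
            int soFar;
            counts.TryGetValue(key, out soFar);
            counts[key] = soFar + 1;
        }
        // Wrap each entry in the friendlier result type
        foreach (KeyValuePair<TKey, int> pair in counts)
        {
            yield return new KeyCount<TKey>(pair.Key, pair.Value);
        }
    }
}
```

With this in place, the reporting loop could say result.Count instead of result.Value.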

Conclusion

In the end, we’ve got what we wanted – an efficient solution to the problem which is much more readable than any C# alternative would have been. Scripting fans may well have preferred other solutions, but LINQ to Objects has kept us within our comfortable world of C# and still delivered a neat answer.

We’ve seen how LINQ doesn’t always give you all the tools you need straight out of the box – but it’s easy to extend it as and when you need to, and using generics wisely allows those extensions to become part of your own toolkit for reuse in other situations.

Some of our source data came from a database, and some of it came from log files – LINQ really didn’t care. So next time you hear “LINQ”, don’t think “databases” but “any data, any time!”

About the author

Jon Skeet United Kingdom

C# MVP currently living in Reading and working for Google.
