Using LINQ’s ToLookup to replace dictionary creation code

I often find myself having to build a dictionary from a list of items, with each entry in the dictionary containing a subset of the list. For example, given a list of people and the shifts they’ve worked, I need to group them by shift so that all the people who worked the morning shift are grouped together. I can then operate on those sublists separately.

Rather than illustrating with a list of employees and shifts they’ve worked, imagine you have a list of words, and you want to group them by their first letter. So given the words (ant, aardvark, baboon, giraffe, tortoise, gorilla turtle), you’d end up with a list of lists that looks something like this:

A – ant, aardvark
B – baboon
G – giraffe, gorilla
T – tortoise, turtle

You could then query the data structure to get a list of all the words that begin with a particular letter.

The traditional way to do this in C# would be to create a Dictionary<char, List<string>>, and populate it as you enumerate the words list. For example:

    private Dictionary<char, List<string>> BuildWordsByFirstChar(IEnumerable<string> words)
    {
        var wordsByChar = new Dictionary<char, List<string>>();
        foreach (var word in words)
        {
            List<string> l;
            var key = word[0];
            if (!wordsByChar.TryGetValue(key, out l))
            {
                // Key doesn't exist in dictionary.
                // Make a new entry with an empty list.
                l = new List<string>();
                wordsByChar.Add(key, l);
            }
            l.Add(word);
        }
        return wordsByChar;
    }

You can then query the resulting dictionary data structure to get the list of words that start with a particular character, or you can iterate over the dictionary keys. For example, this code will display all of the keys and their associated words:

    var wordsByChar = BuildWordsByFirstChar(_words);
    foreach (var c in wordsByChar.OrderBy(kvp => kvp.Key))
    {
        Console.Write(c.Key + ": ");
        foreach (var w in c.Value)
        {
            Console.Write(w + ",");
        }
        Console.WriteLine();
    }

I’ve written such dictionary building code countless times in the past. It’s tedious. It’s also unnecessary because you can do the same thing with LINQ in a single line of code:

    var wbcLookup = _words.ToLookup(s => s[0]);

That creates a Lookup<char, string>, which is very similar to the Dictionary<char, List<string>> that the other code creates. For example, code to output the contents is almost identical:

    foreach (var c in wbcLookup.OrderBy(g => g.Key))
    {
        Console.Write(c.Key + ": ");
        foreach (var w in c)
        {
            Console.Write(w + ",");
        }
        Console.WriteLine();
    }

There are differences, however. Whereas attempting to access a non-existent key in the dictionary will result in a KeyNotFoundException, the Lookup will return an empty list. In addition, the value returned by the dictionary is a List<string> whereas the value returned by the lookup is of type IEnumerable<string>.

Lookup, by the way, is a convenient wrapper around IEnumerable<IGrouping<TKey, TElement>>. You can get much of the above functionality without using ToLookup at all, but rather by just grouping the elements like this:

    var groupedWords = _words.GroupBy(s => s[0]);

That gives you a grouping, but you can’t access the individual elements by key as you can with Lookup. If you really need to create a Dictionary<char, List<string>>, you can build it from the grouping, like this:

    var wbcDict = groupedWords.ToDictionary(c => c.Key, c => c.ToList());

Or, if you have a  Lookup:

    var wbcDict = wbcLookup.ToDictionary(c => c.Key, c => c.ToList());

But I’ve found that in most cases a Lookup will do and there’s no need to create a Dictionary.

The primary difference between ToDictionary and ToLookup is that ToDictionary is designed to create a one-to-one mapping from a list, and ToLookup creates a one-to-many mapping.

The primary point here is that with ToLookup, I can replace the dozen or more lines of procedural dictionary creation code with a single line of code that more clearly states the intention. And if I absolutely need a Dictionary, I can just call ToDictionary on the result.

Looks like a win to me.

More fun with LINQ: SelectMany

A coworker came to me with a little coding problem yesterday, and my investigation of the problem revealed a few things about C#, the .NET Framework, and LINQ. It all started with an email in which he decried the “ickiness” of this code.

    private IEnumerable<long> GetTimeWorkedAndBreakWorkTypes(IEnumerable<Row> rows)
    {
        var workTypes = new HashSet<long>();
        foreach (var row in rows)
        {
            if (!workTypes.Contains(row.WorkTypeKey))
                workTypes.Add(row.WorkTypeKey);
            if (row.IsBreaksLimited && !workTypes.Contains(row.BreakWorkTypeKey))
                workTypes.Add(row.BreakWorkTypeKey);
        }
        return workTypes.Select(wt => wt);
    }

Simply explained, he’s creating a list of all the work types that are used in that particular row set.

We can simplify that quite a bit by removing some unnecessary code. First of all, HashSet<T>.Add fails gracefully if the item passed to it already exists in the collection. The method returns True if the item is added to the to the collection. If item already exists in the collection, then Add returns False.

In addition, the return statement is unnecessarily complex. Select(wt => wt) doesn’t do anything different; it just enumerates the collection and creates another IEnumerable.

Taking advantage of those observations, the simplified code becomes:

    private IEnumerable<long> GetTimeWorkedAndBreakWorkTypes(IEnumerable<Row> rows)
    {
        var workTypes = new HashSet<long>();
        foreach (var row in rows)
        {
            workTypes.Add(row.WorkTypeKey);
            if (row.IsBreaksLimited)
                workTypes.Add(row.BreakWorkTypeKey);
        }
        return workTypes;
    }

That’s pretty standard code, and about as good as you’re going to get with looping constructs. It’s not aesthetically pleasing, though. You can improve it with LINQ.

Anybody who’s worked with LINQ is familiar with the Select method and its variants to project each element of a sequence into a new form. If all he wanted to do was get a list of the individual rows’ WorkTypeKey values, for example, he could have written:

    return rows.Select(r => r.WorkTypeKey).Distinct();

But that only provides one item per row. The goal in the original code is to provide at least one but possibly two items per row. Select can’t do that. But SelectMany can. For example, if he wanted to unconditionally return both items, he could write:

    return rows.SelectMany(r => new long[] {r.WorkTypeKey, r.BreakWorkTypeKey}).Distinct();

SelectMany lets you provide a sequence for each item, and then flattens all of those sequences into a single sequence. So the result of the SelectMany would be a single sequence of long values.

Conditionally providing that second item, though, is a little more complicated. Here’s one way to do it.

    var wt =
        rows.SelectMany(
            r => r.IsBreaksLimited
                ? new[] { r.WorkTypeKey, r.BreakWorkTypeKey }
                : new[] { r.WorkTypeKey });
    return wt.Distinct();

Note that I didn’t specify long in the array declarations. The compiler can infer the type, so in this case new[] resolves to new long[]. Whether you take advantage of the type inference is a matter of style.

I see two common objections to this type of code. The first is that it’s inefficient because it creates a list and then does a Distinct call to cull the duplicates. But that’s not what happens. You have to keep in mind that LINQ is using lazy evaluation and deferred execution. The calls to SelectMany and Distinct don’t actually do anything until some code tries to enumerate the resulting sequence (by using foreach or by calling ToList(), for example). Even then, it’s not as though the code calls SelectMany to create a list and then Distinct to cull the duplicates. What really happens is that Distinct creates a HashSet and then iterates over the rows collection, checking each row in turn. Essentially, the code does this:

    var wt = new HashSet<long>();
    foreach (var row in rows)
    {
        var items = row.IsBreaksLimited
                        ? new[] {row.WorkTypeKey, row.BreakWorkTypeKey}
                        : new[] {row.WorkTypeKey};
        wt.UnionWith(items);
    }
    return wt;

So you see that the code doesn’t really create a temporary list and then de-dupe it. It does, however, create small temporary arrays, which causes some people to raise efficiency concerns. In truth, there is a small inefficiency there. But the garbage collector is optimized to efficiently handle allocation of many small, short-lived objects. So those little arrays shouldn’t cause a problem. I might think differently if the rows list were to contain millions of items, but for the handful of items it will typically contain, there’s no reason to worry about the small additional load on the garbage collector.

That said, what I really wanted to write was:

    // NOTE: This code doesn't compile!
    var wt =
        rows.SelectMany(r =>
        {
            yield return r.WorkTypeKey;
            if (r.IsBreaksLimited)
                yield return r.BreakWorkTypeKey;
        });
    return wt.Distinct();

Unfortunately, that code doesn’t compile. For reasons that I won’t go into, you can’t use yield return in a Lambda or anonymous method. You can, however, write a separate method to do that, resulting in:

    private IEnumerable GetRowWorkTypes(Row row)
    {
        yield return row.WorkTypeKey;
        if (row.IsBreaksLimited)
            yield return row.BreakWorkTypeKey;
    }

    private IEnumerable GetTimeWorkedAndBreakWorkTypes(IEnumerable rows)
    {
        var wt = rows.SelectMany(GetRowWorkTypes);
        return wt.Distinct();
    }

I find it somewhat annoying that I have to create a separate method in order to do it this way, but the result is clear and reasonably concise.

I’m still finding out new things about LINQ. SelectMany is one of those things that I don’t need very often, but comes in very handy from time to time, this case being a prime example.