More fun with LINQ: SelectMany

A coworker came to me with a little coding problem yesterday, and my investigation of the problem revealed a few things about C#, the .NET Framework, and LINQ. It all started with an email in which he decried the “ickiness” of this code.

    private IEnumerable<long> GetTimeWorkedAndBreakWorkTypes(IEnumerable<Row> rows)
    {
        var workTypes = new HashSet<long>();
        foreach (var row in rows)
        {
            if (!workTypes.Contains(row.WorkTypeKey))
                workTypes.Add(row.WorkTypeKey);
            if (row.IsBreaksLimited && !workTypes.Contains(row.BreakWorkTypeKey))
                workTypes.Add(row.BreakWorkTypeKey);
        }
        return workTypes.Select(wt => wt);
    }

Simply explained, he’s creating a list of all the work types that are used in that particular row set.

We can simplify that quite a bit by removing some unnecessary code. First of all, HashSet<T>.Add fails gracefully if the item passed to it already exists in the collection. The method returns True if the item is added to the to the collection. If item already exists in the collection, then Add returns False.

In addition, the return statement is unnecessarily complex. Select(wt => wt) doesn’t do anything different; it just enumerates the collection and creates another IEnumerable.

Taking advantage of those observations, the simplified code becomes:

    private IEnumerable<long> GetTimeWorkedAndBreakWorkTypes(IEnumerable<Row> rows)
    {
        var workTypes = new HashSet<long>();
        foreach (var row in rows)
        {
            workTypes.Add(row.WorkTypeKey);
            if (row.IsBreaksLimited)
                workTypes.Add(row.BreakWorkTypeKey);
        }
        return workTypes;
    }

That’s pretty standard code, and about as good as you’re going to get with looping constructs. It’s not aesthetically pleasing, though. You can improve it with LINQ.

Anybody who’s worked with LINQ is familiar with the Select method and its variants to project each element of a sequence into a new form. If all he wanted to do was get a list of the individual rows’ WorkTypeKey values, for example, he could have written:

    return rows.Select(r => r.WorkTypeKey).Distinct();

But that only provides one item per row. The goal in the original code is to provide at least one but possibly two items per row. Select can’t do that. But SelectMany can. For example, if he wanted to unconditionally return both items, he could write:

    return rows.SelectMany(r => new long[] {r.WorkTypeKey, r.BreakWorkTypeKey}).Distinct();

SelectMany lets you provide a sequence for each item, and then flattens all of those sequences into a single sequence. So the result of the SelectMany would be a single sequence of long values.

Conditionally providing that second item, though, is a little more complicated. Here’s one way to do it.

    var wt =
        rows.SelectMany(
            r => r.IsBreaksLimited
                ? new[] { r.WorkTypeKey, r.BreakWorkTypeKey }
                : new[] { r.WorkTypeKey });
    return wt.Distinct();

Note that I didn’t specify long in the array declarations. The compiler can infer the type, so in this case new[] resolves to new long[]. Whether you take advantage of the type inference is a matter of style.

I see two common objections to this type of code. The first is that it’s inefficient because it creates a list and then does a Distinct call to cull the duplicates. But that’s not what happens. You have to keep in mind that LINQ is using lazy evaluation and deferred execution. The calls to SelectMany and Distinct don’t actually do anything until some code tries to enumerate the resulting sequence (by using foreach or by calling ToList(), for example). Even then, it’s not as though the code calls SelectMany to create a list and then Distinct to cull the duplicates. What really happens is that Distinct creates a HashSet and then iterates over the rows collection, checking each row in turn. Essentially, the code does this:

    var wt = new HashSet<long>();
    foreach (var row in rows)
    {
        var items = row.IsBreaksLimited
                        ? new[] {row.WorkTypeKey, row.BreakWorkTypeKey}
                        : new[] {row.WorkTypeKey};
        wt.UnionWith(items);
    }
    return wt;

So you see that the code doesn’t really create a temporary list and then de-dupe it. It does, however, create small temporary arrays, which causes some people to raise efficiency concerns. In truth, there is a small inefficiency there. But the garbage collector is optimized to efficiently handle allocation of many small, short-lived objects. So those little arrays shouldn’t cause a problem. I might think differently if the rows list were to contain millions of items, but for the handful of items it will typically contain, there’s no reason to worry about the small additional load on the garbage collector.

That said, what I really wanted to write was:

    // NOTE: This code doesn't compile!
    var wt =
        rows.SelectMany(r =>
        {
            yield return r.WorkTypeKey;
            if (r.IsBreaksLimited)
                yield return r.BreakWorkTypeKey;
        });
    return wt.Distinct();

Unfortunately, that code doesn’t compile. For reasons that I won’t go into, you can’t use yield return in a Lambda or anonymous method. You can, however, write a separate method to do that, resulting in:

    private IEnumerable GetRowWorkTypes(Row row)
    {
        yield return row.WorkTypeKey;
        if (row.IsBreaksLimited)
            yield return row.BreakWorkTypeKey;
    }

    private IEnumerable GetTimeWorkedAndBreakWorkTypes(IEnumerable rows)
    {
        var wt = rows.SelectMany(GetRowWorkTypes);
        return wt.Distinct();
    }

I find it somewhat annoying that I have to create a separate method in order to do it this way, but the result is clear and reasonably concise.

I’m still finding out new things about LINQ. SelectMany is one of those things that I don’t need very often, but comes in very handy from time to time, this case being a prime example.