Imagine you have two lists of names in alphabetical order. The first list is of people who drove from Point A to Point B on Monday, and the second list is of people who drove that route on Tuesday. You want to build a list of people who drove that route on either day, and you don’t want any duplicates. So if Joe, Mary, and Sam drove on Monday and Joe, Mary, and Ralph drove on Tuesday, then your list would contain [Joe, Mary, Ralph, Sam].
My MergeWithBy LINQ extension can easily merge the two lists to give [Joe, Joe, Mary, Mary, Ralph, Sam], but it doesn’t remove duplicates. I could make it remove duplicates easily enough by adding another parameter to the method, but it seems more useful to have a separate method that can remove duplicates from any sorted sequence.
LINQ’s existing Distinct method can do it, but it has at least two drawbacks:
- Because it works on unsorted sequences as well, it has to hold all of the distinct items it has seen in memory. This limits the size of sequences you can operate on.
- It requires that I define an equality comparer if I want to do custom comparisons.
The first is a problem because I often work with very large lists that won’t fit into memory. The second is an annoyance because if I want to sort, merge, and then uniquify, I end up with two different meanings of equality: one defined by the IComparer that I pass to OrderBy and MergeWithBy, and one defined by the IEqualityComparer implementation that I pass to Distinct. In addition, the syntax is different. That is, to merge and then uniquify:
```csharp
var merged = List1.MergeWithBy(List2, k => k.Age).ThenBy(k => k.LastName);
var unique = merged.Distinct(new ItemComparerByAgeAndName());
```
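For reference, ItemComparerByAgeAndName would look something like the following sketch. The Item class with Age and LastName properties is assumed from the example, and the hash code combination is one reasonable choice among many:

```csharp
// Sketch of the equality comparer that Distinct requires. The Item class
// with Age and LastName properties is assumed from the example above.
public class ItemComparerByAgeAndName : IEqualityComparer<Item>
{
    public bool Equals(Item x, Item y)
    {
        return x.Age == y.Age && x.LastName == y.LastName;
    }

    public int GetHashCode(Item obj)
    {
        // Combine the hash codes of the two key fields.
        return obj.Age.GetHashCode() * 31 + (obj.LastName ?? "").GetHashCode();
    }
}
```

That’s a fair amount of ceremony to express a comparison the ordering code already knows how to make.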
I’d much rather have the second line be:
```csharp
var unique = merged.UniqueBy(k => k.Age).ThenBy(k => k.LastName);
```
Removing duplicates from an ordered sequence is easy enough. For example, this code de-dupes an ordered list, using the passed comparer.
```csharp
static IEnumerable<TSource> Unique<TSource>(IEnumerable<TSource> theList, IComparer<TSource> comparer)
{
    var it = theList.GetEnumerator();
    var hasItems = it.MoveNext();
    if (!hasItems)
        yield break;

    // Always output the first item.
    yield return it.Current;
    var prev = it.Current;
    hasItems = it.MoveNext();
    while (hasItems)
    {
        // Output an item only if it differs from the previous one.
        if (comparer.Compare(prev, it.Current) != 0)
        {
            yield return it.Current;
            prev = it.Current;
        }
        hasItems = it.MoveNext();
    }
}
```
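For example, running it over a sorted list of integers with the default comparer gives each value once:

```csharp
var sorted = new List<int> { 1, 1, 2, 3, 3, 3, 5 };
foreach (var n in Unique(sorted, Comparer<int>.Default))
    Console.Write(n + " ");   // prints: 1 2 3 5
```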
Note that I used an IComparer&lt;T&gt; rather than an IEqualityComparer&lt;T&gt; here, which is different from what Distinct uses. I did that because the next step is to implement a LINQ extension, called UniqueBy, that works in the same way that OrderBy and my MergeWithBy methods work, including support for ThenBy and ThenByDescending.
I’m not going to spend any time describing how the UniqueBy and related methods work. The inner workings are sufficiently similar to the previously covered MergeWithBy that going over it again would be needless repetition. See the article I linked at the top of this post if you want more information.
UniqueBy will be faster than Distinct, and will use a lot less memory. Like my MergeWithBy, it only needs enough extra memory to store three items at any one time. But it only works on ordered sequences. If you want to uniquify an unordered list, you need to call Distinct, or you need to sort it first before calling UniqueBy. But Distinct will be faster if you have an unsorted list that fits in memory.
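To make the tradeoff concrete, here’s a sketch of the two options for an unordered input (the words array is hypothetical):

```csharp
var words = new[] { "pear", "apple", "pear", "fig", "apple" };

// Option 1: Distinct works directly on the unordered input, but holds
// every distinct item it has seen in memory as it goes.
var viaDistinct = words.Distinct();

// Option 2: sort first, then UniqueBy. Note that OrderBy must buffer the
// whole sequence, so this only saves memory when the input is already
// sorted (read from a sorted file, say, or produced by a merge).
var viaUniqueBy = words.OrderBy(w => w).UniqueBy(w => w);
```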
As I showed above, using the new method is really easy: it’s just like using OrderBy or my TopBy method.
One question that comes up is whether there is any benefit to calling UniqueBy before merging sequences. That is, which of these two is better?
```csharp
var foo = L1.UniqueBy(k => k).MergeWithBy(L2.UniqueBy(k => k), k => k).UniqueBy(k => k);
```
OR:
```csharp
var foo = L1.MergeWithBy(L2, k => k).UniqueBy(k => k);
```
It should be clear that the final UniqueBy is required in the first example. Otherwise you wouldn’t remove duplicates from the merged list; you’d only remove duplicates from each of the input lists. If “Joe” existed in both lists, then “Joe” would be output twice.
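To make that concrete with the opening example’s lists (a sketch, assuming the MergeWithBy extension from the earlier article is in scope):

```csharp
var monday  = new[] { "Joe", "Mary", "Sam" };    // already sorted
var tuesday = new[] { "Joe", "Mary", "Ralph" };  // already sorted

// Each input list is already duplicate-free, so per-list UniqueBy calls
// would do nothing. Yet "Joe" and "Mary" each appear in both lists, so
// only a UniqueBy applied after the merge removes those duplicates.
var either = monday.MergeWithBy(tuesday, k => k).UniqueBy(k => k);
// yields: Joe, Mary, Ralph, Sam
```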
You’re better off with the second example because it’s simpler, produces the same result as the first, will use slightly less memory, and will be ever so slightly faster than the first example unless each of the input lists has a high percentage of duplicate items. Even in pathological cases, though, the difference in execution speed will likely be so small as to be irrelevant. So you should stick with the simpler code.
Although merging two sorted lists into a single sorted list, potentially removing duplicates, is the most common type of merge I’ve used or seen used, it’s certainly not the only one. Merging is a technique that’s been used in transaction processing systems for at least 50 years. I’ll likely come back to it in the future.
Following is the complete code for the UniqueBy extension.
```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

namespace EnumerableExtensions
{
    public static class UniqueExtension
    {
        public static IOrderedEnumerable<TSource> UniqueBy<TSource, TKey>(
            this IEnumerable<TSource> list1,
            Func<TSource, TKey> keySelector)
        {
            return UniqueBy(list1, keySelector, null);
        }

        public static IOrderedEnumerable<TSource> UniqueBy<TSource, TKey>(
            this IEnumerable<TSource> list1,
            Func<TSource, TKey> keySelector,
            IComparer<TKey> comparer)
        {
            return new UniqueOrderedEnumerable<TSource, TKey>(list1, keySelector, comparer, false);
        }

        public static IOrderedEnumerable<TSource> UniqueByDescending<TSource, TKey>(
            this IEnumerable<TSource> list1,
            Func<TSource, TKey> keySelector)
        {
            return UniqueByDescending(list1, keySelector, null);
        }

        public static IOrderedEnumerable<TSource> UniqueByDescending<TSource, TKey>(
            this IEnumerable<TSource> list1,
            Func<TSource, TKey> keySelector,
            IComparer<TKey> comparer)
        {
            return new UniqueOrderedEnumerable<TSource, TKey>(list1, keySelector, comparer, true);
        }

        internal abstract class UniqueOrderedEnumerable<TSource> : IOrderedEnumerable<TSource>
        {
            private readonly IEnumerable<TSource> _list1;
            internal UniqueOrderedEnumerable<TSource> Parent;

            protected UniqueOrderedEnumerable(IEnumerable<TSource> list1)
            {
                _list1 = list1;
            }

            public IOrderedEnumerable<TSource> CreateOrderedEnumerable<TKey>(
                Func<TSource, TKey> keySelector,
                IComparer<TKey> comparer,
                bool descending)
            {
                var oe = new UniqueOrderedEnumerable<TSource, TKey>(
                    _list1, keySelector, comparer, descending) { Parent = this };
                return oe;
            }

            public IEnumerator<TSource> GetEnumerator()
            {
                var criteria = GetCriteria().ToArray();
                var selector = new UniqueSelector<TSource>(_list1, criteria);
                return selector.DoSelect().GetEnumerator();
            }

            // Walks the ordering criteria to build an array that the UniqueSelector can use.
            private IEnumerable<UniqueOrderedEnumerable<TSource>> GetCriteria()
            {
                var keys = new Stack<UniqueOrderedEnumerable<TSource>>();
                var current = this;
                while (current != null)
                {
                    keys.Push(current);
                    current = current.Parent;
                }
                return keys;
            }

            IEnumerator IEnumerable.GetEnumerator()
            {
                return GetEnumerator();
            }

            // The individual ordering criteria instances (UniqueOrderedEnumerable<TElement, TKey>)
            // implement these abstract methods to provide key extraction and comparison.
            public abstract void ExtractKey(TSource item, int ix);
            public abstract int CompareKeys(int x, int y);
        }

        internal class UniqueOrderedEnumerable<TSource, TKey> : UniqueOrderedEnumerable<TSource>
        {
            private readonly Func<TSource, TKey> _keySelector;
            private readonly IComparer<TKey> _comparer;
            private readonly bool _descending;
            private readonly TKey[] _keys;

            internal UniqueOrderedEnumerable(
                IEnumerable<TSource> list1,
                Func<TSource, TKey> keySelector,
                IComparer<TKey> comparer,
                bool descending)
                : base(list1)
            {
                _keySelector = keySelector;
                _comparer = comparer ?? Comparer<TKey>.Default;
                _descending = descending;
                _keys = new TKey[2];
            }

            public override int CompareKeys(int x, int y)
            {
                return _descending
                    ? _comparer.Compare(_keys[y], _keys[x])
                    : _comparer.Compare(_keys[x], _keys[y]);
            }

            public override void ExtractKey(TSource item, int ix)
            {
                _keys[ix] = _keySelector(item);
            }
        }

        internal class UniqueSelector<TSource>
        {
            private readonly IEnumerable<TSource> _list1;
            private readonly UniqueOrderedEnumerable<TSource>[] _criteria;

            public UniqueSelector(
                IEnumerable<TSource> list1,
                UniqueOrderedEnumerable<TSource>[] criteria)
            {
                _list1 = list1;
                _criteria = criteria;
            }

            public IEnumerable<TSource> DoSelect()
            {
                // Initialize the iterator
                var i1 = _list1.GetEnumerator();

                // ix controls the key position that's loaded
                var ix = 0;
                var next = new Func<bool>(() =>
                {
                    if (!i1.MoveNext()) return false;
                    ExtractKeys(i1.Current, ix);
                    return true;
                });

                var i1HasItems = next();
                if (!i1HasItems) yield break;

                // Output the first item
                yield return i1.Current;
                ix = 1;
                i1HasItems = next();
                while (i1HasItems)
                {
                    // Output the item if it's not equal to the previous item
                    if (Compare(0, 1) != 0)
                    {
                        yield return i1.Current;
                        // Change the index position for the next loaded key
                        ix = (ix == 0) ? 1 : 0;
                    }
                    i1HasItems = next();
                }
            }

            private int Compare(int x, int y)
            {
                var rslt = 0;
                for (var i = 0; rslt == 0 && i < _criteria.Length; ++i)
                {
                    rslt = _criteria[i].CompareKeys(x, y);
                }
                return rslt;
            }

            private void ExtractKeys(TSource item, int ix)
            {
                foreach (var t in _criteria)
                {
                    t.ExtractKey(item, ix);
                }
            }
        }
    }
}
```
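Usage of the descending variant mirrors the ascending case. For a sequence that’s already sorted in descending order, you’d write something like this (the people list and its Age and LastName properties are hypothetical):

```csharp
// Input must already be sorted descending by age, then descending by name.
var uniqueDesc = people
    .UniqueByDescending(p => p.Age)
    .ThenByDescending(p => p.LastName);
```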