Linq Index (part2 using LinqIndex)

In this second post of my Linq Index series i will show how you can use LinqIndex to improve the performance of your linq to object queries You can download it from codeplex (LinqIndex). But at this point i want to warn you that there are many cases when the linq query without LinqIndex is much faster. LinqIndex is only faster for some special cases and normally only from a big amout of data. (hundred thousand or millions of items). To show the use of LinqIndex i want to start with a simple where query.

using LinqIndex;
public class Customer
{
    public string CustomerID { get; set; }
    public string LastName { get; set; }
    public int Number { get; set; }
    public int Age { get; set; }
}
public static void Test()
{
    IEnumerable<Customer> customers = Customer.SampleData();
    IndexDefinition<Customer> indexPart = customers.CreateIndexKey(c => c.Number);
    var list = indexPart.Where(i => i.Number == 1000000).ToList();
}

This query takes some sample data and creates an IndexDefiniation for the Customers. Now the IndexDefinition contains the customers and knows for what Property of Customers it should build an index (in our case for Number). Now we make a where on the indexPart. Here LinqIndex comes into play. It analyses the lambda expression ( i => i.Number == 100000) and reconises that the Number is registerd at index property so it builds a hashtable with all the Numbers of all Customers and makes a lookup in this hashtable. And here we have the first problem, the building of the hashtable takes some time and so if you execute a query the first time it is slower that a standard linq query. If you execute it the second time (using the index for number) it is much faster. So here is the first restrictions of LinqIndex, if you execute the query with the index just one time it is always faster to use the standard linq to object Where extension method. To ilustrat what i mean i show a test i wrote for LinqIndex.

[TestMethod]
public void EqualIndexSpecification()
{
    IEnumerable<Customer> customers = Customer.SampleData();
    IndexDefinition<Customer> indexPart = customers.CreateIndexKey(c => c.Number);                
    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();
    var list = customers.Where(i => i.Number == 1000000).ToList();            
    stopWatch.Stop();    
    Debug.WriteLine("Where without index first time " + stopWatch.ElapsedMilliseconds);    
    stopWatch.Reset();
    stopWatch.Start();
    list = customers.Where(i => i.Number == 1000000).ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where without index second time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();
    list = indexPart.Where(i => i.Number == 1000000).ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where with equal index first time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();    
    list = indexPart.Where(i => i.Number == 1000000).ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where with equal index second time " + stopWatch.ElapsedMilliseconds);
}

As we see this test executes the the same query (with one million Customers) two times without LinqIndex and then two times with LinqIndex. This are the result times:

Where without index first time 181
Where without index second time 184
Where with equal index first time 2571
Where with equal index second time 0

So while the execution time of the standard Linq query is constant the execution time with the index drops from 2571 to 0 milliseconds. Now i want to show some more complicated where queries that can gain perfomance from LinqIndex.

IEnumerable<Customer> customers = Customer.SampleData();
IndexDefinition<Customer> indexPart = customers.CreateIndexKey(c => c.CustomerID).
    AddIndexKey(c => c.Number);
var list = indexPart.Where(c => c.CustomerID == "GOT1000" && c.Number == 1000).ToList();

Here LinqIndex uses a index for the first part of the query (c.CustomerId == “GOT1000”) and a second index for the second part of the query (c.Number == 1000). To do so LinqIndex takes the result from the first part of the query and builds a new index of this result and uses it in the second part of the query.

IEnumerable<Customer> customers = Customer.SampleData();
IndexDefinition<Customer> indexPart = customers.CreateIndexKey(c => c.CustomerID).
    AddIndexKey(c => c.Number);
var list = indexPart.Where(c => c.CustomerID == "GOT1000" || c.Number == 1000).ToList();

Here LinqIndex uses a index for both parts of the query and combinates the two parts to a result set. It is also possible to build queries that gain from LinqIndex with unequal operators (>, >=, <, =<).

IEnumerable<Customer> customers = Customer.SampleData();
IndexDefinition<Customer> indexPart = customers.CreateIndexKey(c => c.CustomerID).
    AddIndexKey(c => c.Number);
var list = indexPart.Where(i => i.Number > 1000000).ToList();

Here LinqIndex has one restrictions, it works only with integers that means i could not make this query with CustomerId. (maybe in the next version i will extend this functionality).

LinqIndex does also speed up Join and GroupJoin here are the definitions of the extension Methodes:

public static class IndexDefinitionExtensions
{
    public static IEnumerable<TResult> Join<TKey, TSource, TInner, TResult>(
        this IndexDefinition<TSource> source,
        IEnumerable<TInner> inner, Func<TInner, TKey> innerKeySelector,
        Expression<Func<TSource, TKey>> outerKeySelector,
        Func<TSource, TInner, TResult> resultSelector)

    public static IEnumerable<TResult> Join<TKey, TSource, TInner, TResult>(
        this IndexDefinition<TSource> source,
        IEnumerable<TInner> inner, Func<TInner, TKey> innerKeySelector,
        Expression<Func<TSource, TKey>> outerKeySelector,
        Func<TSource, TInner, TResult> resultSelector, 
        IEqualityComparer<TKey> equalityComparer)

    public static IEnumerable<TResult> GroupJoin<TKey, TSource, TInner, TResult>(
        this IndexDefinition<TSource> source,
        IEnumerable<TInner> inner, Func<TInner, TKey> innerKeySelector,
        Expression<Func<TSource, TKey>> outerKeySelector,
        Func<TSource, IEnumerable<TInner>, TResult> resultSelector)

    public static IEnumerable<TResult> GroupJoin<TKey, TSource, TInner, TResult>(
        this IndexDefinition<TSource> source,
        IEnumerable<TInner> inner, Func<TInner, TKey> innerKeySelector,
        Expression<Func<TSource, TKey>> outerKeySelector,
        Func<TSource, IEnumerable<TInner>, TResult> resultSelector,
        IEqualityComparer<TKey> equalityComparer)

    public static IEnumerable<TSource> Where<TSource>(
        this IndexDefinition<TSource> source,
        Expression<Func<TSource, bool>> predicate)
}

The next examples show how you could use LinqIndex to speed up Joins and GroupJoins. To do so i need a second IEnumerable to join with. In my case that are Orders.

public class Order
{
    public string CustomerID { get; set; }
    public string OrderNumber { get; set; }
}
var customers = Customer.SampleData();
var indexPart = customers.CreateIndexKey(c => c.CustomerID).
   AddIndexKey(c => c.Number);
var orders = Order.SampleData();
var result1 = indexPart.Join(orders,
                c => c.CustomerID,
                o => o.CustomerID,
                (c, o) => new { LastName = c.LastName, 
                Orders = o.OrderNumber }).ToList();

In this example the Join is using the CustomerID (a property of Customer i defined an index for) to build the index and to join it with the IEnumerable<Orders> collection. This also works with GroupJoin:

var customers = Customer.SampleData();
var indexPart = customers.CreateIndexKey(c => c.CustomerID).
       AddIndexKey(c => c.Number);
var orders = Order.SampleData();
var result1 = indexPart.GroupJoin(orders,
       c => c.CustomerID,
       o => o.CustomerID,
       (c, o) => new { LastName = c.LastName, Orders = o.Count() }).ToList();

So if you use a property in the innerKeySelector and for this property a index is defined then LinqIndex uses this index definition to build a index based at that property. If you have a lot of data this could improve the speed of your Join and GroupJoin queries. In my next post i will write about some considerations when generating an index for linq to object and how i implemented some of them in LinqIndex.

LinqIndex: (part1) (part2) (part3) (part4) (part5)

Advertisements


If you have a note or a question please write a comment.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s