Reading a csv file into a DataTable with Linq

Some times ago i had the requirement to read a csv file into a DataTable and doint it the classical iterative way seems boring to me so i thought about the make all with one linq statement princip and tried to accomplish the whole processing with one linq statement. That are my results i want to share. I stated with the following statement.

public DataTable ImportFirstAdept(FileInfo fileInfo, char splitChar, Encoding encoding)
{
    var plainfileName = Path.GetFileName(fileInfo.FullName);
    DataTable dataTable = new DataTable(plainfileName);
    var allRows = ((Func)(() => File.ReadAllLines(fileInfo.FullName, encoding)))();
    allRows.
        Select(str => str.Split(splitChar)).
        First().
        Select(columnHeader => { return new DataColumn(columnHeader.Trim()); }).
        Foreach(column =>
        {
            dataTable.Columns.Add(column);
            return column;
        }).
        ToList();
    allRows.
        Select(str => str.Split(splitChar)).
        Skip(1).
        Select(strArray =>
        {
            var strings = new List();
            foreach (var s in strArray)
                strings.Add(s.Trim());
            return strings.ToArray();
        }).
        Select(st => dataTable.Rows.Add(st)).
        ToList();
    return dataTable;
}

The follwoing test shows how it works.

public void CsvToDataTableTest(string path)
{
    DirectoryInfo directoryInfo = new DirectoryInfo(path);            
    CsvImport cvsImport = new CsvImport();
    DataSet dataSet = new DataSet();
    if (directoryInfo.Exists)
    {
        foreach (var file in directoryInfo.EnumerateFiles("*.csv"))
        {
            DataTable dataTable = cvsImport.ImportFirstAdept(file, '\t', Encoding.Default);
            dataSet.Tables.Add(dataTable);
        }
    }
}

[TestMethod]
public void CsvToDataTablePrequelsTest()
{
    CsvToDataTableTest(@"CsvToDataTableWithLinq\bin\Debug");
}

If we look at the ImportFirstAdept method it does not use one linq statement it uses two linq statements. So that was not sufficiently and i wrote some extension methodes to only use one Linq statement for the processing of the csv file.

public static class CsvImportExtension
{
    public static void If(this IEnumerable source, Func ifFunc, 
        Action ifAction, Action elseAction)
    {
        var counter = 1;
        List ifActions = new List();
        List elseActions = new List();
        foreach (var item in source)
        {
            if (ifFunc(item, counter))                                    
            {
                ifActions.Add(item);
            }                           
            else
                elseActions.Add(item);
            counter++;                
        }
        ifAction(ifActions.AsEnumerable());
        elseAction(elseActions.AsEnumerable());
    }

    public static void If(this IEnumerable source, Func ifFunc,
        Action ifAction, Action elseAction)
    {
        List ifActions = new List();
        List elseActions = new List();
        foreach (var item in source)
        {
            if (ifFunc(item))
            {
                ifActions.Add(item);
            }
            else
                elseActions.Add(item);
        }
        ifAction(ifActions.AsEnumerable());
        elseAction(elseActions.AsEnumerable());
    }

    public static IEnumerable Foreach(this IEnumerable source, Func foreachFunc)
    {
        foreach (var item in source)
        {
            yield return foreachFunc(item);
        }
    }

    public static void Foreach(this IEnumerable source, Action foreachAction)
    {
        foreach (var item in source)
        {
            foreachAction(item);
        }
    }
}

That is the ImportSecondAdept method that uses the if and foreach extension methodes to only use one linq statement to process the csv file.

public DataTable ImportSecondAdept(FileInfo fileInfo, char splitChar, Encoding encoding)
{
    var plainfileName = Path.GetFileName(fileInfo.FullName);
    var dataTable = new DataTable(plainfileName);
    ((Func)(() => File.ReadAllLines(fileInfo.FullName, encoding)))().
        Foreach(str => str.Split(splitChar)).
        If((t, i) => i == 1,
        t =>
        {
            t.First().
            Foreach(columnHeader => { return new DataColumn(columnHeader.Trim()); }).
            Foreach(column =>
            {
                dataTable.Columns.Add(column);
            });                    
        },
        t =>
        {
            t.
            Foreach(strArray =>
            {
                var strings = new List();
                foreach (var s in strArray)
                    strings.Add(s.Trim());
                return strings.ToArray();
            }).
            Foreach(str => dataTable.Rows.Add(str)).
            ToList();                    
        }
        );           
    return dataTable;
}

That is the test to try it and the result of the second adept method.

public void CsvToDataTableTest(string path)
{
    DirectoryInfo directoryInfo = new DirectoryInfo(path);            
    CsvImport cvsImport = new CsvImport();
    DataSet dataSet = new DataSet();
    if (directoryInfo.Exists)
    {
        foreach (var file in directoryInfo.EnumerateFiles("*.csv"))
        {
            DataTable dataTable = cvsImport.ImportSecondAdept(file, '\t', Encoding.Default);
            dataSet.Tables.Add(dataTable);
        }
    }
}

[TestMethod]
public void CsvToDataTablePrequelsTest()
{
    CsvToDataTableTest(@"CsvToDataTableWithLinq\bin\Debug");
}

csvToDataSetWithLinq

So now it is only one linq statement and the one linq statement princip is suffused.

 

Advertisements

Distinct a DataTable with c#

Today i want to write about the different possibilities to distinct the data in a DataTable. I had that problem some times ago and searched for a good solution. I found some different solutions and in that post i will show them. First i want to show the test i wrote to test the distinct of the DataTable and the data stored in the DataTable.

[TestMethod]
public void Distinct()
{
    DataTable dataTable = new DataTable("Distinct");
    var oldPK = new DataColumn() { ColumnName = "PK", DataType = typeof(int) };
    dataTable.Columns.Add(oldPK);
    var secondPK = new DataColumn() { ColumnName = "secondPK", DataType = typeof(int) };
    dataTable.Columns.Add(secondPK);
    dataTable.Columns.Add(new DataColumn() 
        { ColumnName = "firstname", DataType = typeof(string) });
    dataTable.Columns.Add(new DataColumn() 
        { ColumnName = "secondname", DataType = typeof(string) });
    dataTable.Columns.Add(new DataColumn() 
        { ColumnName = "socialSecurityNumber", DataType = typeof(int) });
    dataTable.PrimaryKey = new DataColumn[] { oldPK, secondPK };
    foreach (var number in Enumerable.Range(1, 3))
    {
        DataRow dataRow = dataTable.NewRow();
        dataRow["PK"] = number;
        dataRow["secondPK"] = number + 1;
        dataRow["firstname"] = "mathias";
        dataRow["secondname"] = "schober";
        dataRow["socialSecurityNumber"] = 4040;
        dataTable.Rows.Add(dataRow);
    }
    foreach (var number in Enumerable.Range(4, 2))
    {
        DataRow dataRow = dataTable.NewRow();
        dataRow["PK"] = number;
        dataRow["secondPK"] = number + 1;
        dataRow["firstname"] = "anna";
        dataRow["secondname"] = "schober";
        dataRow["socialSecurityNumber"] = 5050;
        dataTable.Rows.Add(dataRow);
    }
    var newDataTable = 
        dataTable.Distinct("socialSecurityNumber");
    var newDataViewDataTable = 
        dataTable.DistinctDataView("socialSecurityNumber");
    var newEqualityCompareDataTable = 
        dataTable.DistinctEqualityCompare("socialSecurityNumber");
    var newGenericEqualityCompareDataTable = 
        dataTable.DistinctGenericEqualityCompare("socialSecurityNumber");         
}

DistinctDataTable
As we see i wrote four different methods to distinct the DataTable. The first on Distinct is the classical way. I create a new DataTable, take the original DataTable, enumerate all rows and check if the value of the overtaken column (in the test i use socialSecurityNumber) is already in the new DataTable. Then i set the primary key of the new DataTable and reorder the columns to bring the primary key (overtaken column) to the first position. Here you see the code to do so.

public static class Distincting
{
    public static DataTable Distinct(this DataTable dataTable, 
        string columName)
    {
        DataColumn[] dataColumns = dataTable.PrimaryKey;
        if (dataColumns.Length >= 1)
        {
            foreach (DataColumn primaryKey in dataColumns)
            {
                if (primaryKey.ColumnName == columName)
                {
                    return dataTable;
                }                    
            }                
        }            
        return RebuildDataTable(dataTable, columName);
    }

    private static DataTable RebuildDataTable(DataTable dataTable, 
        string columName)
    {
        DataTable newDataTable = dataTable.Clone();
        DataColumn newDataColumn = newDataTable.Columns[columName];
        newDataTable.PrimaryKey = new DataColumn[] { newDataColumn };
        newDataTable.Columns[columName].SetOrdinal(0);
        dataTable.Rows.Foreach(dataRow =>
        {
            object pk = dataRow[columName];
            if (newDataTable.Rows.Find(pk) == null)
            {
                newDataTable.ImportRow(dataRow);
            }
        });
        ReorderPrimaryKey(dataTable, newDataTable);
        return newDataTable;
    }

    private static void ReorderPrimaryKey(DataTable dataTable, 
        DataTable newDataTable)
    {
        int maxOrdinal = 0;
        foreach (DataColumn column in newDataTable.Columns)
        {
            if (column.Ordinal > maxOrdinal)
                maxOrdinal = column.Ordinal;
        }
        DataColumn[] dataColumns = dataTable.PrimaryKey;
        if (dataColumns.Length >= 1)
        {
            foreach (DataColumn primaryKey in dataColumns)
            {
                newDataTable.Columns[primaryKey.ColumnName].
                    SetOrdinal(maxOrdinal);
            }
        }
    }
}
public static class ForEachEnumeration
{
    public static void Foreach(this DataRowCollection dataRows, 
Action action)
    {
        foreach(T dataRow in dataRows)
        {
            action(dataRow);
        }
    }
}

After distincting with the social sercurity number we have the following DataTable.
DistinctDataTableVersion1

The second possibility to distinct a DataTable is using the ToTable method of the DataView. In the next code part you can see how that works.

public static DataTable DistinctDataView(this DataTable dataTable,
        string columName)
{
    DataView view = new DataView(dataTable);
    return view.ToTable(true, new string[] { columName });
}

That is the resulting DataTable. As we see it contains only the social security number. The other columns are gone.
DistinctDataTableVersion2

After the classic version and the DataView version it is also possible to solve the problem with the linq Distinct Extension method.

public static DataTable DistinctEqualityCompare(this DataTable dataTable, 
    string columName)
{
    DataRowEqaulityComparer dataRowEqualityComparer = 
        new DataRowEqaulityComparer(columName);
    return dataTable.AsEnumerable().
        Distinct(dataRowEqualityComparer).CopyToDataTable();
}

To accomplish that we need a EqualityComparer.

public class DataRowEqualityComparer : IEqualityComparer
{
    private string distinctColumnName;

    public DataRowEqualityComparer(string distinctColumnName)
    {
        this.distinctColumnName = distinctColumnName;
    }

    public bool Equals(DataRow x, DataRow y)
    {
        if (x.Field<object>(distinctColumnName).
            Equals(y.Field<object>(distinctColumnName)))
        {
            return true;
        }
        return false;
    }

    public int GetHashCode(DataRow obj)
    {
        return obj.Field<object>(distinctColumnName).GetHashCode();
    }
}

Or we could also use a Generic EqualityComparer. (implementing a generic IEqualityComparer and IComparer class)

public static DataTable DistinctGenericEqualityCompare(this DataTable dataTable,
    string columName)
{
    EqualityComparer dataRowEqualityComparer = 
        new EqualityComparer(
        (x, y) => x.Field<object>(columName).Equals(y.Field<object>(columName)),
        x => x.Field<object>(columName).GetHashCode()
        );
    return dataTable.AsEnumerable().
        Distinct(dataRowEqualityComparer).CopyToDataTable();
}
public class EqualityComparer : IEqualityComparer
{
    private Func<T, T, bool> equalsFunction;
    private Func<T, int> getHashCodeFunction;

    public EqualityComparer(Func<T, T, bool> equalsFunction)
    {
        this.equalsFunction = equalsFunction;
    }

    public EqualityComparer(Func<T, T, bool> equalsFunction,
        Func<T, int> getHashCodeFunction)
        : this(equalsFunction)
    {
        this.getHashCodeFunction = getHashCodeFunction;
    }

    public bool Equals(T a, T b)
    {
        return equalsFunction(a, b);
    }

    public int GetHashCode(T obj)
    {
        if (getHashCodeFunction == null)
            return obj.GetHashCode();
        return getHashCodeFunction(obj);
    }
}

That is the resulting DataTable.
DistinctDataTableVersion3
That are all possibilities i could think of to distinct a DataTable.


Linq Index (part3 speeding up a linq query)

In this post i will first show some more complex LinqIndex queries. I want to start with some unequal where queries. In this example i want to query all Customers with a number that is bigger than a specific value.

IEnumerable<Customer> customers = Customer.SampleData();
IndexDefinition<Customer> indexPart = customers.CreateIndexKey(c => c.Number).
    AddIndexKey(c => c.CustomerID);
var list = indexPart.Where(c => c.Number > 10).ToList();

When we look at this query we see that generating a hashtable and using it as index is useless here. If we have an equal where query (like c => c.Number == 10) we generate a hashtable and look in this hashtable for the value 10. But if we have the unequal query (like c => c.Number > 10) we have to lookup many values an we have only the starting value (10). To speed up such a query we need a tree as index. To be more specific we need a balanced tree to speed up the query. There are a lot of algorithms to build balanced trees. (avl trees, splay trees, red-black-trees …) I have choosen a avl tree because i implemented one some times ago and so i could use it for LinqIndex. The problem i had was that my implementation was too slow so i had to reimplement it. It is very importand that the insertion speed in the balanced tree is as fast a possible because i am always generating a new avl tree when an index is needed. So here is an example to show the performance with and without LinqIndex when querying on 100010 Customers.

[TestMethod]
public void UnequalIndexSpecification()
{
    IEnumerable<Customer> customers = Customer.SampleData();
    IndexDefinition<Customer> indexPart = customers.CreateIndexKey(c => c.Number);                            
    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();
    var list = customers.Where(c => c.Number > 1000000).ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where for unequal condition without index first time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();
    list = customers.Where(c => c.Number > 1000000).ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where for unequal condition without index second time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();
    list = indexPart.Where(c => c.Number > 1000000).ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where for unequal condition with unequal index first time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();
    list = indexPart.Where(c => c.Number > 1000000).ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where for unequal condition with unequal index second time " + stopWatch.ElapsedMilliseconds);
}

As we see this test executes the same query (with one million Customers) two times without an Index and then two times with an unequal index. This are the result times:

Where for unequal condition without index first time 231
Where for unequal condition without index second time 224
Where for unequal condition with unequal index first time 14738
Where for unequal condition with unequal index second time 0

So while the execution time of the standard Linq query is constant the execution time with the index drops from 14738 to 0 milliseconds. Here we see how long it takes to generate the avl tree. So it is only useful to use LinqIndex for such queries if it is possible to precreate the index or if the index gets used very often.

So now i have described the use of LinqIndex with simple where queries (equal and unequal compare), Join and GroupJoin. But what is with more complicated queries that uses and (&&)  and/or or (||) expressions. Implimenting such queries and gaining speed from the generated Index is very difficult because when combinating expressions (like c.Number == 10 && c.LastName == “Gottshall”) we have to decide if we use an existing index, generate a new index or don’t use an index. If we can speed up the query by using an index for the first part of the expression (c.Number == 10) that does not mean that it is a good decision to use an index for the second part of the expression (c.LastName == “Gottshall”). So complicated queries that uses LinqIndex are slower than the standard linq queries most of the time. Now i will show one example with a logical and (&&) and one with a logical or (||). So in this first example i will use a where on two properties. For both properties (c.Number and c.LastName) a index is defined.

[TestMethod]
public void EqualIndexSpecification()
{
    IEnumerable<Customer> customers = Customer.SampleData();
    IndexDefinition<Customer> indexPart = customers.CreateIndexKey(c => c.Number)
        .AddIndexKey(c => c.LastName);                
    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();
    var list = customers.Where(i => i.Number > 1000000 && i.LastName == "Gottshall").ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where without index first time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();
    list = customers.Where(i => i.Number > 1000000 && i.LastName == "Gottshall").ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where without index second time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();
    list = indexPart.Where(i => i.Number > 1000000 && i.LastName == "Gottshall").ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where with equal index first time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();
    list = indexPart.Where(i => i.Number > 1000000 && i.LastName == "Gottshall").ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where with equal index second time " + stopWatch.ElapsedMilliseconds);
}

Again i am executing the same query two times without an Index and then two times with an index (one unequal and one equal). This are the result times:

Where without index first time 206
Where without index second time 199
Where with equal index first time 19859
Where with equal index second time 0

Again it was possible to bring the execution time with the index from 19859 to 0 milliseconds. If you execute such a query with LinqIndex it will first build a avl tree search for the Number in it, take the result build a hashtable out of it and search for the LastName in it. If you define no index for the LastName, LinqIndex would build the avl tree, search for Number in it and then it would use a standard linq where to find the LastName. If i use a logical or instead of the logical and the query would look as the following.

[TestMethod]
public void EqualIndexSpecification()
{
    IEnumerable<Customer> customers = Customer.SampleData();
    IndexDefinition<Customer> indexPart = customers.CreateIndexKey(c => c.Number)
        .AddIndexKey(c => c.LastName);
    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();
    var list = customers.Where(i => i.Number > 1000000 || i.LastName == "Gottshall").ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where without index first time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();
    list = customers.Where(i => i.Number > 1000000 || i.LastName == "Gottshall").ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where without index second time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();
    list = indexPart.Where(i => i.Number > 1000000 || i.LastName == "Gottshall").ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where with equal index first time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();
    list = indexPart.Where(i => i.Number > 1000000 || i.LastName == "Gottshall").ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where with equal index second time " + stopWatch.ElapsedMilliseconds);
}

If we execute this query query two times without an Index and then two times with an index (one unequal and one equal) the execution times are different.

Where without index first time 337
Where without index second time 337
Where with equal index first time 24102
Where with equal index second time 742

Here LinqIndex is also slower than the standard where query the second time when executing. That happens because when using a logical or i have to find both values in both indexes and then i have to create a hashset to combine the results. That costs more time than traversing all items and check for both conditions. So when building where queries with logical or in it LinqIndex gains no speed and should not be used.

LinqIndex: (part1) (part2) (part3) (part4) (part5)


Linq Index (part2 using LinqIndex)

In this second post of my Linq Index series i will show how you can use LinqIndex to improve the performance of your linq to object queries You can download it from codeplex (LinqIndex). But at this point i want to warn you that there are many cases when the linq query without LinqIndex is much faster. LinqIndex is only faster for some special cases and normally only from a big amout of data. (hundred thousand or millions of items). To show the use of LinqIndex i want to start with a simple where query.

using LinqIndex;
public class Customer
{
    public string CustomerID { get; set; }
    public string LastName { get; set; }
    public int Number { get; set; }
    public int Age { get; set; }
}
public static void Test()
{
    IEnumerable<Customer> customers = Customer.SampleData();
    IndexDefinition<Customer> indexPart = customers.CreateIndexKey(c => c.Number);
    var list = indexPart.Where(i => i.Number == 1000000).ToList();
}

This query takes some sample data and creates an IndexDefiniation for the Customers. Now the IndexDefinition contains the customers and knows for what Property of Customers it should build an index (in our case for Number). Now we make a where on the indexPart. Here LinqIndex comes into play. It analyses the lambda expression ( i => i.Number == 100000) and reconises that the Number is registerd at index property so it builds a hashtable with all the Numbers of all Customers and makes a lookup in this hashtable. And here we have the first problem, the building of the hashtable takes some time and so if you execute a query the first time it is slower that a standard linq query. If you execute it the second time (using the index for number) it is much faster. So here is the first restrictions of LinqIndex, if you execute the query with the index just one time it is always faster to use the standard linq to object Where extension method. To ilustrat what i mean i show a test i wrote for LinqIndex.

[TestMethod]
public void EqualIndexSpecification()
{
    IEnumerable<Customer> customers = Customer.SampleData();
    IndexDefinition<Customer> indexPart = customers.CreateIndexKey(c => c.Number);                
    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();
    var list = customers.Where(i => i.Number == 1000000).ToList();            
    stopWatch.Stop();    
    Debug.WriteLine("Where without index first time " + stopWatch.ElapsedMilliseconds);    
    stopWatch.Reset();
    stopWatch.Start();
    list = customers.Where(i => i.Number == 1000000).ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where without index second time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();
    list = indexPart.Where(i => i.Number == 1000000).ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where with equal index first time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();    
    list = indexPart.Where(i => i.Number == 1000000).ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where with equal index second time " + stopWatch.ElapsedMilliseconds);
}

As we see this test executes the the same query (with one million Customers) two times without LinqIndex and then two times with LinqIndex. This are the result times:

Where without index first time 181
Where without index second time 184
Where with equal index first time 2571
Where with equal index second time 0

So while the execution time of the standard Linq query is constant the execution time with the index drops from 2571 to 0 milliseconds. Now i want to show some more complicated where queries that can gain perfomance from LinqIndex.

IEnumerable<Customer> customers = Customer.SampleData();
IndexDefinition<Customer> indexPart = customers.CreateIndexKey(c => c.CustomerID).
    AddIndexKey(c => c.Number);
var list = indexPart.Where(c => c.CustomerID == "GOT1000" && c.Number == 1000).ToList();

Here LinqIndex uses a index for the first part of the query (c.CustomerId == “GOT1000”) and a second index for the second part of the query (c.Number == 1000). To do so LinqIndex takes the result from the first part of the query and builds a new index of this result and uses it in the second part of the query.

IEnumerable<Customer> customers = Customer.SampleData();
IndexDefinition<Customer> indexPart = customers.CreateIndexKey(c => c.CustomerID).
    AddIndexKey(c => c.Number);
var list = indexPart.Where(c => c.CustomerID == "GOT1000" || c.Number == 1000).ToList();

Here LinqIndex uses a index for both parts of the query and combinates the two parts to a result set. It is also possible to build queries that gain from LinqIndex with unequal operators (>, >=, <, =<).

IEnumerable<Customer> customers = Customer.SampleData();
IndexDefinition<Customer> indexPart = customers.CreateIndexKey(c => c.CustomerID).
    AddIndexKey(c => c.Number);
var list = indexPart.Where(i => i.Number > 1000000).ToList();

Here LinqIndex has one restrictions, it works only with integers that means i could not make this query with CustomerId. (maybe in the next version i will extend this functionality).

LinqIndex does also speed up Join and GroupJoin here are the definitions of the extension Methodes:

public static class IndexDefinitionExtensions
{
    public static IEnumerable<TResult> Join<TKey, TSource, TInner, TResult>(
        this IndexDefinition<TSource> source,
        IEnumerable<TInner> inner, Func<TInner, TKey> innerKeySelector,
        Expression<Func<TSource, TKey>> outerKeySelector,
        Func<TSource, TInner, TResult> resultSelector)

    public static IEnumerable<TResult> Join<TKey, TSource, TInner, TResult>(
        this IndexDefinition<TSource> source,
        IEnumerable<TInner> inner, Func<TInner, TKey> innerKeySelector,
        Expression<Func<TSource, TKey>> outerKeySelector,
        Func<TSource, TInner, TResult> resultSelector, 
        IEqualityComparer<TKey> equalityComparer)

    public static IEnumerable<TResult> GroupJoin<TKey, TSource, TInner, TResult>(
        this IndexDefinition<TSource> source,
        IEnumerable<TInner> inner, Func<TInner, TKey> innerKeySelector,
        Expression<Func<TSource, TKey>> outerKeySelector,
        Func<TSource, IEnumerable<TInner>, TResult> resultSelector)

    public static IEnumerable<TResult> GroupJoin<TKey, TSource, TInner, TResult>(
        this IndexDefinition<TSource> source,
        IEnumerable<TInner> inner, Func<TInner, TKey> innerKeySelector,
        Expression<Func<TSource, TKey>> outerKeySelector,
        Func<TSource, IEnumerable<TInner>, TResult> resultSelector,
        IEqualityComparer<TKey> equalityComparer)

    public static IEnumerable<TSource> Where<TSource>(
        this IndexDefinition<TSource> source,
        Expression<Func<TSource, bool>> predicate)
}

The next examples show how you could use LinqIndex to speed up Joins and GroupJoins. To do so i need a second IEnumerable to join with. In my case that are Orders.

public class Order
{
    public string CustomerID { get; set; }
    public string OrderNumber { get; set; }
}
var customers = Customer.SampleData();
var indexPart = customers.CreateIndexKey(c => c.CustomerID).
   AddIndexKey(c => c.Number);
var orders = Order.SampleData();
var result1 = indexPart.Join(orders,
                c => c.CustomerID,
                o => o.CustomerID,
                (c, o) => new { LastName = c.LastName, 
                Orders = o.OrderNumber }).ToList();

In this example the Join is using the CustomerID (a property of Customer i defined an index for) to build the index and to join it with the IEnumerable<Orders> collection. This also works with GroupJoin:

var customers = Customer.SampleData();
var indexPart = customers.CreateIndexKey(c => c.CustomerID).
       AddIndexKey(c => c.Number);
var orders = Order.SampleData();
var result1 = indexPart.GroupJoin(orders,
       c => c.CustomerID,
       o => o.CustomerID,
       (c, o) => new { LastName = c.LastName, Orders = o.Count() }).ToList();

So if you use a property in the innerKeySelector and for this property a index is defined then LinqIndex uses this index definition to build a index based at that property. If you have a lot of data this could improve the speed of your Join and GroupJoin queries. In my next post i will write about some considerations when generating an index for linq to object and how i implemented some of them in LinqIndex.

LinqIndex: (part1) (part2) (part3) (part4) (part5)


Linq Index (part1 introduction)

About two years ago i worked with some linq to object queries that processed a lot of data. There i thought about implementing some indexing strategy for Linq to objects. I tried some things but then i lost interest. About half a year ago i had the same problem again and so i started thinking about this indexing thing again. So i started to implement the LinqIndex project. You can download it from codeplex (LinqIndex) (the project contains the test i wrote to check the perfomance). In this and the next blog entries i will write about the problems i had while implementing it, how i solved some of this problems and i will show how to use LinqIndex.
For introduction i have to say that the LinqIndex project should not be used very carefully in productive code because there are many cases that are not covered by LinqIndex and the error handling is very poor.
So first of all i will show the structure of LinqIndex and what parts it contains.
The most important parts of LinqIndex are the IndexSpecification and the IndexDefinitionExtensions. This image shows the uml diagramm of this two classes.

With the IndexSpecification class it is possible to specifie an index. The IndexDefinitionExtension contains the Where, Join and GroupJoin extension methodes to use the spefied index columns. If you want to use LinqIndex that are the two classes you need. But to understand what LinqIndex does we need the classes from the LinqIndex.IndexBuilding namespace.

In the LinqIndex.IndexBuilding namespace the most importand classes are the EqualIndex, Index classes that represent the created index. In my next post i will show how you could use LinqIndex to speed up linq queries.

LinqIndex: (part1) (part2) ((part3)) (part4) (part5)


implementing a generic IEqualityComparer and IComparer class

Today i will write about the generic implementation of IComparer and IEqualityComparer. I read about this topic some times ago and now i wrote a linq join where i needed a specific EqualityComparer. So i decided to implement a generic version. The base of the Implementation is the generic IEqualityComparer<T> interface,

public interface IEqualityComparer<in T>
{
  bool Equals(T x, T y);
  bool int GetHashCode(T x);
}

and the generic IComparer<T> interface.

public interface IComparer<in T>
{
  int Compare(T x,T y)
}

If you use the IEqualityComparer<in T> interface you have to implement the Equals method that checks the equality of the two overtaken objects of type T and you have to implement the GetHashCode method that generates a customized  hashcode for the comparer object. This IEqualityComparer interface is widley used in Linq. For example at the Join, GroupJoin or Except extension methodes.

If you use the IComparer<in T> inferface you have to implement the Compare method that takes two objects of type T and compares them. If a is smaller than b it returns -1 if a and b are equal it returns 0 and if a is bigger than b it returns 1. The IComparer interface is used to sort datastructures. The List<T>.Sort method takes a comparer to decide how to sort a list.

The generic implemetation of the two interfaces is very easy. We use Func delegates to overtake the methodes dynmically and call that delegates if the Equals, GetHashCode and Compare methodes are called. Here is the implementation of the EqualityComparer.

public class EqualityComparer<T> : IEqualityComparer<T>
{
    private Func<T, T, bool> equalsFunction;
    private Func<T, int> getHashCodeFunction;

    public EqualityComparer(Func<T, T, bool> equalsFunction)
    {
        this.equalsFunction = equalsFunction;
    }

    public EqualityComparer(Func<T, T, bool> equalsFunction, 
        Func<T, int> getHashCodeFunction) : this(equalsFunction)
    {
        this.getHashCodeFunction = getHashCodeFunction;
    }

    public bool Equals(T a, T b)
    {
        return equalsFunction(a, b);
    }

    public int GetHashCode(T obj)
    {
        if (getHashCodeFunction == null)
            return obj.GetHashCode();
        return getHashCodeFunction(obj);
    }
}

And the use of the Equality Comparer:

List<Person> firstPersonList = new List<Person>() 
    { new Person() { Name = "mathias", Age = 36 }, 
        new Person() { Name = "karoline", Age = 35 } };

List<Person> secondPersonList = new List<Person>() { 
    new Person() { Name = "karl", Age = 36 }, 
    new Person() { Name = "sepp", Age = 36 }, 
    new Person() { Name = "emil", Age = 35 }, 
    new Person() { Name = "mia", Age = 35 }, 
    new Person() { Name = "hans", Age = 36 } };

EqualityComparer<Person> equalityComparer = new
    EqualityComparer<Person>((s1, s2) => s1.Age == s2.Age);

var groupJoinItems = firstPersonList.GroupJoin(
    secondPersonList, 
    p => p, 
    p => p,
    (p1, p2) => new { Name = p1.Name, Person = p2 }, 
    equalityComparer);

foreach (var item in groupJoinItems)
{
    Console.WriteLine("Name {0}", item.Name);
    foreach (var groupItem in item.Person)
    {
        var secondname = groupItem.Name;
        Console.WriteLine(" Name {0} / Age {1}", 
            groupItem.Name, groupItem.Age);
    }
}

Result:

Here is the implementation of the Comparer.

public class Comparer<T> : IComparer<T>
{
    private Func<T, T, int> comparerFunction;

    public Comparer(Func<T, T, int> comparerFunction)
    {
        this.comparerFunction = comparerFunction;
    }

    public int Compare(T a, T b)
    {
        if (object.ReferenceEquals(a, b))
            return 0;
        return comparerFunction(a, b);
    }
}

And the use of the Comparer<T>

List<Person> personList = new List<Person>() { 
    new Person() { Name = "karl", Age = 36 }, 
    new Person() { Name = "sepp", Age = 23 }, 
    new Person() { Name = "emil", Age = 3 }, 
    new Person() { Name = "mia", Age = 2 }, 
    new Person() { Name = "hans", Age = 26 } };

Comparer<Person> comparer =
    new Comparer<Person>(
        (p1, p2) => (p1.Age < p2.Age) ? -1 : (p1.Age == p2.Age) ? 0 : 1);        

personList.Sort(comparer);

foreach (var item in personList)
{
    Console.WriteLine("Name {0} / Age {1}", item.Name, item.Age);                
}

Result:

Using this generic version of IEqualityComparer<T> and IComparer<T> has two benefits. The first benefit is that you dont need to implement objects that implement the interfaces over and over again. The second benefit is that you can use lambda expression syntax to generate the body of the Equals and Compare methodes.


implementing fold in c#

Reading about folding in haskell, i thought about implementing this in c#. First of all i want to describe right and left folding in haskell. Folding means that we are executing an operation to every element of a list. So in the next example we subtract all elements of a list. With left folding (foldl) we start at the beginning of the list with our operation, with right folding (foldr) we start at the end of the list with our operations. Here are two haskell examples for left folding and right folding.

foldl (-) [1,2,3,4,5]

> -13

foldr (-) [1,2,3,4,5]

> -5

As we see calculation of foldl starts at the left side of the list (1-2-3-4-5 = -13), foldr starts at the right side of the list (5-4-3-2-1 = -5). In c# left folding (foldl) is implemented in the Aggregate extension method. This example shows the use of Aggreagate:

var value = (new[] { 1, 2, 3, 4, 5 }).Aggregate((a, b) => a - b);
Assert.AreEqual(-13, value);

Now we start implementing FoldL. This first implementation works exact  as the Aggreagte method.

var value = (new[] { 1, 2, 3, 4, 5 }).FoldL((a, b) => a - b);
Assert.AreEqual(-13, value);

As we see in the implementation of FoldL we run through the list and calculate the value by calling the func delegate.

public static T FoldL<T>(this IEnumerable<T> list, Func<T, T, T> func)
{
    bool containsValue = false;
    T firstValue = default(T);
    bool firstValueSet = false;
    IEnumerator<T> iEnumerator = list.GetEnumerator();
    while(iEnumerator.MoveNext())
    {
        containsValue = true;
        T tempValue = iEnumerator.Current;
        if(firstValueSet)
            tempValue = func(firstValue, tempValue);
        firstValueSet = true;
        firstValue = tempValue;
    }
    if(containsValue)
        return firstValue;
    return default(T);
}

For the right folding there is no standard implentation in Linq. For that reason we will implement it here:

var value = (new[] { 1, 2, 3, 4, 5 }).FoldR((a, b) => a - b);
Assert.AreEqual(-5, value);
public static T FoldR<T>(this IEnumerable<T> list, Func<T, T, T> func)
{
    Stack<T> stack = new Stack<T>();
    T firstValue = default(T);
    bool containsValue = false;
    bool firstValueSet = false;
    T tempValue = default(T);
    foreach (var item in list)
    {
        stack.Push(item);
    }
    while (stack.Count() > 0)
    {
        firstValue = stack.Pop();
        containsValue = true;
        if (firstValueSet)
        {
            tempValue = func(tempValue, firstValue);
        }
        else
        {
            firstValueSet = true;
            tempValue = firstValue;
        }                
    }            
    if (containsValue)
        return tempValue;
    return default(T);
}

Now we want to build foldl and foldr a little more haskell style. For that we use a string that contains the operation (for example “-“) and we overtake the list we want to fold. So the calls for foldl and foldr are:

var result = Fold.Fold.FoldL("-", new[] { 1, 2, 3, 4, 5 });
Assert.AreEqual<int>(-13, result);
var result = Fold.Fold.FoldR("-", new[] { 1, 2, 3, 4, 5 });
Assert.AreEqual<int>(-5, result);

The implementation of foldL and foldR works with the dynamic generation of a expressiontree. In case which operation (“+”,”-“,”*”,”/”,”^”) is overtaken it generates a dynamic lambda function (Func<T,T,T>). This dynamic function is then called for every value in the list.

public static T FoldL<T>(string binaryOperator, T[] list)
{
    ParameterExpression startValueParameter = Expression.Parameter(typeof(T), "startValue");
    ParameterExpression listValueParameter = Expression.Parameter(typeof(T), "listValue");
    Func<T, T, T> func = Expression.Lambda<Func<T, T, T>>
        (CreateBinaryExpression(binaryOperator, startValueParameter, listValueParameter),
        new ParameterExpression[] { startValueParameter, listValueParameter }).Compile();
    if (list.Count() == 0)
    {
        return default(T);
    }
    else if (list.Count() == 1)
    {
        return list[0];
    }
    else
    {
        T tempValue = list[0];
        for (int i = 1; i < list.Count(); i++)
        {
            tempValue = func(tempValue, list[i]);
        }
        return tempValue;
    }
}

FoldR works like foldL but the list is iterated form the back to the front.

public static T FoldR<T>(string binaryOperator, T[] list)
{
    ParameterExpression startValueParameter = Expression.Parameter(typeof(T), "startValue");
    ParameterExpression listValueParameter = Expression.Parameter(typeof(T), "listValue");
    Func<T, T, T> func = Expression.Lambda<Func<T, T, T>>
        (CreateBinaryExpression(binaryOperator, startValueParameter, listValueParameter),
        new ParameterExpression[] { startValueParameter, listValueParameter }).Compile();
    if (list.Count() == 0)
    {
        return default(T);
    }
    else if (list.Count() == 1)
    {
        return list[0];
    }
    else
    {
        T tempValue = list[list.Count() - 1];
        for (int i = list.Count() - 2; i >= 0; i--)
        {
            tempValue = func(tempValue, list[i]);
        }
        return tempValue;
    }
}

Here is the CreateBinaryExpression method that converts the overtaken string (for example “-“) to a binary expression and adds the startValueParameter and the listValueParameter to the binaryExpression.

private static BinaryExpression CreateBinaryExpression(string binaryOperator, 
    ParameterExpression startValueParameter,
    ParameterExpression listValueParameter)
{
    BinaryExpression binaryExpression = null;
    switch (binaryOperator)
    {
        case "+":
            binaryExpression = Expression.Add(startValueParameter, listValueParameter);
            break;
        case "-":
            binaryExpression = Expression.Subtract(startValueParameter, listValueParameter);
            break;
        case "*":
            binaryExpression = Expression.Multiply(startValueParameter, listValueParameter);
            break;
        case "/":
            binaryExpression = Expression.Divide(startValueParameter, listValueParameter);
            break;
        case "^":
            binaryExpression = Expression.Power(startValueParameter, listValueParameter);
            break;
        default:
            throw new Exception("no binary Operator specified.");
    }
    return binaryExpression;
}

So one could argue that foldl and foldr in haskell are much more powerfull than just using addition, subtraction, multiblication, division and square and thats right but nevertheless i like that haskell style in c#.