Reading a csv file into a DataTable with Linq

Some times ago i had the requirement to read a csv file into a DataTable and doint it the classical iterative way seems boring to me so i thought about the make all with one linq statement princip and tried to accomplish the whole processing with one linq statement. That are my results i want to share. I stated with the following statement.

public DataTable ImportFirstAdept(FileInfo fileInfo, char splitChar, Encoding encoding)
{
    var plainfileName = Path.GetFileName(fileInfo.FullName);
    DataTable dataTable = new DataTable(plainfileName);
    var allRows = ((Func)(() => File.ReadAllLines(fileInfo.FullName, encoding)))();
    allRows.
        Select(str => str.Split(splitChar)).
        First().
        Select(columnHeader => { return new DataColumn(columnHeader.Trim()); }).
        Foreach(column =>
        {
            dataTable.Columns.Add(column);
            return column;
        }).
        ToList();
    allRows.
        Select(str => str.Split(splitChar)).
        Skip(1).
        Select(strArray =>
        {
            var strings = new List();
            foreach (var s in strArray)
                strings.Add(s.Trim());
            return strings.ToArray();
        }).
        Select(st => dataTable.Rows.Add(st)).
        ToList();
    return dataTable;
}

The follwoing test shows how it works.

public void CsvToDataTableTest(string path)
{
    DirectoryInfo directoryInfo = new DirectoryInfo(path);            
    CsvImport cvsImport = new CsvImport();
    DataSet dataSet = new DataSet();
    if (directoryInfo.Exists)
    {
        foreach (var file in directoryInfo.EnumerateFiles("*.csv"))
        {
            DataTable dataTable = cvsImport.ImportFirstAdept(file, '\t', Encoding.Default);
            dataSet.Tables.Add(dataTable);
        }
    }
}

[TestMethod]
public void CsvToDataTablePrequelsTest()
{
    CsvToDataTableTest(@"CsvToDataTableWithLinq\bin\Debug");
}

If we look at the ImportFirstAdept method it does not use one linq statement it uses two linq statements. So that was not sufficiently and i wrote some extension methodes to only use one Linq statement for the processing of the csv file.

public static class CsvImportExtension
{
    public static void If(this IEnumerable source, Func ifFunc, 
        Action ifAction, Action elseAction)
    {
        var counter = 1;
        List ifActions = new List();
        List elseActions = new List();
        foreach (var item in source)
        {
            if (ifFunc(item, counter))                                    
            {
                ifActions.Add(item);
            }                           
            else
                elseActions.Add(item);
            counter++;                
        }
        ifAction(ifActions.AsEnumerable());
        elseAction(elseActions.AsEnumerable());
    }

    public static void If(this IEnumerable source, Func ifFunc,
        Action ifAction, Action elseAction)
    {
        List ifActions = new List();
        List elseActions = new List();
        foreach (var item in source)
        {
            if (ifFunc(item))
            {
                ifActions.Add(item);
            }
            else
                elseActions.Add(item);
        }
        ifAction(ifActions.AsEnumerable());
        elseAction(elseActions.AsEnumerable());
    }

    public static IEnumerable Foreach(this IEnumerable source, Func foreachFunc)
    {
        foreach (var item in source)
        {
            yield return foreachFunc(item);
        }
    }

    public static void Foreach(this IEnumerable source, Action foreachAction)
    {
        foreach (var item in source)
        {
            foreachAction(item);
        }
    }
}

That is the ImportSecondAdept method that uses the if and foreach extension methodes to only use one linq statement to process the csv file.

public DataTable ImportSecondAdept(FileInfo fileInfo, char splitChar, Encoding encoding)
{
    var plainfileName = Path.GetFileName(fileInfo.FullName);
    var dataTable = new DataTable(plainfileName);
    ((Func)(() => File.ReadAllLines(fileInfo.FullName, encoding)))().
        Foreach(str => str.Split(splitChar)).
        If((t, i) => i == 1,
        t =>
        {
            t.First().
            Foreach(columnHeader => { return new DataColumn(columnHeader.Trim()); }).
            Foreach(column =>
            {
                dataTable.Columns.Add(column);
            });                    
        },
        t =>
        {
            t.
            Foreach(strArray =>
            {
                var strings = new List();
                foreach (var s in strArray)
                    strings.Add(s.Trim());
                return strings.ToArray();
            }).
            Foreach(str => dataTable.Rows.Add(str)).
            ToList();                    
        }
        );           
    return dataTable;
}

That is the test to try it and the result of the second adept method.

public void CsvToDataTableTest(string path)
{
    DirectoryInfo directoryInfo = new DirectoryInfo(path);            
    CsvImport cvsImport = new CsvImport();
    DataSet dataSet = new DataSet();
    if (directoryInfo.Exists)
    {
        foreach (var file in directoryInfo.EnumerateFiles("*.csv"))
        {
            DataTable dataTable = cvsImport.ImportSecondAdept(file, '\t', Encoding.Default);
            dataSet.Tables.Add(dataTable);
        }
    }
}

[TestMethod]
public void CsvToDataTablePrequelsTest()
{
    CsvToDataTableTest(@"CsvToDataTableWithLinq\bin\Debug");
}

csvToDataSetWithLinq

So now it is only one linq statement and the one linq statement princip is suffused.

 

Advertisements

Distinct a DataTable with c#

Today i want to write about the different possibilities to distinct the data in a DataTable. I had that problem some times ago and searched for a good solution. I found some different solutions and in that post i will show them. First i want to show the test i wrote to test the distinct of the DataTable and the data stored in the DataTable.

[TestMethod]
public void Distinct()
{
    DataTable dataTable = new DataTable("Distinct");
    var oldPK = new DataColumn() { ColumnName = "PK", DataType = typeof(int) };
    dataTable.Columns.Add(oldPK);
    var secondPK = new DataColumn() { ColumnName = "secondPK", DataType = typeof(int) };
    dataTable.Columns.Add(secondPK);
    dataTable.Columns.Add(new DataColumn() 
        { ColumnName = "firstname", DataType = typeof(string) });
    dataTable.Columns.Add(new DataColumn() 
        { ColumnName = "secondname", DataType = typeof(string) });
    dataTable.Columns.Add(new DataColumn() 
        { ColumnName = "socialSecurityNumber", DataType = typeof(int) });
    dataTable.PrimaryKey = new DataColumn[] { oldPK, secondPK };
    foreach (var number in Enumerable.Range(1, 3))
    {
        DataRow dataRow = dataTable.NewRow();
        dataRow["PK"] = number;
        dataRow["secondPK"] = number + 1;
        dataRow["firstname"] = "mathias";
        dataRow["secondname"] = "schober";
        dataRow["socialSecurityNumber"] = 4040;
        dataTable.Rows.Add(dataRow);
    }
    foreach (var number in Enumerable.Range(4, 2))
    {
        DataRow dataRow = dataTable.NewRow();
        dataRow["PK"] = number;
        dataRow["secondPK"] = number + 1;
        dataRow["firstname"] = "anna";
        dataRow["secondname"] = "schober";
        dataRow["socialSecurityNumber"] = 5050;
        dataTable.Rows.Add(dataRow);
    }
    var newDataTable = 
        dataTable.Distinct("socialSecurityNumber");
    var newDataViewDataTable = 
        dataTable.DistinctDataView("socialSecurityNumber");
    var newEqualityCompareDataTable = 
        dataTable.DistinctEqualityCompare("socialSecurityNumber");
    var newGenericEqualityCompareDataTable = 
        dataTable.DistinctGenericEqualityCompare("socialSecurityNumber");         
}

DistinctDataTable
As we see i wrote four different methods to distinct the DataTable. The first on Distinct is the classical way. I create a new DataTable, take the original DataTable, enumerate all rows and check if the value of the overtaken column (in the test i use socialSecurityNumber) is already in the new DataTable. Then i set the primary key of the new DataTable and reorder the columns to bring the primary key (overtaken column) to the first position. Here you see the code to do so.

public static class Distincting
{
    public static DataTable Distinct(this DataTable dataTable, 
        string columName)
    {
        DataColumn[] dataColumns = dataTable.PrimaryKey;
        if (dataColumns.Length >= 1)
        {
            foreach (DataColumn primaryKey in dataColumns)
            {
                if (primaryKey.ColumnName == columName)
                {
                    return dataTable;
                }                    
            }                
        }            
        return RebuildDataTable(dataTable, columName);
    }

    private static DataTable RebuildDataTable(DataTable dataTable, 
        string columName)
    {
        DataTable newDataTable = dataTable.Clone();
        DataColumn newDataColumn = newDataTable.Columns[columName];
        newDataTable.PrimaryKey = new DataColumn[] { newDataColumn };
        newDataTable.Columns[columName].SetOrdinal(0);
        dataTable.Rows.Foreach(dataRow =>
        {
            object pk = dataRow[columName];
            if (newDataTable.Rows.Find(pk) == null)
            {
                newDataTable.ImportRow(dataRow);
            }
        });
        ReorderPrimaryKey(dataTable, newDataTable);
        return newDataTable;
    }

    private static void ReorderPrimaryKey(DataTable dataTable, 
        DataTable newDataTable)
    {
        int maxOrdinal = 0;
        foreach (DataColumn column in newDataTable.Columns)
        {
            if (column.Ordinal > maxOrdinal)
                maxOrdinal = column.Ordinal;
        }
        DataColumn[] dataColumns = dataTable.PrimaryKey;
        if (dataColumns.Length >= 1)
        {
            foreach (DataColumn primaryKey in dataColumns)
            {
                newDataTable.Columns[primaryKey.ColumnName].
                    SetOrdinal(maxOrdinal);
            }
        }
    }
}
public static class ForEachEnumeration
{
    public static void Foreach(this DataRowCollection dataRows, 
Action action)
    {
        foreach(T dataRow in dataRows)
        {
            action(dataRow);
        }
    }
}

After distincting with the social sercurity number we have the following DataTable.
DistinctDataTableVersion1

The second possibility to distinct a DataTable is using the ToTable method of the DataView. In the next code part you can see how that works.

public static DataTable DistinctDataView(this DataTable dataTable,
        string columName)
{
    DataView view = new DataView(dataTable);
    return view.ToTable(true, new string[] { columName });
}

That is the resulting DataTable. As we see it contains only the social security number. The other columns are gone.
DistinctDataTableVersion2

After the classic version and the DataView version it is also possible to solve the problem with the linq Distinct Extension method.

public static DataTable DistinctEqualityCompare(this DataTable dataTable, 
    string columName)
{
    DataRowEqaulityComparer dataRowEqualityComparer = 
        new DataRowEqaulityComparer(columName);
    return dataTable.AsEnumerable().
        Distinct(dataRowEqualityComparer).CopyToDataTable();
}

To accomplish that we need a EqualityComparer.

public class DataRowEqualityComparer : IEqualityComparer
{
    private string distinctColumnName;

    public DataRowEqualityComparer(string distinctColumnName)
    {
        this.distinctColumnName = distinctColumnName;
    }

    public bool Equals(DataRow x, DataRow y)
    {
        if (x.Field<object>(distinctColumnName).
            Equals(y.Field<object>(distinctColumnName)))
        {
            return true;
        }
        return false;
    }

    public int GetHashCode(DataRow obj)
    {
        return obj.Field<object>(distinctColumnName).GetHashCode();
    }
}

Or we could also use a Generic EqualityComparer. (implementing a generic IEqualityComparer and IComparer class)

public static DataTable DistinctGenericEqualityCompare(this DataTable dataTable,
    string columName)
{
    EqualityComparer dataRowEqualityComparer = 
        new EqualityComparer(
        (x, y) => x.Field<object>(columName).Equals(y.Field<object>(columName)),
        x => x.Field<object>(columName).GetHashCode()
        );
    return dataTable.AsEnumerable().
        Distinct(dataRowEqualityComparer).CopyToDataTable();
}
public class EqualityComparer : IEqualityComparer
{
    private Func<T, T, bool> equalsFunction;
    private Func<T, int> getHashCodeFunction;

    public EqualityComparer(Func<T, T, bool> equalsFunction)
    {
        this.equalsFunction = equalsFunction;
    }

    public EqualityComparer(Func<T, T, bool> equalsFunction,
        Func<T, int> getHashCodeFunction)
        : this(equalsFunction)
    {
        this.getHashCodeFunction = getHashCodeFunction;
    }

    public bool Equals(T a, T b)
    {
        return equalsFunction(a, b);
    }

    public int GetHashCode(T obj)
    {
        if (getHashCodeFunction == null)
            return obj.GetHashCode();
        return getHashCodeFunction(obj);
    }
}

That is the resulting DataTable.
DistinctDataTableVersion3
That are all possibilities i could think of to distinct a DataTable.