如何进行LINQ联接,使其行为与物理数据库内部联接完全相同?

w41d8nur  于 2022-12-06  发布在  其他
关注(0)|答案(4)|浏览(109)

The data source is a collection of CSV files so there is no actual database. This is an integration with a decades-old legacy system based in Japan.
I have a c# function that needs to take 2 DataTables and 2 column names as params. My function needs to do the equivalent of an INNER JOIN on these 2 datatables, and then return all columns from the first table, and only the "joined column" from the second table.
The schemas (read: columns) of these datatables won't be known until runtime, so the function cannot have any hardcoded column names. My function at last needs to return a new DataTable with the inner-joined data, and a DISTINCTed resultset based on the select list as just specified.
Here is my [modified] attempt, which seems to produce a promising resultset:

public static DataTable JoinDataTables2(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField) {
   DataTable result = ( from dataRows1 in dt1.AsEnumerable()
                        join dataRows2 in dt2.AsEnumerable()
                        on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField)
                        select dataRows1).CopyToDataTable();
   return result;
}

I call it like this:

Common.JoinDataTables2(dtCSV, _dtModelOptions, "CMODEL", "ModelID");

My objective is to perform an inner join like in a physical database, with a distincted resultset based on the resultset specified above. You may wonder why I'm not simply doing the join in the database. It's because there is no database; the data comes from CSV files generated from a third-party system.
So I have 3 remaining problems:

  1. I'm unsure whether the resultset I'm getting back is correct based on INNER JOIN behavior.
  2. The select list doesn't include the "join column" for the 2nd datatable (in this particular example, that would be "ModelID"), and I need it to. Once it does that, I can confirm that the CMODEL values match the ModelID values, and thus confirm that I have a valid join. (That's just 1 scenario, it will be different, so no column names can be hardcoded in the function.)
  3. How to DISTINCT the resultset?
    Here's one concrete example from my system, but again, the datatables and schemas will all be different:

dtCSV columns:

  1. CMODEL
  2. CATT_CD
  3. NSTAND
  4. CAPPLY1
  5. CAPPLY2
  6. DREFIX_D

_dtModelOptions columns:

  1. SeriesID
  2. ModelID
  3. OptionID
    What changes do I need for my function so that:
  4. It does an INNER JOIN and a DISTINCT (is it doing this already?)
  5. It selects all columns from the first table and only the "join-column" from the second table(currently it only gets the first table's columns)
  6. Performance is as fast as possible (I was previously foreach ing through records to achieve the join, and that approach was terribly slow.)
yv5phkfx

yv5phkfx1#

早期解决方案...

public static DataTable JoinDataTables2(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField) {
   DataTable result = ( from dataRows1 in dt1.AsEnumerable()
                            join dataRows2 in dt2.AsEnumerable()
                            on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField) 
                            select new {Col1= datarows1Field<string>(table1FieldName), Col2= datarows2.Field<string>(table2FieldName)}).Distinct().CopyToDataTable();
   return result;
}

您可以在选择查询中列出表1中的所有列。以下查询具有预先定义的DataTable,其中所有列都来自表1,只有键列来自表2。这可能会对您有所帮助。

public static DataTable JoinDataTables2(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField)
{
    DataTable joinTable = new DataTable();
    foreach (DataColumn dt1Column in dt1.Columns)
    {
        joinTable.Columns.Add(dt1Column.ColumnName, dt1Column.DataType);
    }

    var col2 = dt2.Columns[table2KeyField];
    joinTable.Columns.Add(col2.ColumnName,typeof(string));

    var result = (from dataRows1 in dt1.AsEnumerable()
                  join dataRows2 in dt2.AsEnumerable()
                      on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField)
                  select new
                  {
                      Col1 = dataRows1,
                      Col2 = dataRows2.Field<string>(table2KeyField)
                  });
    foreach (var row in result)
    {
        DataRow dr = joinTable.NewRow();
        foreach (DataColumn dt1Column in dt1.Columns)
        {
            dr[dt1Column.ColumnName] = row.Col1[dt1Column.ColumnName];
        }

        dr[table2KeyField] = row.Col2;
        joinTable.Rows.Add(dr);
    }
    joinTable.AcceptChanges();
    return joinTable.AsEnumerable().Distinct().CopyToDataTable();
}
lvjbypge

lvjbypge2#

这里有一点模糊,但据我所知,您需要对两个表执行Join,并在对Join的结果应用Distinct()后从两个表(或更少)中获取一行。
下面是我的解决方案:
1.添加一个Result类来 Package Join的结果

public class Result
{
    public DataRow Table1Row { get; set; }
    public DataRow Table2Row { get; set; }

    public string DistictFieldValue { get; set; }
}

1.添加一个ResultComparer类,帮助您使用自己的逻辑获得Distinct()结果

public class ResultComparer : IEqualityComparer<Result>
{
    public bool Equals(Result x, Result y)
    {
        // Your logic to get distinct elements
        return x.DistictFieldValue == y.DistictFieldValue;
    }

    public int GetHashCode(Result obj)
    {
        return 0; // To enforce the Equals() gets callled.
    }
}

1.更新方法以使用上述类

public static DataTable JoinDataTables2(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField)
{
    // Join with final selection containing rows from both the tablles
    var query = from dataRows1 in dt1.AsEnumerable()
                join dataRows2 in dt2.AsEnumerable()
                    on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField)
                select new Result
                {
                    Table1Row = dataRows1,
                    Table2Row = dataRows2,
                    DistictFieldValue = dataRows2[table2KeyField].ToString() // This could be anything else, even passed as an argument to the method
                };

    // Dictinct on the results above
    var queryWithDistictResults = query.Distinct(new ResultComparer());

    // Write your logic to convert the Results Collection to a single data table with whatever columns you want
    DataTable result = queryWithDistictResults // <= YOUR LOGIC HERE

    return result;
}
t1qtbnec

t1qtbnec3#

  • [更新#3]*

1.基于INNER JOIN行为,我不确定返回的结果集是否正确。
linq查询返回的结果集精确地表示您在查询中写入的内容。
1.选择列表不包括第二个数据表的"join column"(在这个特定的例子中,应该是"ModelID"),而我需要它。
答案很简单:您的查询只返回第一个数据表中的数据(顺便说一句:您已经在问题描述中提到过)。
一旦它完成了这个操作,我就可以确认CMODEL值与ModelID值匹配,从而确认我有一个有效的连接。(这只是一个场景,情况会有所不同,因此没有列名可以硬编码到函数中。)
您可以确保Linq2DataSet查询返回正确的ID。它们必须匹配才能联接它们。如果没有匹配,结果集将为空!看来,您必须提高有关联接的知识。请阅读这篇优秀的文章:Visual Representation of SQL Joins
相关文章的简短版本:

    • 左联接**
Set1 = [1, 2, 3, 5]
Set2 = [2, 4, 5]
Resultset = [1,2,5] //get [1] from left (set1), [2,5] are common items (set1 and set2)
    • 内部联接**
Set1 = [1, 2, 3, 5]
Set2 = [2, 4, 5]
Resultset = [2,5] //only common items (set1 and set2)
    • 右联接**
Set1 = [1, 2, 3, 5]
Set2 = [2, 4, 5]
Resultset = [2,4,5] // gets [2] from right (set2), [4,5] are common (set1 and set2)
    • 交叉联接**
cross join returns the cartesian product of the sets

1.如何辨别结果集?
有一个Distinct method
但我不确定你是否真的需要这个(

    • 一般注意事项:**

读取分隔文件(*. csv)的方法有以下几种:

    • 1)**使用"标准"读取文本文件方法并在[for]循环中将文本拆分为多个部分

请参阅:A Fast CSV Reader

    • 2)**使用LINQ方法,即:Select()
    • 注:**大多数程序员都知道,在处理大数据集时,linq方法比[for]循环慢得多。

若要能够从链接数据表投影字段,您必须使用:

select new {datarows1, datarows2}

如果要使用Linq创建动态列,请参见:使用LINQ查询具有动态列名的数据表

请参阅:
Using OleDb To Import Text Files tab CSV Custom
读取文本文件特定列
扩展方法可能如下所示:

public static DataTable OleDbJoin(string csv1, string csv2, string key1, string key2)
{
    DataTable dt = new DataTable();

    string sConn = string.Format(@"Provider=Microsoft.Jet.OLEDB.4.0;Data Source={0}\;Extended Properties='text;HDR=No;FMT=CSVDelimited()';", Path.GetDirectoryName(csv1));
    string sSql = string.Format(@"SELECT T.*
        FROM (
            SELECT * FROM [{0}] AS t1
            INNER JOIN (SELECT * FROM [{1}]) AS t2
                ON t1.[{2}] = t2.[{3}]) AS T;",
            Path.GetFileName(csv1), Path.GetFileName(csv2), key1, key2);

    try
    {
        using (OleDbConnection oConn = new OleDbConnection(sConn))
        {
            using (OleDbCommand oComm = new OleDbCommand(sSql, oConn))
            {
                oConn.Open();
                OleDbDataReader oRdr = oComm.ExecuteReader();
                dt.Load(oRdr);
                oComm.Dispose();
                oRdr.Dispose();
                oConn.Close();
                oConn.Dispose();
            }
        }
    }
    catch(OleDbException ex)
    {
        Console.WriteLine(ex.Message);
    }
    catch(Exception ex)
    {
        Console.WriteLine(ex.Message);
    }

    return dt;
}
  • 电话号码:*
DataTable resultDt = OleDbJoin("FullFileName1", "FullFileName2", "F1", "F2");
  • 要求:*
  • 两个csv文件必须在同一目录中
  • csv文件使用csv文件的标准分隔符,请参见:Schema.ini file
  • 文件中没有标题(没有列名)
lymnna71

lymnna714#

If every CSV-file represents one table of your database, then Consider doing something similar as entity framework.
Instead of IQueryable<...> let your DbSets implement IEnumerable<...>
If you only need to fetch the data, this will be fairly easy. If you also want to update, you'll need to implement (or re-use) a DbChangeTracker

public DbSet<T> : IEnumerable<T> where T: class
{
    public FileInfo CsvFile {get; set;}

    public IEnumerator<T> GetEnumerator()
    {
        return this.ReadCsvFile().GetEnumerator();
    }
    IEnumerator IEnumerable.GetEnumerator()
    {
        return this.GetEnumerator();
    }

    protected IEnumerable<T> ReadCsvFile()
    {
        // open the CsvFile, read the lines and convert to objects of type T
        // consider using Nuget package CsvHelper
        ...
        foreach (var csvLine in csvLines)
        {
            T item = Create<T>(csvLine); // TODO: write how to convert a line into T
            yield return T;
        }
    }
}

You'll also need a DbContext that holds all your DbSets:

class DbContext
{
      public DbSet<School> Schools {get; } = new DbSet<School>{CsvFile = ...};
      public DbSet<Teacher> Teachers {get; } = new DbSet<Teacher> {CsvFile = ...};
      public DbSet<Student> Students {get; } = new DbSet<Student> {CsvFile = ...};
}

You can improve performance by remembering already fetched items. Put them in a Dicationary, use the primary key as dicationary key. Also add a Find function to the DbSet:

class DbSet<T> : IEnumerable<T>
{
    private readonly Dictionary<int, T> fetchedItems = new Dictionary<int, T>();

    public T Find(int id)
    {
        if (!fetchedItems.TryGetValue(id, out T fetchedItem))
        {
            // fetch elements using ReadCsvFile and put them in the Dictionary
            // until you found the item with the requested primary key
            // or until the end of your sequence
        }
        return fetchedItem;
    }
}

It's most easy if every table item has the same type of primary key:

interface IPrimaryKey
{
     int Id {get;}
}

class DbSet<T> : IEnumerable<T> where T : IPrimaryKey {...}

If not, you'll need to tell your DbSet the type of primary key:

class DbSet<T, TKey> : IEnumerable<T> where T : class
{
     private readonly Dictinary<TKey, T> fetchedItems = ...
}

If you decide to keep your items in a Dictionary, then let your GetEnumerator first return the already fetchedItems, before fetching new rows from your CSV-file.

Add / Update / Remove items

For this, you need to be able to Add / Update / Remove items from your CsVFile. I assume there are already functions for that.
To do the update efficiently you'll need something similar to DbContext.SaveChanges. Let every DbSet remember what items to Add / Remove / Update, using a ChangeTracker:

class Entity<T> where T : IPrimaryKey
{
    public T Value {get; set;}
    public T OriginalValue {get; set;}
}

class ChangeTracker<T, TKey> where T: ICloneable
{
    readonly Dictionary<int, Entity<T, TKey>> fetchedEntities = new Dictionary<int, Entity<T, TKey>>
    readonly List<T> itemsToAdd = new List<T>();

    public T Add(T item)
    {
        // TODO: check for not NULL, and Id == 0
        this.ItemsToAdd.Add(itemToAdd);
        return item;
    }
    public void Remove(T item)
    {
        // TODO: check not null, and primary key != 0
        Entity<T> entityToRemove = Find(item.Id);
        // TODO: decide what to do if there is no such item
        entityToRemove.Value = null;
        // null indicates it is about to be removed
    }

You'll need a Find that remembers the original value:

public Entity<T> Find(TKey primaryKey)
{
    // is it already in the Dictionary (found before)?
    // if not: get it from the CsvDatabase and put it in the dictionary
    if (!fetchedItems.TryGetValue(primaryKey, out Entity<T> fetchedEntity))
    {
        // not fetched yet, fetch if from your Csv File
        T fetchedItem = ...
        // what to do if does not exist?
        // add to the dictionary:
        fetchedEntities.Add(new Entity<T>
        {
            value = fetchedItem,
            originalValue = (T)fetchedItem.Clone(),
            // so if value changes, original does not change
        });
    }
    return fetchedItem;
}

Finally your SaveChanges()

void SaveChanges()
{
    // your CsvFile database has functions to add / update / remove items
    foreach (var itemToAdd in itemsToAdd)
    {
        csvDatabase.Add(itemToAdd);
    }

    // update or remove fetched items with OriginalValue unequal to Value
    var itemsToUpdate = this.fetchedItems
        .Where(fetchedItem => !ValueComparer.Equals(fetchedItem.OriginalValue, fetchedItem.Value)
        .ToList();

    foreach (Entity<T> itemToUpdate in itemsToUpdate)
    {
        if (itemToUpdate.Value == null)
        {   // remove
            csvFile.Remove(itemToUpdate.OriginalValue);
        }
        else
        {   // update
            csvFile.Update(...);
        } 
    }
}

Apparently if you want to be able to update items in your database you need to be able to check whether items have changed. You'll need an IEqualityComparer<T> that checks by value

class DbChangeTracker<T, TKey> : IEnumerable<T> where T : class
{
     public IEqualityComparer<T> ValueComparer {get; set;}
     ...
}

DbSet SaveChanges:

void SaveChanges()
{
    this.ChangeTracker.SaveChanges();
}

DbContext SaveChanges:

Students.SaveChanges()
Teachers.SaveChanges();
Schools.SaveChanges();

相关问题