The data source is a collection of CSV files so there is no actual database. This is an integration with a decades-old legacy system based in Japan.
I have a c# function that needs to take 2 DataTables
and 2 column names as params. My function needs to do the equivalent of an INNER JOIN
on these 2 datatables, and then return all columns from the first table, and only the "joined column" from the second table.
The schemas (read: columns) of these datatables won't be known until runtime, so the function cannot have any hardcoded column names. My function at last needs to return a new DataTable with the inner-joined data, and a DISTINCTed resultset based on the select list as just specified.
Here is my [modified] attempt, which seems to produce a promising resultset:
public static DataTable JoinDataTables2(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField) {
DataTable result = ( from dataRows1 in dt1.AsEnumerable()
join dataRows2 in dt2.AsEnumerable()
on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField)
select dataRows1).CopyToDataTable();
return result;
}
I call it like this:
Common.JoinDataTables2(dtCSV, _dtModelOptions, "CMODEL", "ModelID");
My objective is to perform an inner join like in a physical database, with a distincted resultset based on the resultset specified above. You may wonder why I'm not simply doing the join in the database. It's because there is no database; the data comes from CSV files generated from a third-party system.
So I have 3 remaining problems:
- I'm unsure whether the resultset I'm getting back is correct based on INNER JOIN behavior.
- The select list doesn't include the "join column" for the 2nd datatable (in this particular example, that would be "ModelID"), and I need it to. Once it does that, I can confirm that the CMODEL values match the ModelID values, and thus confirm that I have a valid join. (That's just 1 scenario, it will be different, so no column names can be hardcoded in the function.)
- How to DISTINCT the resultset?
Here's one concrete example from my system, but again, the datatables and schemas will all be different:
dtCSV columns:
- CMODEL
- CATT_CD
- NSTAND
- CAPPLY1
- CAPPLY2
- DREFIX_D
_dtModelOptions columns:
- SeriesID
- ModelID
- OptionID
What changes do I need for my function so that: - It does an INNER JOIN and a DISTINCT (is it doing this already?)
- It selects all columns from the first table and only the "join-column" from the second table(currently it only gets the first table's columns)
- Performance is as fast as possible (I was previously
foreach
ing through records to achieve the join, and that approach was terribly slow.)
4条答案
按热度按时间yv5phkfx1#
早期解决方案...
您可以在选择查询中列出表1中的所有列。以下查询具有预先定义的DataTable,其中所有列都来自表1,只有键列来自表2。这可能会对您有所帮助。
lvjbypge2#
这里有一点模糊,但据我所知,您需要对两个表执行
Join
,并在对Join
的结果应用Distinct()
后从两个表(或更少)中获取一行。下面是我的解决方案:
1.添加一个
Result
类来 PackageJoin
的结果1.添加一个
ResultComparer
类,帮助您使用自己的逻辑获得Distinct()
结果1.更新方法以使用上述类
t1qtbnec3#
1.基于INNER JOIN行为,我不确定返回的结果集是否正确。
linq查询返回的结果集精确地表示您在查询中写入的内容。
1.选择列表不包括第二个数据表的"join column"(在这个特定的例子中,应该是"ModelID"),而我需要它。
答案很简单:您的查询只返回第一个数据表中的数据(顺便说一句:您已经在问题描述中提到过)。
一旦它完成了这个操作,我就可以确认CMODEL值与ModelID值匹配,从而确认我有一个有效的连接。(这只是一个场景,情况会有所不同,因此没有列名可以硬编码到函数中。)
您可以确保Linq2DataSet查询返回正确的ID。它们必须匹配才能联接它们。如果没有匹配,结果集将为空!看来,您必须提高有关联接的知识。请阅读这篇优秀的文章:Visual Representation of SQL Joins
相关文章的简短版本:
1.如何辨别结果集?
有一个Distinct method。
但我不确定你是否真的需要这个(
读取分隔文件(*. csv)的方法有以下几种:
请参阅:A Fast CSV Reader
Select()
若要能够从链接数据表投影字段,您必须使用:
如果要使用Linq创建动态列,请参见:使用LINQ查询具有动态列名的数据表
请参阅:
Using OleDb To Import Text Files tab CSV Custom
读取文本文件特定列
扩展方法可能如下所示:
lymnna714#
If every CSV-file represents one table of your database, then Consider doing something similar as entity framework.
Instead of
IQueryable<...>
let yourDbSets
implementIEnumerable<...>
If you only need to fetch the data, this will be fairly easy. If you also want to update, you'll need to implement (or re-use) a
DbChangeTracker
You'll also need a DbContext that holds all your DbSets:
You can improve performance by remembering already fetched items. Put them in a Dicationary, use the primary key as dicationary key. Also add a
Find
function to the DbSet:It's most easy if every table item has the same type of primary key:
If not, you'll need to tell your DbSet the type of primary key:
If you decide to keep your items in a Dictionary, then let your GetEnumerator first return the already fetchedItems, before fetching new rows from your CSV-file.
Add / Update / Remove items
For this, you need to be able to Add / Update / Remove items from your CsVFile. I assume there are already functions for that.
To do the update efficiently you'll need something similar to DbContext.SaveChanges. Let every DbSet remember what items to Add / Remove / Update, using a ChangeTracker:
You'll need a Find that remembers the original value:
Finally your SaveChanges()
Apparently if you want to be able to update items in your database you need to be able to check whether items have changed. You'll need an
IEqualityComparer<T>
that checks by valueDbSet SaveChanges:
DbContext SaveChanges: