英文:
Comparing Largish DataSets C# or SQL for OFAC
问题
我需要找到一种快速比较两个大型数据集的方法,我们正在比较一个名单(约31,000条记录)与OFAC的SDN名单(约30,000条记录,包括主要SDN名单和备用名称SDN名单的组合)。如果我算得对的话,这个组合将导致一个包含超过9亿条记录的数据表。根据我的推算,如果每次比较一个名字与OFAC SDN信息需要30秒,那么在我的个人电脑上完成这个过程将需要约10天的时间,使用Parallel.ForEach循环。我在我的测试中使用了一个C#控制台应用程序。数据存储在MS SQL数据库中。我可以选择在本地或者在SQL服务器上使用任何一个作为处理源。
请原谅我如果我的估计有误,但主要问题是,这将需要很长时间。
这是用于初始过程的。在随后的比较运行中,我将检查名字是否已经在结果表中进行了比较,如果已经比较过,我将跳过它。
我该如何加速初始比较过程,使其不要花费那么长时间?谢谢你的帮助和建议!
以下是我的控制台应用程序代码。
英文:
I need to find a way to compare 2 large datasets quickly, we are comparing a list of names (approximately 31,000 entries) against OFAC's SDN List (approximately 30,000 entries, a combination of the main SDN list and the alternate names SDN list). If I'm doing my math correctly, this combination would result in a datatable with over 900,000,000 entries. Based on my extrapolations it would take about 10 days for this to finish on my pc, using Parallel.ForEach loops, if it takes 30 seconds to compare 1 name against the OFAC SDN information. I'm using a C# console app for my testing. The data is stored in MS SQL database. I'm open to using either source for the processing, locally or on the SQL Server.
Please forgive me if my estimations are wrong, but the main crux is, this is going to take a long time.
This is for the initial process. In subsequent comparison runs, I'll check the names against a results table to see if the comparison has already been done and skip it if it's already been done.
How do I speed this up so the initial comparison process does not take so long??? Thanks for your help and tips!!
Here's my console app code.
List<CUSTOMER> CUSTOMERList = conn.CUSTOMERs.Take(1).ToList();
List<OFAC_Main> ofacMainList = conn.OFAC_Main.Include("OFAC_Alternate").Include("OFAC_Address").Include("Ofac_RemarkComment").ToList();
int minimumCharacterCount = Convert.ToInt32(Setting.SelectByName("MinimumCharacters").Value);
//loop through the CUSTOMER list
Parallel.ForEach(CUSTOMERList, CUSTOMER =>
{
//find the values in the possible name fields in each CUSTOMER record
List<string> CUSTOMERNames = GetNames(CUSTOMER);
//loop through the names found in the CUSTOMER record
Parallel.ForEach(CUSTOMERNames, CUSTOMERName =>
{
//if the CUSTOMER name is longer than the minimum character count setting, process it
if (CUSTOMERName.Length > minimumCharacterCount)
{
//loop through the SDN list and process the name
Parallel.ForEach(ofacMainList, ofacName =>
{
//convert List to json object
//string ofacJson = JsonConvert.SerializeObject(ofacName);
//if the SDN name is longer than the minimum character count setting, process it
if (ofacName.SDN_Name.Length > minimumCharacterCount)
{
Compare(CUSTOMERName.ToString(), ofacName.SDN_Name.ToString());
}
//if the SDN Name has Alternate names too, need to run the comparison against it too
if (ofacName.OFAC_Alternate.Count > 0)
{
foreach (OFAC_Alternate alternateName in ofacName.OFAC_Alternate)
{
//if the SDN Alternate name is longer than the minimum character count setting, process it
if (alternateName.Alt_Name.Length > minimumCharacterCount)
{
Compare(CUSTOMERName.ToString(), alternateName.Alt_Name.ToString());
}
}
}
});
}
});
});
答案1
得分: 1
我同意!你需要将这个问题分解成小块。一个想法是在SQL中进行连接和筛选,然后当你有一个包含所有潜在比较的表时,开始循环执行你的Compare()函数。你需要的连接称为笛卡尔连接,如果不指定任何连接条件,就会得到这种连接方式。这样,你可以将Compare()函数的结果存储在Comparisons表中,并且丢弃任何不感兴趣的比较。所以类似这样...
CREATE table Comparisons (...)
INSERT into Comparisons(...) (
SELECT C.Name, O.OFAC_Alternate, OFAC_Address, OFAC_RemarkComment
FROM Customers C, OFAC_Main O
WHERE len(C.Name) > @minimumCharacterCount
and(
len(O.OFAC_Name) > @minimumCharacterCount
or len(O.OFAC_Alternate > @minimumCharacterCount)
)
-- 你可以在这里设置一些任意的筛选条件,将这个大表分割成小块。
and C.id < 1000
)
这部分不会花太多时间。问题在于Compare()函数,我不知道你要如何加速它。
英文:
I agree ! You need to chop this down into small pieces. One idea would be to do the joining and filtering in SQL, then, when you have a table with all the potential comparisons, start looping through doing your Compare(). The join you need is called a cartesian join, and it's what you get if you don't specify any join criteria. That way, you can store the result of Compare() in the Comparisons table, and throw out any comparisons that are not interesting. so something like this...
CREATE table Comparisons (...)
INSERT into Comparisons(...) (
SELECT C.Name, O.OFAC_Alternate, OFAC_Address, OFAC_RemarkComment
FROM Customers C, OFAC_Main O
WHERE len(C.Name) > @minimumCharacterCount
and(
len(O.OFAC_Name) > @minimumCharacterCount
or len(O.OFAC_Alternate > @minimumCharacterCount)
)
-- you can set some arbitrary filter here to chop this mega table into smaller pieces.
and C.id < 1000
)
This part won't take long. It's the Compare() function that will kill you. I have no idea how you're going to speed that up.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论