Any way to speed up this string heavy query?

Question

var addressDuplicates = allBusinessAddressIds.AsEnumerable().AsParallel().
    Where(ba => currentBusinessAddress.AddressLine1.ApproximatelyEquals(ba.AddressLine1, options, tolerance)
                 && currentBusinessAddress.City.Equals(ba.City) 
                 && ba.CountyCode.Equals(currentBusinessAddress.CountyCode)
                 && currentBusinessAddress.Zip.Equals(ba.Zip)
                 && ba.BusinessAppId != appId).
    ToList();

The above query is designed to find duplicates based on the city, zip, county code, and a fuzzy match on the first line of the address. However, this query takes around 20 seconds to complete, and it is hit 6695 times with 6635 values. Any thoughts on speeding this up?

It is within a Parallel.ForEach loop as well for additional speed. My theory is that the string comparisons (all of those are strings, including the zip) are what is slowing down the query.

The allBusinessAddressIds is not an EF object, but rather a custom object I created. The currentBusinessAddress, however, is an EF object.

Adding AsEnumerable().AsParallel() had no impact on the search speed.

ApproximatelyEquals is from the FuzzyString library, with the Tolerance set to Strong; the option being passed is simply UseLongestCommonSubstring.

Answer 1

Score: 5


So the question is a bit vague, so my answer will be a bit vague as well. The thing is, you have the test case: you need to profile it to see what is slow, try changes, and evaluate which changes make it faster.

General tip: parallelize only once, and do it at the highest level that makes sense. So if you already have a parallel foreach loop, do not use parallelism inside of it.
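As a minimal, self-contained sketch of that tip (synthetic data and a made-up inner computation, not the question's actual address objects): the outer Parallel.ForEach is the only parallel layer, and the inner LINQ query deliberately stays sequential.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class OuterParallelOnly
{
    static void Main()
    {
        // Synthetic stand-ins for the two address collections in the question.
        int[] outer = Enumerable.Range(0, 100).ToArray();
        int[] inner = Enumerable.Range(0, 1000).ToArray();
        long total = 0;

        // Parallelize once, at the outermost level. The inner query is plain
        // sequential LINQ; adding AsParallel() here would just oversubscribe
        // the thread pool that Parallel.ForEach is already using.
        Parallel.ForEach(outer, o =>
        {
            int matches = inner.Count(i => i % 7 == o % 7);
            Interlocked.Add(ref total, matches);
        });

        Console.WriteLine(total); // prints 14286
    }
}
```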

I suspect that your ApproximatelyEquals is a bit slow, so the first change I'd try is to check the zip code and city before comparing the addresses. Remember, the && operator short-circuits, and the fuzzy match means you'll get more true results from that comparison. By checking the faster and stricter conditions first, you make the expression more efficient: the slow part only needs to run when everything else has already passed.
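To illustrate why the ordering matters, here is a small self-contained sketch: SlowFuzzyEquals is a hypothetical stand-in for FuzzyString's ApproximatelyEquals, and a call counter shows the expensive comparison being skipped for rows that fail the cheap exact check.

```csharp
using System;
using System.Linq;

class ShortCircuitOrder
{
    static int slowCalls = 0;

    // Hypothetical stand-in for the expensive fuzzy comparison.
    static bool SlowFuzzyEquals(string a, string b)
    {
        slowCalls++;
        return a == b || a.StartsWith(b) || b.StartsWith(a);
    }

    static void Main()
    {
        // (zip, address line 1) pairs; made-up sample data.
        var rows = new[]
        {
            (Zip: "11111", Line1: "1 Main St"),
            (Zip: "22222", Line1: "2 Oak Ave"),
            (Zip: "11111", Line1: "1 Main Street"),
        };
        var target = (Zip: "11111", Line1: "1 Main St");

        // Cheap, strict check first: && short-circuits, so SlowFuzzyEquals
        // never runs for the non-matching "22222" row.
        var dupes = rows.Where(r => r.Zip == target.Zip
                                    && SlowFuzzyEquals(r.Line1, target.Line1))
                        .ToList();

        Console.WriteLine(slowCalls);   // 2, not 3
        Console.WriteLine(dupes.Count); // 2
    }
}
```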

And to reduce the number of calls, you can break the problem into smaller chunks. Use ToLookup() to group all items into buckets according to zip and city, and then look for duplicates inside those buckets.

For the reasoning: Suppose you have 1000 entries that are evenly distributed across 40 combinations of zip/city.
Then this code:

for (int i = 0; i < 999; i++)
  for (int j = i+1; j < 1000; j++)
    if (compare(i, j))
      duplicates.Add(i, j);

Calls compare roughly 500k times (999 · 1000 / 2 = 499,500).
Whereas this:

var groups = items.ToLookup(...); // 40 groups
foreach (var group in groups)
  for (int i = 0; i < group.Count-1; i++)
    for (int j = i+1; j < group.Count; j++)
      if (compare(i, j))
        duplicates.Add(i, j);

Will call compare only 40 * 25 * 12 = 12k times (each group of 25 items yields 25 · 24 / 2 = 300 pairs, and there are 40 groups).

The summary: if part of the work is quadratic, many smaller chunks are better than one big one.
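A concrete, self-contained version of the bucketing idea (the Address record and the prefix-based Compare are made-up stand-ins for the real address object and the fuzzy comparison):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class BucketedDuplicateSearch
{
    record Address(string Zip, string City, string Line1);

    static void Main()
    {
        var items = new[]
        {
            new Address("11111", "Springfield", "1 Main St"),
            new Address("11111", "Springfield", "1 Main Street"),
            new Address("22222", "Shelbyville", "9 Elm Rd"),
            new Address("22222", "Shelbyville", "7 Oak Ave"),
        };

        int compares = 0;
        // Hypothetical stand-in for the expensive fuzzy comparison.
        bool Compare(Address a, Address b)
        {
            compares++;
            return a.Line1.StartsWith(b.Line1) || b.Line1.StartsWith(a.Line1);
        }

        // Bucket by the cheap exact keys; pairs are only compared within a bucket.
        var groups = items.ToLookup(a => (a.Zip, a.City));
        var duplicates = new List<(Address, Address)>();
        foreach (var group in groups)
        {
            var bucket = group.ToList(); // IGrouping has no indexer, so materialize once
            for (int i = 0; i < bucket.Count - 1; i++)
                for (int j = i + 1; j < bucket.Count; j++)
                    if (Compare(bucket[i], bucket[j]))
                        duplicates.Add((bucket[i], bucket[j]));
        }

        Console.WriteLine(compares);         // 2 (all-pairs over 4 items would be 6)
        Console.WriteLine(duplicates.Count); // 1
    }
}
```

With 1000 items in 40 buckets instead of 4 items in 2, the same structure produces the ~12k versus ~500k call counts worked out above.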

huangapple
  • Posted on 2023-06-13 02:02:05
  • Please keep this link when reposting: https://go.coder-hub.com/76459212.html