2023年2月6日 03:04:20go评论94阅读模式

英文:

How can I make this Indexing algorithm more efficient?

问题

我有一个包含172,033行的数据框（从一个具有各种列的CSV文件派生而来）。我创建了一个自定义索引函数，用于阻止那些没有相似“name”属性的记录对。问题出在算法的效率上。仅仅为了达到第10次迭代，需要大约一分钟的时间。因此，索引整个数据集需要太多时间。如何能使我的算法更加高效？

class CustomIndex(BaseIndexAlgorithm):
    def _link_index(self, df_a, df_b):
        indici1=[]
        indici2=[]
        for i in range(0, 173033):
            if(i%2 == 0):
                print(i) # 用于跟踪迭代次数
            for j in range(i, 173033):
                if(similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.5):
                    indici1.append(i)
                    indici2.append(j)
        
        indici = [indici1, indici2]
        return pd.MultiIndex.from_arrays(indici, names=('first', 'second'))

我想获得一个MultiIndex对象，它将是一个包含那些足够相似以不被阻止的记录对的索引的数组。

[MultiIndex([(0, 0),
             (0, 22159),
             (0, 67902),
             (0, 67903),
             (1, 1),
             (1, 1473),
             (1, 5980),
             (1, 123347),
             (2, 2),
             ...

这是相似性函数的代码：

from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

这是我作为输入的数据帧的示例：

   name
0  Amazon
1  Walmart
2  Apple
3  Amazon.com
4  Walmart Inc.

我希望生成的MultiIndex包含0和3之间、1和4之间的元组链接以及所有重复的（0和0、1和1等）。

英文:

I've got a Dataframe (deriving from a csv file with various columns) with 172033 rows. I've created a custom indexing function that blocks pairs of records that haven't got similar 'name' attributes. The problem resides in the efficiency of the algorithm. Just to get to the 10th iteration it takes about a minute. Therefore indexing the whole dataset would take way too much time. How can I make my algorithm more efficient?

class CustomIndex(BaseIndexAlgorithm):
    def _link_index(self, df_a, df_b):
        indici1=[]
        indici2=[]
        for i in range(0, 173033):
            if(i%2 == 0):
                print(i) #keeps track of the iteration
            for j in range(i, 173033):
                if(similar(df_a.loc[i, &#39;name&#39;], df_a.loc[j, &#39;name&#39;])&gt;0.5):
                    indici1.append(i)
                    indici2.append(j)
        
        indici = [indici1, indici2]
        return pd.MultiIndex.from_arrays(indici, names=(&#39;first&#39;, &#39;second&#39;))

I want to obtain a MultiIndex object, which would be an array of tuples contains the indexes of the pairs of records which are similar enough to not be blocked.

[MultiIndex([(     0,    0),
             (     0,    22159),
             (     0,    67902),
             (     0,    67903),
             (     1,    1),
             (     1,    1473),
             (     1,    5980),
             (     1,    123347),
             (     2,    2),
             ...

Here's the code for the similarity function:

from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

Here's an example of the dataframe I have as input:

   name
0  Amazon
1  Walmart
2  Apple
3  Amazon.com
4  Walmart Inc.

I would like the resulting MultiIndex to contain tuple links between 0 and 3, 1 and 4 and all the repetitions (0 and 0, 1 and 1 etc.)

答案1

得分: 1

你正在使用list的.append方法，根据PythonWiki，这个方法的时间复杂度是O(1)，但具体操作的耗时可能会受容器历史记录的影响。你可以考虑使用collections.deque，它没有这种怪异的情况，只需添加import collections并执行以下操作：

indici1 = collections.deque()
indici2 = collections.deque()
...
indici = [list(indici1), list(indici2)]

如果这还不够，你可能需要一个用于可能的改进的similar函数。

英文:

You are using .append method of list, that method according to PythonWiki is O(1) but Individual actions may take surprisingly long, depending on the history of the container.. You might use collections.deque which does have such quirks, just add import collections and do

indici1=collections.deque()
indici2=collections.deque()
...
indici = [list(indici1), list(indici2)]

If that would not help enough you would need similar function for possible improvements.

答案2

得分: 1

正如其他人指出的，解决你的问题需要O(N^2)的运行时间，这意味着对于非常大的数据集来说不会很高效。不过，我认为仍然有很大的改进空间。

以下是一些可以加速你的代码的策略：

如果你的数据集包含许多重复的name值，你可以使用“记忆化”来避免重复计算重复名称对的similar分数。当然，缓存所有172k^2对将会非常昂贵，但如果数据已按名称预先排序，那么使用包含172k个项的lru_cache 应该可以正常工作。
从difflib文档来看，你有快速筛选“明显”不匹配的选项。如果你预计大多数对很容易从考虑中排除，那么首先调用SequenceMatcher.quick_ratio()（甚至real_quick_ratio()），然后只在有必要时调用ratio() 是有意义的。
在常规控制流中会有一些开销。
- 在for循环中多次调用df.loc可能比简单迭代慢一些。
- 你可以使用itertools.combinations来避免自己编写嵌套的for循环。
- 另外，tqdm提供了一个方便的进度条，比你代码中的print语句更好地指示真正的进度。

最后，我在你上面的函数中看不到df_b参数的必要，所以我没有在下面的代码中包含它。以下是完整的解决方案：

import pandas as pd
from difflib import SequenceMatcher
from functools import lru_cache
from itertools import combinations
from tqdm import tqdm
@lru_cache(173_000)
def is_similar(a, b):
    matcher = SequenceMatcher(None, a, b)
    if matcher.quick_ratio() <= 0.5:
        return False
    return matcher.ratio() > 0.5
def link_index(df):
    # 我们初始化索引结果对，包括 [(0,0), (1,1), (2,2), ...]
    # 因为它们在逻辑上是“链接”的，而你的问题陈述中说你希望它们在结果中。
    indici1 = df.index.tolist()
    indici2 = df.index.tolist()
    # 对名称进行排序，以便我们的lru_cache有效，尽管它仅限于173k个条目。
    name_items = df['name'].sort_values().items()
    pairs = combinations(name_items, 2)
    num_pairs = math.comb(len(names), 2)
    for (i, i_name), (j, j_name) in tqdm(pairs, total=num_pairs):
        if is_similar(i_name, j_name):
            indici1.append(i)
            indici2.append(j)
    indici = [indici1, indici2]
    links = pd.MultiIndex.from_arrays(indici, names=('first', 'second'))
    return links.sortlevel([0,1])[0]

快速测试：

names = ['Amazon', 'Walmart', 'Apple', 'Amazon.com', 'Walmart Inc.']
df = pd.DataFrame({'name': names})
link_index(df)

输出：

(MultiIndex([(0, 0),
             (0, 3),
             (1, 1),
             (1, 4),
             (2, 2),
             (3, 3),
             (4, 4)],
            names=['first', 'second']),
 array([0, 5, 1, 6, 2, 3, 4]))

希望这能够加速你在实际数据上的操作！

英文:

As others have pointed out, the solution to your problem requires O(N^2) running time, which means it won't scale well for very large datasets. Nonetheless, I think there's still a lot of room for improvement.

Here are some strategies you can use to speed up your code:

If your dataset contains many duplicate name values, you can use "memoization" to avoid re-computing the similar score for duplicate name pairs. Of course, caching all 172k^2 pairs would be devastatingly expensive, but if the data is pre-sorted by name, then lru_cache with 172k items should work just fine.
Looking at the difflib documentation, it appears that you have the option of quickly filtering out "obvious" mismatches. If you expect most pairs to be "easy" to eliminate from consideration, then it makes sense to first call SequenceMatcher.quick_ratio() (or even real_quick_ratio()), followed by ratio() only if necessary.

There will be some overhead in the ordinary control flow.
- Calling df.loc many times in a for-loop might be a bit slow in comparison to simple iteration.
- You can use itertools.combinations to avoid writing a nested for-loop yourself.
- BTW, tqdm provides a convenient progress bar, which will give a better indication of true progress than the print statements in your code.

Lastly, I saw no need for the df_b parameter in your function above, so I didn't include it in the code below. Here's the full solution:

import pandas as pd
from difflib import SequenceMatcher
from functools import lru_cache
from itertools import combinations
from tqdm import tqdm
@lru_cache(173_000)
def is_similar(a, b):
    matcher = SequenceMatcher(None, a, b)
    if matcher.quick_ratio() &lt;= 0.5:
        return False
    return matcher.ratio() &gt; 0.5
def link_index(df):
    # We initialize the index result pairs with [(0,0), (1,1,), (2,2), ...]
    # because they are trivially &quot;linked&quot; and your problem statement
    # says you want them in the results.
    indici1 = df.index.tolist()
    indici2 = df.index.tolist()
    # Sort the names so that our lru_cache is effective,
    # even though it is limited to 173k entries.
    name_items = df[&#39;name&#39;].sort_values().items()
    pairs = combinations(name_items, 2)
    num_pairs = math.comb(len(names), 2)
    for (i, i_name), (j, j_name) in tqdm(pairs, total=num_pairs):
        if is_similar(i_name, j_name):
            indici1.append(i)
            indici2.append(j)
    indici = [indici1, indici2]
    links = pd.MultiIndex.from_arrays(indici, names=(&#39;first&#39;, &#39;second&#39;))
    return links.sortlevel([0,1])[0]

Quick Test:

names = [&#39;Amazon&#39;, &#39;Walmart&#39;, &#39;Apple&#39;, &#39;Amazon.com&#39;, &#39;Walmart Inc.&#39;]
df = pd.DataFrame({&#39;name&#39;: names})
link_index(df)

Output:

(MultiIndex([(0, 0),
             (0, 3),
             (1, 1),
             (1, 4),
             (2, 2),
             (3, 3),
             (4, 4)],
            names=[&#39;first&#39;, &#39;second&#39;]),
 array([0, 5, 1, 6, 2, 3, 4]))

Let me know if that speeds things up on your actual data!

Let's set some realistic expectations.

Your original estimate was ~1 minute for 10 "iterations". That implies the total time would have been ~6 days:

print(math.comb(172033, 2) / (10*172033) / 60 / 24)

On the other hand, merely iterating through the full set of i,j combinations and doing absolutely nothing with them would take ~45 minutes on my machine. See for yourself:

sum(1 for _ in tqdm(combinations(np.arange(172033), 2), total=math.comb(172033, 2)))

So the real solution will take longer than that. Now you've got some bounds on what the optimal solution will require: Somewhere between ~1 hour and ~6 days. Hopefully it's closer to the former!

答案3

得分: 0

我们要求您公开similar()函数，但您拒绝这样做。

您编写了以下内容：

        for i in range(0, 173033):
            ...
            for j in range(i, 173033):
                if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.5:

因此，您计划调用这个神秘的similar函数 29,940,419,089（300亿）次。我猜这可能需要一些时间，也许两周？

我们将这称为O(n^2)二次性能。

以下是解决方法。

首先，对数据集进行排序。这只需要O(n log n)的成本，很便宜！

接下来，在similar损失函数中找到一些东西，使您能够简化问题，或者设计一个新的相关损失函数来允许这样做。剧透警告——我们只能与本地邻居进行比较，而不是与距离超过100,000位置的远程条目进行比较。我将重点放在五个示例单词上。

似乎您神秘的函数报告了一个大值，当比较这些字符串与“Apple”时，报告了一个小值。因此，让我们采用一个新的成本函数规则。如果两个字符串的初始字符不相同，则相似度立即报告为0。

现在我们要比较Apple和两个Amazon，以及Walmart和自身。我们已经将问题分成了“A”和“W”子集。这个二次性能怪兽开始消失。

小便签：由于您的相似函数很可能是对称的，所以只需要检查正方形的一半，其中i < j。

针对这个大小的数据集的实际实现可能需要更积极的方法，也许要求初始的2或3个字母是相同的。也许您可以运行一个连字符算法，查看相同的音节。或者使用Metaphone或Soundex。

您可以做的最简单的事情是定义一个窗口W，可能是10。在检查排序后的数据集时，我们可以任意宣布i、j条目之间的相似度为零，如果abs(i - j) > W。这可能会影响准确性，但对于性能来说非常好，它可以让您在甚至调用函数之前修剪搜索空间。我们从O(n^2)变成了O(n)线性。如果您从不等待足够长时间来生成答案，那么完美的准确性几乎无关紧要。

使用这些想法编写一个解决方案，然后让我们知道您是如何解决细节的。

编辑

@Pierre D.主张使用局部敏感哈希来避免由于{prefix1}{entity}和{prefix2}{entity}在排序列表中远离彼此，而在现实生活中和哈希空间中靠得很近而导致的丢失的收入。Metaphone是诱导LSH碰撞的一种技术，但它很少对初始字母/排序顺序产生任何影响。Das, et al.在“Google News Personalization: Scalable Online Collaborative Filtering”中描述了一种MinHash技术。您要检查的商家名称的音节数将明显少于Google看到的点击集的大小。哈希给出了二进制的命中或不命中的结果，而不是平滑的排名或距离度量。需要促进邻居碰撞，同时避免生日悖论，这使得难以推荐在没有生产数据集的快照的情况下调整参数。不过，这是一种您应该记在心中的技术。

要求spacy在数据上进行命名实体识别的预处理可能也值得一试，以期望归一化或丢弃任何麻烦的“噪音”前缀。

英文:

We asked you to reveal the similar() function but you've declined to do so.

You wrote

        for i in range(0, 173033):
            ...
            for j in range(i, 173033):
                if similar(df_a.loc[i, &#39;name&#39;], df_a.loc[j, &#39;name&#39;]) &gt; 0.5:

So you plan to call that mysterious similar 29_940_419_089 (30 billion) times.
I'm guessing that's going to take some little while, maybe two weeks?
We describe this as O(n^2) quadratic performance.

Here is the way out.

First, sort your dataset. It will only cost O(n log n), cheap!

Next, find something in the similar loss function that allows you to
simplify the problem, or design a new related loss function which allows that.
Spoiler alert -- we can only do comparisons against local neighbors,
not against far flung entries more than 100,000 positions away.
I will focus on the five example words.

It seems likely that your mysterious function reports
a large value for similar("Walmart", "Walmart Inc.")
and a small value when comparing those strings against "Apple".
So let's adopt a new cost function rule.
If initial character of the two strings is not identical,
the similarity is immediately reported as 0.

So now we're faced with comparing Apple against two Amazons,
and Walmart with itself.
We have partitioned the problem into "A" and "W" subsets.
The quadratic monster is starting to die.

Minor side note: Since your similarity function likely is
symmetric, it suffices to examine just half the square,
where i < j.

A practical implementation for a dataset of this size
is going to need to be more aggressive, perhaps insisting
that initial 2 or 3 letters be identical.
Perhaps you can run a hyphenation algorithm and look
at identical syllables.
Or use Metaphone
or Soundex.

The simplest possible thing you could do is define a
window W of maybe 10. When examining the sorted dataset,
we arbitrarily declare that similarity between i, j entries
shall be zero if abs(i - j) > W.
This may impact accuracy, but it's great for performance,
it lets you prune the search space before you even call the function.
We went from O(n^2) to O(n) linear.
Perfect accuracy is hardly relevant if you never wait
long enough for the code to produce an answer.

Use these ideas to code up a solution, and
let us know
how you resolved the details.

EDIT

@Pierre D. advocates the use of
locality sensitive hashing
to avoid lost revenue due to {prefix1}{entity} and {prefix2}{entity}
being far apart in a sorted list yet close together IRL and in hash space.
Metaphone is one technique for inducing LSH collisions,
but it seldom has any effect on initial letter / sort order.
Das, et al.,
describe a MinHash technique in
"Google News Personalization: Scalable Online Collaborative Filtering".
Number of syllables in the business names you're examining
will be significantly less than size of click-sets seen by Google.
Hashes give binary results of hit or miss,
rather than a smooth ranking or distance metric.
The need to promote neighbor collisions,
while steering clear of the birthday paradox,
makes it difficult to recommend tuning parameters
absent a snapshot of the production dataset.
Still, it is a technique you should keep in mind.

Asking spacy to do a
named entity recognition
pre-processing pass over your data might also be
worth your while, in the hopes of normalizing
or discarding any troublesome "noise" prefixes.

答案4

得分: 0

以下是翻译好的部分：

"Indexing the dataset takes about 20 minutes on my machine which is great compared to before. Thanks everyone for the help!"
"对数据集进行索引大约需要20分钟，这与以前相比非常出色。感谢大家的帮助！"

"I'll be looking into locality sensitive hashing suggested by Pierre D."
"我将研究皮埃尔·D建议的局部敏感哈希。"

英文:

In the end this is what I came up. After sorting the dataset in alphabetical order according to the attribute 'name', I use this custom index. Every row in the dataset is compared to its neighbours in the range of 100 rows.

class CustomIndex(BaseIndexAlgorithm):
    def _link_index(self, df_a, df_b):
        t0 = time.time()
        indici1=[]
        indici2=[]
        for i in range(1, 173034):
            if(i%500 == 0):
                print(i)
            if(i&lt;100):
                n = i
                for j in range((i-(n-1)), (i+100)):
                    if(similar(df_a.loc[i, &#39;name&#39;], df_a.loc[j, &#39;name&#39;])&gt;0.35):
                        indici1.append(i)
                        indici2.append(j)
            elif(i&gt;172932):
                n = 173033 - i
                for j in range((i-100), (i+n)):
                    if(similar(df_a.loc[i, &#39;name&#39;], df_a.loc[j, &#39;name&#39;])&gt;0.35):
                        indici1.append(i)
                        indici2.append(j)
            else:
                for j in range((i-100), (i+100)):
                    if(similar(df_a.loc[i, &#39;name&#39;], df_a.loc[j, &#39;name&#39;])&gt;0.35):
                        indici1.append(i)
                        indici2.append(j)
        
        indici = [indici1, indici2]
        t1 = time.time()
        print(t1-t0)
        return pd.MultiIndex.from_arrays(indici, names=(&#39;first&#39;, &#39;second&#39;))

Indexing the dataset takes about 20 minutes on my machine which is great compared to before. Thanks everyone for the help!

I'll be looking into locality sensitive hashing suggested by Pierre D.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何使这个索引算法更高效？

问题

答案1

答案2

答案3

答案4

如何将数据框中的每一列分别乘以不同的值每列。

Pandas 聚合和对列名应用条件

使用字节切片（或数组）时，优化Go（Golang）代码的几个技巧有哪些？

如何在数组中找到最大的奇数，并在没有奇数的情况下返回0（Java）？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。