Posted March 10, 2023, 01:58:12

Fast way to select all the elements of a list of strings, which contain at least a substring from another list

Question

I have a list of strings like this:

samples = ['2345_234_1.0_1.35_001', '0345_123_2.09_1.3_003', ...]

The list can be quite long (up to 10^6 elements in the worst case). I have another list containing some substrings:

matches = ['7895_001', '3458_669', '0345_123', ...]

I would like to create a list matched_samples containing only the elements of samples that contain one or more elements of matches. For example, samples[1] ends up in matched_samples because matches[2] is a substring of samples[1]. I could do something like this:

matched_samples = ...

However, this looks like a double for loop, so it's not going to be fast. Is there an alternative? If samples were a pandas DataFrame, I could simply do:

matches_regex = '|'.join(matches)
matched_samples = samples[samples['sample'].str.contains(matches_regex)]

Is there a similarly fast alternative with lists?

Answer 1

Score: 1

You can do the same thing as in your pandas example.

import re
samples = ['2345_234_1.0_1.35_001', '0345_123_2.09_1.3_003']
matches = ['7895_001', '3458_669', '0345_123']
# Escape each substring so it is matched literally, then join them into one alternation.
pattern = re.compile("|".join(re.escape(m) for m in matches))

>>> [s for s in samples if pattern.search(s)]
['0345_123_2.09_1.3_003']
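The re.escape call above matters whenever a substring contains a regex metacharacter such as a dot. Here is a minimal sketch of how an unescaped pattern can over-match, using made-up sample values that are not from the question:

```python
import re

# Hypothetical data: the substring contains ".", a regex metacharacter.
samples = ["2345_1.0_001", "2345_1x0_001"]
matches = ["1.0"]

# Unescaped: "." matches any character, so "1x0" also counts as a hit.
unescaped = re.compile("|".join(matches))
# Escaped: "." is matched literally.
escaped = re.compile("|".join(re.escape(m) for m in matches))

loose = [s for s in samples if unescaped.search(s)]
strict = [s for s in samples if escaped.search(s)]
print(loose)   # both samples
print(strict)  # only the sample containing a literal "1.0"
```

The same consideration would apply to a pattern built with a plain '|'.join(matches) and passed to str.contains in pandas, which treats its argument as a regex by default.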

If there are no newlines in your samples, you could also join them into a single string and use .findall(), wrapping the pattern in (?:...) followed by .* so that each match runs to the end of its line.

Not sure if that would make a difference speed-wise.

>>> pattern = re.compile(f"""(?:{"|".join(re.escape(m) for m in matches)}).*""")
>>> pattern
re.compile(r'(?:7895_001|3458_669|0345_123).*', re.UNICODE)
>>> pattern.findall("\n".join(samples))
['0345_123_2.09_1.3_003']
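One caveat with the .findall() variant: the trailing .* only extends from the matched substring to the end of the line, so a sample whose match starts mid-string comes back without its prefix. The sketch below assumes that adding a leading .* is acceptable. The second sample has a made-up x_ prefix for illustration.

```python
import re

# The second sample is illustrative: its match does not start at position 0.
samples = ["2345_234_1.0_1.35_001", "x_0345_123_2.09_1.3_003"]
matches = ["7895_001", "3458_669", "0345_123"]
alternation = "|".join(re.escape(m) for m in matches)

text = "\n".join(samples)  # assumes no sample contains a newline

# Trailing ".*" only: each result starts at the matched substring.
tail_only = re.compile(f"(?:{alternation}).*").findall(text)
# Leading and trailing ".*": "." never crosses "\n", so each hit is a whole line.
whole_line = re.compile(f".*(?:{alternation}).*").findall(text)

print(tail_only)   # prefix "x_" is lost
print(whole_line)  # full sample is returned
```

The leading .* can cause extra backtracking on lines that do not match, so it is worth timing both variants on real data before choosing one.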

Answer 2

Score: -1

> this looks like a double for loop, so it's not going to be fast.

First, be aware that premature optimization is generally discouraged. In this case you should first answer one question: is it fast enough for your use case?

Secondly, observe that

matched_samples = ...

depends heavily on your data. The worst case is when there is no match at all, as that results in len(samples) * len(matches) substring checks. The best case is when the first element of matches is a substring of every element of samples, as you then get only len(samples) substring checks, because any stops processing once it finds the first truthy value. Note that this means processing time depends on the order of matches.
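To make the best and worst cases concrete, here is a sketch that counts substring checks. The original snippet is truncated in the source, so the comprehension with any() below is an assumption inferred from the description in this answer. The count_checks helper and the sample data are likewise illustrative.

```python
def count_checks(samples, matches):
    """Return (matched_samples, number of `m in s` checks performed)."""
    checks = 0

    def contains(m, s):
        nonlocal checks
        checks += 1
        return m in s

    # Assumed form of the double loop: any() stops at the first hit.
    matched = [s for s in samples if any(contains(m, s) for m in matches)]
    return matched, checks


samples = ["0345_123_a", "0345_123_b", "0345_123_c"]

# Best case: the first element of matches hits every sample.
_, best = count_checks(samples, ["0345_123", "7895_001", "3458_669"])
# Same matches, reordered: the hit is now found last.
_, reordered = count_checks(samples, ["7895_001", "3458_669", "0345_123"])
# Worst case: nothing matches, so every pair is checked.
_, worst = count_checks(samples, ["7895_001", "3458_669", "9999_999"])

print(best, reordered, worst)  # 3 9 9
```

With three samples and three matches, the best case performs len(samples) checks, while the reordered and no-match cases both perform len(samples) * len(matches).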
