Fast way to select all the elements of a list of strings that contain at least one substring from another list

Question

I have a list of strings like this:

samples = ['2345_234_1.0_1.35_001', '0345_123_2.09_1.3_003', ...]

The list can be quite long (up to 10^6 elements in the worst case). I have another list containing some substrings:

matches = ['7895_001', '3458_669', '0345_123', ...]

I would like to create a list matched_samples containing only the elements of samples that contain one or more elements of matches. For example, samples[1] ends up in matched_samples because matches[2] is a substring of samples[1]. I could do something like this:

matched_samples = [s for s in samples if any(m in s for m in matches)]

However, this looks like a double for loop, so it's not going to be fast. Is there any alternative? If samples were a pandas DataFrame, I could simply do:

matches_regex = '|'.join(matches)
matched_samples = samples[samples['sample'].str.contains(matches_regex)]

Is there a similarly fast alternative with lists?

Answer 1

Score: 1

You can do the same thing as in your pandas example.

import re

samples = ['2345_234_1.0_1.35_001', '0345_123_2.09_1.3_003']
matches = ['7895_001', '3458_669', '0345_123']

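# Escape each substring so characters such as "." are matched literally.
pattern = re.compile("|".join(re.escape(m) for m in matches))
matched_samples = [s for s in samples if pattern.search(s)]
print(matched_samples)  # ['0345_123_2.09_1.3_003']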

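If your samples contain no newlines, you could also join them into a single newline-separated string and call .findall(), wrapping the alternation as (?:...).* so that each match runs to the end of its line. Note that each result then starts at the matched substring, so it equals the full sample only when the substring occurs at the start of the sample.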

Not sure if that would make a difference speed-wise.

>>> pattern = re.compile(f"""(?:{"|".join(re.escape(m) for m in matches)}).*""")
>>> pattern
re.compile(r'(?:7895_001|3458_669|0345_123).*', re.UNICODE)
>>> pattern.findall("\n".join(samples))
['0345_123_2.09_1.3_003']
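
The difference between the pattern above and one with a leading .* can be seen in a small sketch. The second sample below has a hypothetical 'x_' prefix added purely for illustration, so that the match no longer sits at the start of the string; the variable names are likewise assumptions, not part of the original answer.

```python
import re

# Hypothetical data: the second sample gets an 'x_' prefix for illustration.
samples = ['2345_234_1.0_1.35_001', 'x_0345_123_2.09_1.3_003']
matches = ['7895_001', '3458_669', '0345_123']

alternation = "|".join(re.escape(m) for m in matches)
joined = "\n".join(samples)

# Without a leading .*, each result starts at the matched substring.
tail_only = re.compile(f"(?:{alternation}).*").findall(joined)

# With a leading .*, each result is the whole line, i.e. the whole sample.
# "." does not match "\n" by default, so a match never spans two samples.
whole_line = re.compile(f".*(?:{alternation}).*").findall(joined)

print(tail_only)   # ['0345_123_2.09_1.3_003']
print(whole_line)  # ['x_0345_123_2.09_1.3_003']
```

The leading .* may add backtracking on lines that do not match, so it is worth timing both variants on real data before choosing one.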

Answer 2

Score: -1

> this looks like a double for loop, so it's not going to be fast.

First, be aware that premature optimization is generally discouraged. The question to answer here is: is it fast enough for your use case?
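
One way to answer that question is simply to measure. The sketch below times the any()-based comprehension against a compiled regex alternation using timeit. The data is synthetic and hypothetical (random IDs in roughly the format from the question), and the helper names random_id, with_any, and with_regex are assumptions made for illustration; timings on real data may look quite different.

```python
import random
import re
import timeit

random.seed(0)  # hypothetical synthetic data, for illustration only

def random_id():
    # Produces strings shaped like '0345_123'.
    return f"{random.randrange(10**4):04d}_{random.randrange(10**3):03d}"

samples = [f"{random_id()}_1.0_1.35_{i % 1000:03d}" for i in range(5_000)]
matches = [random_id() for _ in range(50)]

def with_any():
    # Plain substring checks; any() stops at the first hit per sample.
    return [s for s in samples if any(m in s for m in matches)]

pattern = re.compile("|".join(re.escape(m) for m in matches))

def with_regex():
    # One compiled alternation searched once per sample.
    return [s for s in samples if pattern.search(s)]

# Both approaches should select exactly the same samples.
assert with_any() == with_regex()

t_any = timeit.timeit(with_any, number=3)
t_regex = timeit.timeit(with_regex, number=3)
print(f"any(): {t_any:.3f}s  regex: {t_regex:.3f}s")
```

Scaling the list sizes toward the real workload gives a more honest picture than reasoning about loop nesting alone.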

Second, observe that the running time of

matched_samples = [s for s in samples if any(m in s for m in matches)]

depends heavily on your data. The worst case is when nothing matches at all, since that costs len(samples)*len(matches) substring checks. The best case is when the first element of matches is a substring of every element of samples: that costs only len(samples) substring checks, because any stops processing as soon as it finds the first truthy value. Note that this means the processing time depends on the order of matches.
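
The worst and best cases can be checked by counting the substring tests directly. This is a minimal sketch: count_checks and contains are hypothetical helpers introduced only for this illustration, and the match lists are made up so that each case is triggered on the two samples from the question.

```python
def count_checks(samples, matches):
    """Return (matched_samples, number of `m in s` checks performed)."""
    checks = 0

    def contains(m, s):
        nonlocal checks
        checks += 1
        return m in s

    matched = [s for s in samples if any(contains(m, s) for m in matches)]
    return matched, checks

samples = ['2345_234_1.0_1.35_001', '0345_123_2.09_1.3_003']

# Worst case: nothing matches, so every (sample, match) pair is checked.
_, worst = count_checks(samples, ['7895_001', '3458_669', '9999_999'])
print(worst)  # 6 == len(samples) * len(matches)

# Best case: the first match is a substring of every sample.
_, best = count_checks(samples, ['_1.', '7895_001', '3458_669'])
print(best)  # 2 == len(samples)

# Same matches, different order: the hit now comes last for every sample.
_, reordered = count_checks(samples, ['7895_001', '3458_669', '_1.'])
print(reordered)  # 6
```

If some substrings are known to be much more common than others, putting them first in matches reduces the number of checks without changing the result.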

huangapple
  • Published on March 10, 2023, 01:58:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75688389.html