Fast way to select all the elements of a list of strings, which contain at least a substring from another list
Question
I have a list of strings like this:
samples = ['2345_234_1.0_1.35_001', '0345_123_2.09_1.3_003', ...]
The list can be quite long (up to 10^6 elements in the worst case). I have another list containing some substrings:
matches = ['7895_001', '3458_669', '0345_123', ...]
I would like to create a list matched_samples that contains only the elements of samples that contain one or more elements of matches. For example, samples[1] ends up in matched_samples because matches[2] is a substring of samples[1]. I could do something like this:
matched_samples = [s for s in samples if any(m in s for m in matches)]
However, this looks like a double for loop, so it's not going to be fast. Is there any alternative? If samples were a pandas dataframe, I could simply do:
matches_regex = '|'.join(matches)
matched_samples = samples[samples['sample'].str.contains(matches_regex)]
Is there a similarly fast alternative with lists?
Answer 1
Score: 1
You can do the same thing as in your pandas example.
import re
samples = ['2345_234_1.0_1.35_001', '0345_123_2.09_1.3_003']
matches = ['7895_001', '3458_669', '0345_123']
pattern = re.compile("|".join(re.escape(m) for m in matches))
>>> [s for s in samples if pattern.search(s)]
['0345_123_2.09_1.3_003']
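The re.escape call in the pattern above matters once a substring contains a regex metacharacter such as a dot. The following is a minimal sketch of the difference; the substring '4.1' is a hypothetical value chosen for illustration and does not come from the question.

```python
import re

samples = ['2345_234_1.0_1.35_001', '0345_123_2.09_1.3_003']
matches = ['4.1']  # hypothetical substring containing a regex metacharacter

# Without escaping, '.' matches any character, so '4.1' also matches '4_1'.
unescaped = re.compile("|".join(matches))
# With escaping, '.' only matches a literal dot.
escaped = re.compile("|".join(re.escape(m) for m in matches))

unescaped_hits = [s for s in samples if unescaped.search(s)]
escaped_hits = [s for s in samples if escaped.search(s)]

print(unescaped_hits)  # ['2345_234_1.0_1.35_001'], a false positive via '4_1'
print(escaped_hits)    # [], because no sample contains the literal text '4.1'
```

The same caveat applies to the pandas version in the question, where the substrings are joined without escaping.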
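If there are no newlines in your samples, you could also join them into a single newline-separated string and use .findall(), wrapping the pattern in (?:...).* so that each match runs to the end of its line. Note that each result starts at the matched substring, so it equals the full sample only when the substring is at the start of the sample.
Not sure if that would make a difference speed-wise.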
>>> pattern = re.compile(f"""(?:{"|".join(re.escape(m) for m in matches)}).*""")
>>> pattern
re.compile(r'(?:7895_001|3458_669|0345_123).*', re.UNICODE)
>>> pattern.findall("\n".join(samples))
['0345_123_2.09_1.3_003']
Answer 2
Score: -1
> this looks like a double for loop, so it's not going to be fast.
First, be aware that premature optimization is generally considered undesirable. In this case, the question to answer is: is it fast enough for your use case?
Second, observe that the running time of
matched_samples = [s for s in samples if any(m in s for m in matches)]
depends greatly on your data. The worst case is when no sample matches at all, since that results in len(samples)*len(matches) substring checks. The best case is when the first element of matches is a substring of every element of samples, since you then get only len(samples) substring checks, because any stops processing as soon as it finds the first truthy value. Note that this means the processing time also depends on the order of matches.
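The effect of short-circuiting can be made visible by counting the substring checks. This is a minimal sketch, assuming the comprehension in question has the form [s for s in samples if any(m in s for m in matches)]. The helper count_checks and the substrings '9999_999' and '_' are hypothetical and only serve the illustration.

```python
def count_checks(samples, matches):
    """Return (matched_samples, number of substring checks performed)."""
    checks = 0

    def contains(m, s):
        nonlocal checks
        checks += 1
        return m in s

    matched = [s for s in samples if any(contains(m, s) for m in matches)]
    return matched, checks

samples = ['2345_234_1.0_1.35_001', '0345_123_2.09_1.3_003']

# Worst case: nothing matches, so every (sample, match) pair is checked.
_, worst = count_checks(samples, ['7895_001', '3458_669', '9999_999'])
print(worst)  # 6 == len(samples) * len(matches)

# Best case: the first substring occurs in every sample.
_, best = count_checks(samples, ['_', '7895_001', '3458_669'])
print(best)  # 2 == len(samples)

# Same substrings in a different order: same result, different amount of work.
matched_a, checks_a = count_checks(samples, ['0345_123', '7895_001', '3458_669'])
matched_b, checks_b = count_checks(samples, ['7895_001', '3458_669', '0345_123'])
print(checks_a, checks_b)  # 4 6
```

Putting the substrings that match most often first therefore reduces the work without changing the result.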