Frequency distribution in a list of strings based on common beginning substring

Question

Is there a way to get the common substrings that form the beginning of the strings in a list?
Each substring must be longer than one character.

For instance, if there is a list L:

[ "hello", "help", "helium", "habit", "habitat", "hazard"]

The output should be a frequency distribution:

he - 3
hel - 3
habit - 2
ha - 3

I can later remove the duplicates formed by the same strings by applying the same logic again: look at the common beginning substrings within the distribution, check their counts, and if the counts are the same, keep only the longest string.

That is, 'he' and 'hel' are formed by the same set of words and have the same count, so only "hel" is kept, since it is the longest string.

Answer 1

Score: 1

You can use collections.Counter to count how often each substring occurs; you only have to generate the substrings yourself.

A one-liner is to generate the substrings and pass them through itertools.chain.from_iterable, which flattens the per-word generators of substrings into one long generator for the Counter to consume, as follows.

from collections import Counter
from itertools import chain

input_list = ["hello", "help", "helium", "habit", "habitat", "hazard"]

# Count every prefix of length >= 2 of every word. The upper bound is
# len(entry) + 1 so that a whole word (e.g. "habit") also counts as a prefix.
counter = Counter(
    chain.from_iterable(
        (entry[:i] for i in range(2, len(entry) + 1)) for entry in input_list
    )
)

# Keep only the prefixes shared by more than one word.
counter_filtered = {x: y for x, y in counter.items() if y != 1}

print(counter_filtered)
# {'he': 3, 'hel': 3, 'ha': 3, 'hab': 2, 'habi': 2, 'habit': 2}
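
The follow-up step described in the question (when several prefixes come from the same set of words, keep only the longest) is not covered by the answer. A minimal sketch of one way to do it is shown here. The function names prefix_counts and keep_longest and the min_len parameter are hypothetical, introduced only for illustration. The sketch assumes that a prefix can be dropped whenever a longer prefix extends it and has the same count, which implies both are formed by the same words.

```python
from collections import Counter


def prefix_counts(words, min_len=2):
    """Count every prefix of length >= min_len; keep those shared by 2+ words."""
    counts = Counter(
        word[:i] for word in words for i in range(min_len, len(word) + 1)
    )
    return {prefix: n for prefix, n in counts.items() if n > 1}


def keep_longest(counts):
    """Drop a prefix when a longer prefix extends it and has the same count."""
    return {
        prefix: n
        for prefix, n in counts.items()
        if not any(
            other != prefix and other.startswith(prefix) and counts[other] == n
            for other in counts
        )
    }


words = ["hello", "help", "helium", "habit", "habitat", "hazard"]
result = keep_longest(prefix_counts(words))
print(result)  # {'hel': 3, 'ha': 3, 'habit': 2}
```

The comparison in keep_longest is quadratic in the number of distinct prefixes, which is fine for short lists; for large inputs a trie would likely be a better fit.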

huangapple
  • Posted on April 4, 2023, 04:02:48
  • When reposting, please keep the link to this post: https://go.coder-hub.com/75923358.html