April 19, 2023 18:43:54

How to search for one or more strings in a file using regex, and count the number of each string separately?

Question

# Materialize the iterator first: a finditer result can only be consumed once
matches = list(re.finditer(Pattern, lines[i]))
count1 += sum(1 for m in matches if m.group(1))
count2 += sum(1 for m in matches if m.group(2))
count3 += sum(1 for m in matches if m.group(3))

So I am trying to find one or more strings in each line of a file, and count the number of times each string comes up in total in the file. In some lines there is only one of the strings, however in other lines there may be multiple target strings, if that makes sense. I am trying to use a regular expression to do this.

So what I've tried is as follows (having already read the file in and separated it into lines using .readlines):

import re

count1 = 0
count2 = 0
count3 = 0

Pattern = r'(?i)(\bString1\b)|(\bString2\b)|(\bString3\b)'

i = 0
while i != len(lines):
    # re.search returns only the first match in the line
    match = re.search(Pattern, lines[i])

    if match:
        if match.group(1):
            count1 = count1 + 1
        elif match.group(2):
            count2 = count2 + 1
        elif match.group(3):
            count3 = count3 + 1
    i = i + 1

This works when a line contains at most one match, but when there are several it only counts the first match and then moves on. Is there a way for me to scan the whole line anyway? I know re.findall finds all matches, but it puts them into a list, and I don't know how I would reliably count the number of matches for each word, since the matches returned by findall would sit at different indexes in that list on each pass through the loop.

Answer 1

Score: 1

You can use findall and count occurrences at the end.
For example:

import re

data = "String1 String2 String2 String3\nString1 String1\nString3"
Pattern = r'(?i)(\bString1\b)|(\bString2\b)|(\bString3\b)'
lines = data.split('\n')
all_matches = []
for line in lines:
    # findall returns one tuple per match, with one slot per capture group
    all_matches.extend(re.findall(Pattern, line))

# Note: because of (?i), a match such as 'string1' would not equal 'String1';
# test "if el[0]" instead if differently cased matches should be counted too.
count1 = len([el for el in all_matches if el[0] == 'String1'])
count2 = len([el for el in all_matches if el[1] == 'String2'])
count3 = len([el for el in all_matches if el[2] == 'String3'])

print(count1, count2, count3)  # prints: 3 2 2

Note: findall returns a list of tuples, where the first item of each tuple corresponds to the first group, and so on.

all_matches will be a list of tuples, each of the shape (match for String1, match for String2, match for String3). A group that did not match is '', so the list looks something like this:

[('String1', '', ''), ('', 'String2', ''), ('', 'String2', ''), ...]

When calculating count1, for example, we build a list of the elements that matched String1 (the condition, as we saw, is that the first element of the tuple equals 'String1'):

first_group = [el for el in all_matches if el[0] == 'String1']

Then we take the length of that list as the value of count1:

count1 = len(first_group)
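
The same tuple-based counting can be sketched without hard-coding the expected text of each group, which also keeps the totals right when the case-insensitive flag matches "string1" or "STRING2". This is only an illustration: count_groups is a hypothetical helper name, and it assumes the pattern has at least two capture groups (with a single group, findall returns plain strings rather than tuples).

```python
import re

PATTERN = r'(?i)(\bString1\b)|(\bString2\b)|(\bString3\b)'


def count_groups(pattern, text):
    """Return a list with one match count per capture group (hypothetical helper)."""
    # .groups is the number of capture groups in the compiled pattern
    counts = [0] * re.compile(pattern).groups
    for row in re.findall(pattern, text):
        # Each row is a tuple; only the slot of the group that matched is non-empty
        for idx, item in enumerate(row):
            if item:
                counts[idx] += 1
    return counts


data = "String1 string2 STRING2 String3\nstring1 String1\nString3"
print(count_groups(PATTERN, data))  # prints: [3, 2, 2]
```

Testing for a non-empty slot rather than comparing against a literal means the counting code does not need to change when the pattern does.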

Answer 2

Score: 1

In your example, the matches are all static strings, so you can just use them as dictionary keys for a Counter object.

import re
from collections import Counter

count = Counter()
for line in lines:
    for match in re.finditer(Pattern, line):
        # Wrap the string in a list: update() on a bare string
        # would count its individual characters instead
        count.update([match.group(0)])

for k in count.keys():
    print(f"{count[k]} occurrences of {k}")

One useful change here is using re.finditer() instead of re.findall(): finditer yields proper re.Match objects, from which you can extract the matched string with .group(0), as well as various other attributes should you wish to.
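
As a rough illustration of what a Match object offers beyond the matched text, the sketch below collects .group(0), .span() and .lastindex for each match. The helper name describe_matches and the sample line are made up for this example; for a pattern built from alternated groups like the one in the question, .lastindex is the number of the group that matched.

```python
import re

PATTERN = r'(?i)(\bString1\b)|(\bString2\b)|(\bString3\b)'


def describe_matches(pattern, line):
    """Return (matched text, (start, end), group number) per match (hypothetical helper)."""
    result = []
    for match in re.finditer(pattern, line):
        # lastindex is the index of the last capture group that took part in the match
        result.append((match.group(0), match.span(), match.lastindex))
    return result


found = describe_matches(PATTERN, "String1 string2 String2")
for text, span, group_number in found:
    print(text, span, group_number)
```

Because .lastindex identifies the group directly, it can serve as a counter key that does not depend on the case of the matched text.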

If you need to extract matches which could contain variations, like r"c[ei]*ling" or r"\d+", you can't use the matched strings as dictionary keys (because then the Counter would count each unique string as a separate entity; so you would get "12 occurrences of 123" and "1 occurrence of 234" instead of "13 occurrences of \d+"); in that case, I would perhaps try to use named subgroups.

    for match in re.finditer(r"(?P<ceiling>c[ei]*ling)|(?P<number>\d+)", line):
        matches = match.groupdict()
        for key in matches.keys():
            if matches[key] is not None:
                # Increment by key; update(key) would count the key's characters
                count[key] += 1
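
A self-contained version of the named-group idea might look like the sketch below. It uses match.lastgroup, which holds the name of the group that matched, in place of scanning groupdict(); the function name count_by_name and the sample lines are assumptions made for this illustration.

```python
import re
from collections import Counter

NAMED = re.compile(r"(?P<ceiling>c[ei]*ling)|(?P<number>\d+)")


def count_by_name(lines):
    """Count matches per named group across all lines (hypothetical helper)."""
    count = Counter()
    for line in lines:
        for match in NAMED.finditer(line):
            # lastgroup is the name of the last matched capture group,
            # i.e. the alternative that actually matched
            count[match.lastgroup] += 1
    return count


sample = ["ceiling 123 cieling", "234 and 123"]
totals = count_by_name(sample)
for name, n in totals.items():
    print(f"{n} occurrences of {name}")
```

With this sample, both spellings are counted under "ceiling" and all three numbers under "number", regardless of the exact text matched.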

Answer 3

Score: 1

Another variant is to use numpy and its count_nonzero function. Since there's no need to split the data into lines, let's assume it is all in data:

import re
import numpy as np
# count non-empty strings along axis 0 (the matches for each word)
count = np.count_nonzero(np.array(re.findall(Pattern, data)), 0)
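
To make the "count along axis 0" step concrete, here is a sketch of the same column-wise count using only the standard library. It is not the answer's numpy code, just an equivalent under the assumption that the pattern has at least two capture groups; the sample data is the string used in the first answer.

```python
import re

PATTERN = r'(?i)(\bString1\b)|(\bString2\b)|(\bString3\b)'
data = "String1 String2 String2 String3\nString1 String1\nString3"

# One tuple per match, one slot per capture group
rows = re.findall(PATTERN, data)

# zip(*rows) transposes the rows into one column per group;
# counting the non-empty items in a column gives that word's total
counts = [sum(1 for item in column if item) for column in zip(*rows)]

print(counts)  # prints: [3, 2, 2]
```

Note that if nothing matches at all, rows is empty and counts is an empty list rather than a list of zeros.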
