How to search for one or more strings in a file using regex, and count the number of each string separately?

Question

I am trying to find one or more strings in each line of a file and count how many times each string occurs in the file overall. Some lines contain only one of the strings, but other lines may contain several of the target strings. I am trying to do this with a regular expression.

Here is what I've tried, having already read the file and split it into lines with .readlines():

import re

count1 = 0
count2 = 0
count3 = 0

Pattern = r'(?i)(\bString1\b)|(\bString2\b)|(\bString3\b)'

i = 0
while i != len(lines):
    # re.search returns only the first match in the line
    match = re.search(Pattern, lines[i])

    if match:
        if match.group(1):
            count1 = count1 + 1
        elif match.group(2):
            count2 = count2 + 1
        elif match.group(3):
            count3 = count3 + 1
    i = i + 1

This works when a line contains no more than one match, but when there are several it only counts the first match and then moves on. Is there a way to scan the whole line? I know re.findall finds all matches, but it returns them in a list, and I don't know how to reliably count the matches for each word, since the matches would be at different indexes in that list on each pass through the loop.

Answer 1

Score: 1

You can use findall and count occurrences at the end.
For example:

import re

data = "String1 String2 String2 String3\nString1 String1\nString3"
Pattern = r'(?i)(\bString1\b)|(\bString2\b)|(\bString3\b)'
lines = data.split('\n')

# Collect the findall() results of every line in one list
all_matches = []
for line in lines:
    all_matches.extend(re.findall(Pattern, line))

# Only the slot of the group that matched is non-empty. Testing for a
# non-empty slot (rather than == 'String1') also counts 'string1',
# which the (?i) flag allows.
count1 = len([el for el in all_matches if el[0]])
count2 = len([el for el in all_matches if el[1]])
count3 = len([el for el in all_matches if el[2]])

print(count1, count2, count3)  # 3 2 2

Note: because the pattern contains three groups, findall returns a list of tuples, where the first item of each tuple corresponds to the first group, and so on.

all_matches is therefore a list of tuples, each of the shape (match for String1, match for String2, match for String3). A group that did not take part in the match is '', so the list looks something like this:

[('String1', '', ''), ('', 'String2', ''), ('', 'String2', ''), ...]

To calculate count1, for example, we build a list of the tuples in which String1 matched (the condition being that the first element of the tuple is non-empty):

first_group = [el for el in all_matches if el[0]]

and then take the length of that list as the value of count1:

count1 = len(first_group)
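
As a possible alternative to filtering the tuples afterwards, the counts can be updated while iterating. The sketch below is not part of the original answer: it assumes the same sample data and pattern, and relies on Match.lastindex, which for a pattern made of top-level alternatives like this one is the number of the group that matched. The names pattern and counts are chosen here for illustration.

```python
import re

# Sample data and pattern reused from the answer above
data = "String1 String2 String2 String3\nString1 String1\nString3"
pattern = re.compile(r'(?i)(\bString1\b)|(\bString2\b)|(\bString3\b)')

# counts[0] belongs to group 1, counts[1] to group 2, and so on
counts = [0] * pattern.groups

for line in data.split('\n'):
    for match in pattern.finditer(line):
        # lastindex is the number of the group that took part in this match
        counts[match.lastindex - 1] += 1

print(counts)  # [3, 2, 2]
```

This avoids building the intermediate list of tuples, and it keeps working if more alternatives are added to the pattern, since the list is sized from pattern.groups.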

Answer 2

Score: 1

In your example, the matches are all static strings, so you can just use them as dictionary keys for a Counter object.

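import re
from collections import Counter

# lines and Pattern are the ones defined in the question
count = Counter()
for line in lines:
    for match in re.finditer(Pattern, line):
        # Index by the matched text; count.update(text) would count its characters
        count[match.group(0)] += 1

for k in count:
    print(f"{count[k]} occurrences of {k}")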

A useful change here is using re.finditer() instead of re.findall(): it yields proper re.Match objects, from which you can extract the matched string with .group(0), as well as various other attributes should you need them.
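
To illustrate what a Match object offers, here is a small sketch that records the matched text and its position for each match. The sample line is made up for this example; the pattern is the one from the question.

```python
import re

Pattern = r'(?i)(\bString1\b)|(\bString2\b)|(\bString3\b)'
line = "String1 and string2, then String2 again"  # made-up sample line

found = []
for match in re.finditer(Pattern, line):
    # group(0) is the matched text, span() its (start, end) offsets in the line
    found.append((match.group(0), match.span()))

print(found)
```

Because of the (?i) flag, both 'string2' and 'String2' are matched here, and they are different strings. A Counter keyed on .group(0) would count them separately unless the text is normalized first, for example with .lower().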

If you need to extract matches which could contain variations, like r"c[ei]*ling" or r"\d+", you can't use the matched strings as dictionary keys, because the Counter would then count each unique string as a separate entity: you would get "12 occurrences of 123" and "1 occurrence of 234" instead of "13 occurrences of \d+". In that case, I would perhaps try to use named subgroups.

for line in lines:
    for match in re.finditer(r"(?P<ceiling>c[ei]*ling)|(?P<number>\d+)", line):
        matches = match.groupdict()
        for key in matches:
            if matches[key] is not None:
                # Count by group name rather than by matched text
                count[key] += 1
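
A self-contained sketch of the named-group idea follows. It is not from the original answer: the sample lines are invented, and it uses Match.lastgroup (the name of the group that matched) in place of the loop over groupdict(), which should be equivalent when the pattern consists only of top-level named alternatives.

```python
import re
from collections import Counter

# Made-up sample text; the pattern is the named-group pattern from above
lines = ["the ceiling is 240 cm high", "cling to 2 of 15 ropes"]
pattern = re.compile(r"(?P<ceiling>c[ei]*ling)|(?P<number>\d+)")

count = Counter()
for line in lines:
    for match in pattern.finditer(line):
        # lastgroup is the name of the group that matched
        count[match.lastgroup] += 1

for name, n in count.items():
    print(f"{n} occurrences of {name}")
```

Here "ceiling" and "cling" are both counted under ceiling, and 240, 2 and 15 are all counted under number, regardless of the exact text that matched.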

Answer 3

Score: 1

Another variant is to use numpy and its count_nonzero function. Since there's no need to split the data into lines, let's assume it is all in data:

import re
import numpy as np

# Pattern is the one from the question; data holds the whole text.
# Count non-empty strings along axis 0 (the matches for each word).
count = np.count_nonzero(np.array(re.findall(Pattern, data)), 0)
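
If numpy is not available, the same column-wise count can be sketched with the standard library only. This is an alternative under the same assumptions (the question's pattern, the sample data from the first answer), not the method of the answer above.

```python
import re

data = "String1 String2 String2 String3\nString1 String1\nString3"
Pattern = r'(?i)(\bString1\b)|(\bString2\b)|(\bString3\b)'

rows = re.findall(Pattern, data)  # one tuple per match, one slot per group

# zip(*rows) turns the rows into columns, one column per group
count = [sum(1 for text in column if text) for column in zip(*rows)]

print(count)  # [3, 2, 2]
```

One caveat: if the text contains no matches at all, rows is empty and count ends up as an empty list rather than a list of zeros.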

huangapple
  • Posted on April 19, 2023, 18:43:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76053588.html