How can I replace matches in a Python regex with a modified version of the match?
Question
I wrote this code to search the text files in a given folder for regex matches and to highlight them:
import re, os, sys
from pathlib import Path

#Usage: regs directory
try:
    if len(sys.argv) == 2:
        folder = sys.argv[1]
        fList = os.listdir(folder)
        uInput = input('input a regex: ')
        regObj = re.compile(f'''{uInput}''')
        wordReg = re.compile(r'''([A-Za-z0-9]+|\s+|[^\w\s]+)''')
        matches = []
        print(fList)
        for file in fList:
            if not os.path.isdir(Path(folder)/Path(file)):
                currentFileObj = open(f'{folder}/{file}')
                content = currentFileObj.readlines()
                currentFileObj.seek(0)
                text = currentFileObj.read()
                words = wordReg.findall(text)
                matches = list(filter(regObj.match, words))
                instances = 0
                print(f"matches in ({file}):\n'", end='')
                for word in words:
                    if word in matches:
                        print("\u0333".join(f"{word} "), end='')
                    else:
                        print(word, end='')
                print("'")
                for line in content:
                    matches = regObj.findall(line)
                    for match in matches:
                        print("\u0333".join(f"{match} "), end=' ')
                        print(f"in line number {content.index(line)+1}")
                        if match != '':
                            instances = instances + 1
                print(f'number of instances found: {instances}\n')
            else:
                continue
    else:
        print('Usage: regs directory')
except FileNotFoundError:
    print("that file doesn't exist.")
except PermissionError:
    print("you don't have permission to search that folder.")
It works for the most part, except for a few regular expressions: if the regex contains punctuation or a whitespace character next to other characters, the match is not underlined. It might work if I could find a way to substitute each match with a modified version of itself (replacing the match with an underlined version).
Does anyone know a fix?
For most other regexes the output looks fine, but in the first text file the match (out.) is not underlined.
I tried looking for a function that substitutes matches with a modified version of the match, but there doesn't appear to be one.
There are also two minor problems: whitespace and punctuation are not underlined properly, and the underline character doesn't show up in the Windows 7 command prompt. Maybe a character other than the underline would work?
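To make the failure concrete, here is a minimal sketch of why filtering the tokens with regObj.match misses a match such as out., while searching the whole text finds it. The sample text and pattern are made up for illustration; only wordReg is taken from the script above.

```python
import re

# Hypothetical sample text and user regex, chosen only for illustration
text = "time out. ok"
regObj = re.compile(r"out\.")

# Same tokenizer as in the script: words, whitespace runs, punctuation runs
wordReg = re.compile(r"([A-Za-z0-9]+|\s+|[^\w\s]+)")
words = wordReg.findall(text)

# "out" and "." end up in separate tokens, so no single token matches
token_matches = list(filter(regObj.match, words))

# Searching the untokenized text does find the match
text_matches = regObj.findall(text)

print(words)
print(token_matches)
print(text_matches)
```

Any regex whose match spans more than one token runs into the same problem, which is why the underlining step never sees it.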
Answer 1
Score: 0
If your goal is to underline the matches, you can change the printing logic to print an underlined version of each match by appending \u0332 (COMBINING LOW LINE) to every character:
underlined_match = "".join(ch + "\u0332" for ch in match)
print(underlined_match, end=' ')
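As a small sketch of that idea, the helper below appends U+0332 to each character of a string. The name underline is made up here and is not a standard library function. Whether the underline actually renders depends on the terminal and font.

```python
def underline(text: str) -> str:
    """Return text with U+0332 (COMBINING LOW LINE) after every character."""
    return "".join(ch + "\u0332" for ch in text)


# Hypothetical sample string
sample = underline("out.")
print(sample)
```

Appending the combining character after each character, rather than joining on it, also covers the last character without doubling the mark.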
Otherwise, if your goal is to change the tokenizing regex so that punctuation marks between normal characters (a-z, 0-9) stay in the same token, this regex might help (note that it still splits on whitespace):
(?:[A-Za-z0-9]+(?:[^\w\s]*[A-Za-z0-9]+[^\w\s]*)*)|(?:[^\w\s]+)
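A quick way to see what this pattern treats as one token is to run it over a sample string with re.findall. The sketch below uses a made-up sentence; tokenReg and sample are names chosen for illustration.

```python
import re

# The pattern from this answer, split over two lines for readability
tokenReg = re.compile(
    r"(?:[A-Za-z0-9]+(?:[^\w\s]*[A-Za-z0-9]+[^\w\s]*)*)"
    r"|(?:[^\w\s]+)"
)

# Hypothetical sample sentence
sample = "it's e.g. out."
tokens = tokenReg.findall(sample)
print(tokens)
```

With this sample, punctuation inside a word stays attached (it's, e.g.), but a period directly after a single word is still split off into its own token, so a regex such as out\. would still not match any single token.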
Answer 2
Score: 0
I've figured out the answer:
by passing a lambda function as the repl argument of re.sub, I was able to modify the matches and then substitute them back into the text.
import re, os, sys
from pathlib import Path

# Usage: regs directory
try:
    if len(sys.argv) == 2:
        folder = sys.argv[1]
        fList = os.listdir(folder)
        # List the folder contents as "a, b, c."
        print("folder contents:", ", ".join(fList) + ".\n")
        uInput = input('input a regex: ')
        print()
        regObj = re.compile(uInput)
        for file in fList:
            if os.path.isfile(Path(folder)/file):
                # Read the file once; the with block closes it afterwards
                with open(Path(folder)/file) as currentFileObj:
                    lines = currentFileObj.readlines()
                text = "".join(lines)
                instances = 0
                print(f"matches in ({file}):\n'", end='')
                # Wrap every match in parentheses via a replacement function
                print(regObj.sub(lambda match: "(" + match.group() + ")", text)+"'")
                # enumerate gives the right line number even for duplicate lines
                for lineNumber, line in enumerate(lines, start=1):
                    for match in regObj.finditer(line):
                        # Skip empty matches so they are neither printed nor counted
                        if match.group() != '':
                            print(f"({match.group()})", end=' ')
                            print(f"in line number {lineNumber}")
                            instances = instances + 1
                print(f'number of instances found: {instances}\n')
    else:
        print('Usage: regs directory')
except FileNotFoundError:
    print("that file doesn't exist.")
except PermissionError:
    print("you don't have permission to search that folder.")
Instead of looping over the list of the string's words, it now prints the whole text with each match wrapped in parentheses, like so:
print(regObj.sub(lambda match: "(" + match.group() + ")", text)+"'")
It also prints the folder contents now.
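The same technique can be sketched in isolation. In the example below, mark_matches, brackets and underline are hypothetical helper names, and the sample text is made up; the only library feature relied on is that re.sub accepts a function as its repl argument and calls it with each match object.

```python
import re


def mark_matches(pattern: str, text: str, wrap) -> str:
    """Replace every match of pattern in text with wrap(matched_string)."""
    # re.sub calls the function once per match and inserts its return value
    return re.sub(pattern, lambda m: wrap(m.group()), text)


def brackets(s: str) -> str:
    # Plain ASCII marker, which should display in any console
    return "(" + s + ")"


def underline(s: str) -> str:
    # U+0332 (COMBINING LOW LINE) may not render in every console
    return "".join(ch + "\u0332" for ch in s)


sample = "time out. ok"
print(mark_matches(r"out\.", sample, brackets))
print(mark_matches(r"out\.", sample, underline))
```

Passing the marker as a function keeps the substitution logic in one place, so switching between parentheses and an underline is a one-argument change.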