Find files that don't match a specific regex on even lines
Question
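The problem in your script most likely lies in the regular-expression matching and in the conditions it tests. The following issues could prevent the script from working:
- HTML entities in the regex: the regex in your script contains HTML entity encodings (such as &lt;). Make sure these are replaced with the literal characters. For example, change (?&lt;!\d)(?&lt;!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d) to (?<!\d)(?<!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d).
- File-name check: in find_invalid_files, the .txt test only looks at the file name. Whether a file follows the required format depends on its content, so the content has to be read and checked, not just the name.
- Odd/even line check: in is_valid_file, the line format is checked according to line-number parity. This is a simple approach and does not by itself guarantee that the two kinds of line alternate in the file; each line needs to be checked against the actual rule for its position.
To address these issues, you can modify your script as follows: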
import os
import re

# Pattern for number lines, taken unchanged from the question.
NUMBER_PATTERN = r'^.*?(?<!\d)(?<!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d).*$'


def is_valid_line(line):
    # Letter lines: mixed case is allowed (e.g. "Botelt").
    return re.match(r'^[A-Za-z]+$', line) is not None


def is_valid_file(file_path):
    with open(file_path, 'r') as file:
        # Ignore blank lines so a trailing empty line does not hide the year.
        lines = [line.strip() for line in file if line.strip()]
    # Letter/number pairs plus the final year give an even line count.
    if len(lines) < 3 or len(lines) % 2 != 0:
        return False
    # The last line is the year; it is checked separately below.
    for i, line in enumerate(lines[:-1]):
        if i % 2 == 0:  # 1st, 3rd, 5th, ... line: letters
            if not is_valid_line(line):
                return False
        else:  # 2nd, 4th, 6th, ... line: number
            if not re.match(NUMBER_PATTERN, line):
                return False
    # Check that the last line is a four-digit number
    if not re.match(r'^\d{4}$', lines[-1]):
        return False
    return True


def find_invalid_files(directory_path):
    invalid_files = []
    for file_name in os.listdir(directory_path):
        file_path = os.path.join(directory_path, file_name)
        if file_name.endswith('.txt') and not is_valid_file(file_path):
            invalid_files.append(file_name)
    return invalid_files


if __name__ == "__main__":
    directory_path = r"E:\Desktop\social\Output_folder"
    invalid_files = find_invalid_files(directory_path)
    report_file = "invalid_files_report.txt"
    with open(report_file, "w") as f:
        if invalid_files:
            f.write("The following files do not follow the specified format:\n")
            for file_name in invalid_files:
                f.write(file_name + "\n")
        else:
            f.write("All files in the directory follow the specified format.\n")
    print("Report generated. Check 'invalid_files_report.txt' for details.")
With these changes, the script should check whether each file follows the format according to your rules.
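When every file ends up in the report, it helps to see which rule a file breaks first. The following is a minimal sketch of that idea, not part of the original script. The helper name first_violation and the three patterns are assumptions made for illustration; in particular, the number pattern here is a stricter stand-in for the one in the question, and line numbers count non-empty lines only.

```python
import re

# Assumptions for illustration: ASCII letters only, and a stricter
# number pattern than the one used in the question.
LETTERS_RE = re.compile(r'^[A-Za-z]+$')
NUMBER_RE = re.compile(r'^\d{1,3}(?:,\d{3})*$')
YEAR_RE = re.compile(r'^\d{4}$')


def first_violation(lines):
    """Return a description of the first broken rule, or None if all rules hold."""
    rows = [line.strip() for line in lines if line.strip()]
    if len(rows) < 2 or len(rows) % 2 != 0:
        return "expected an even number of non-empty lines"
    if not YEAR_RE.match(rows[-1]):
        return f"last line {rows[-1]!r} is not a four-digit year"
    # Everything before the year must alternate: letters, number, letters, ...
    for index, row in enumerate(rows[:-1]):
        wants_letters = index % 2 == 0
        pattern = LETTERS_RE if wants_letters else NUMBER_RE
        if not pattern.match(row):
            kind = "letters" if wants_letters else "a number"
            return f"line {index + 1}: expected {kind}, got {row!r}"
    return None


good = ["Botelt\n", "2,006,910\n", "NOVEMBER\n", "1997\n"]
bad = ["Ftyr\n", "71,734,802\n", "560,492,430\n", "51,682,046\n", "JUNE\n", "2009\n"]
print(first_violation(good))
print(first_violation(bad))
```

Writing the returned message next to each file name in the report would show at a glance whether all files fail for the same reason, which usually points to a bug in the checker and not in the files.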
I have a large number of .txt files in the E:\Desktop\social\Output_folder
directory, and each file must follow a format like this:
Botelt
2,006,910
Classtertmates
932,977
SiretexDettegrees
740,025
PlantrthyhetAll
410,810
theGkykyulobe
316,409
NOVEMBER
1997
This means that the files must have the following characteristics:
- Odd lines must contain only letters.
- Even lines must match only the following regex:
^.*?(?<!\d)(?<!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d).*
- The last non-empty line must contain only a four-digit number such as 2020 or 2014 (a year).
- Lines matching my regex must not appear consecutively.
- Letter lines must not appear consecutively.
Now I need a regex that finds the files in the E:\Desktop\social\Output_folder
directory that do not have the characteristics above. For example:
QrtQrt
316,935,269
Frtaceertbrtortok
220,138,444
Reertdertdertit
113,759,355
YourtretTrtuertbete
87,035,728
Tatjjuygguked
85,739,300
MyshtyhSpyrtyactye
81,000,349
Ftyryriendttyysteyr
71,734,802
560,492,430
51,682,046
Tutymrtybrtylr
51,245,350
Crtyltyatrysrtysmarytetys
41,314,645
Tjyozytonyje
38
VtyyjKyjontyjaktyje
29,011,910
JUNE
2009
If you look at the example above, the lines 71,734,802, 560,492,430 and 51,682,046 are consecutive.
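As an illustration only (this is not part of the script in the question), a run of consecutive lines of the same kind can be located by classifying each line and comparing it with the previous one. The names classify and find_consecutive are hypothetical, and the number pattern is a simplified assumption that accepts only comma-grouped digits.

```python
import re

# Assumed, simplified patterns for this sketch.
NUMBER_LINE = re.compile(r'^\d{1,3}(?:,\d{3})*$')
LETTER_LINE = re.compile(r'^[A-Za-z]+$')


def classify(line):
    """Label a line as 'number', 'letters' or 'other'."""
    text = line.strip()
    if NUMBER_LINE.match(text):
        return "number"
    if LETTER_LINE.match(text):
        return "letters"
    return "other"


def find_consecutive(lines):
    """Return (line_number, kind) for each line that repeats the previous line's kind."""
    hits = []
    previous = None
    for number, line in enumerate(lines, start=1):
        kind = classify(line)
        if kind != "other" and kind == previous:
            hits.append((number, kind))
        previous = kind
    return hits


sample = ["Ftyryriendttyysteyr", "71,734,802", "560,492,430",
          "51,682,046", "Tutymrtybrtylr"]
print(find_consecutive(sample))  # [(3, 'number'), (4, 'number')]
```

A four-digit year such as 2009 is labelled 'other' here because it has no comma grouping, so the closing month/year pair is not reported as a repeat.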
I wrote the following Python script, which should check the files in my directory and find those with incorrect characteristics:
import os
import re

def is_valid_line(line, is_even):
    if is_even:
        return re.match(r'^.*?(?<!\d)(?<!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d).*$', line)
    else:
        return re.match(r'^[A-Z]', line)

def is_valid_file(file_path):
    with open(file_path, 'r') as file:
        lines = file.readlines()
    if len(lines) % 2 == 0:
        return False
    for i, line in enumerate(lines):
        is_even = i % 2 == 0
        if not is_valid_line(line.strip(), is_even):
            return False
    # Check if the last line is a four-digit number
    last_line = lines[-1].strip()
    if not re.match(r'^\d{4}$', last_line):
        return False
    return True

def find_invalid_files(directory_path):
    invalid_files = []
    for file_name in os.listdir(directory_path):
        if file_name.endswith('.txt'):
            file_path = os.path.join(directory_path, file_name)
            if not is_valid_file(file_path):
                invalid_files.append(file_name)
    return invalid_files

if __name__ == "__main__":
    directory_path = r"E:\Desktop\social\Output_folder"
    invalid_files = find_invalid_files(directory_path)
    report_file = "invalid_files_report.txt"
    with open(report_file, "w") as f:
        if invalid_files:
            f.write("The following files do not follow the specified format:\n")
            for file_name in invalid_files:
                f.write(file_name + "\n")
        else:
            f.write("All files in the directory follow the specified format.\n")
    print("Report generated. Check 'invalid_files_report.txt' for details.")
But my script does not work: it reports the names of all files.
Where is the problem in my script?
Answer 1
Score: 1
^.*?(?<!\d)(?<!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d).*
never matches a four-digit number(*), and thus it will always fail for the last line.
You need to avoid testing the last line with this pattern. For example, with
for i, line in enumerate(lines[:-1]):
(*) Found by trying it out. I can't parse that pattern well enough to explain why it doesn't match a four-digit number.
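The behaviour described above can be reproduced with a short check. This is a sketch using the pattern from the question unchanged; the helper name matches_number_pattern is made up for the example.

```python
import re

# The number pattern from the question, unchanged.
NUMBER_RE = re.compile(r'^.*?(?<!\d)(?<!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d).*$')


def matches_number_pattern(line):
    """Return True when the question's number pattern matches the line."""
    return NUMBER_RE.match(line) is not None


for sample in ["316,409", "38", "1997", "2009", "Botelt"]:
    print(sample, matches_number_pattern(sample))
# 316,409 True
# 38 True
# 1997 False
# 2009 False
# Botelt False
```

A likely reading of why a plain four-digit number fails: the group \d{1,3} can take at most three digits before a comma is required, the lookahead (?!,?\d) rejects a match that is followed directly by another digit, and the lookbehind (?<!\d) rejects a match that starts in the middle of a run of digits. For 1997 every possible starting position is ruled out by one of these conditions, so the year line has to be tested with its own pattern (^\d{4}$) and excluded from the loop.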