Find files that don't match a specific regex on even lines

Problem

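The script reports every file because of how the regular expressions are matched and which lines they are applied to. The likely causes are:

  1. HTML entities in the regular expression: the pattern as posted contains HTML entity encodings (such as &lt;). Make sure these are replaced with the literal characters. For example, change (?&lt;!\d)(?&lt;!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d) to (?<!\d)(?<!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d)

  2. Swapped parity: enumerate counts from 0, so in is_valid_file the first line gets i = 0 and is_even is True. is_valid_line then applies the number pattern to the letter lines and the letter pattern to the number lines, so the very first line of every file already fails.

  3. Last line and line count: the number pattern cannot match a bare four-digit year such as 1997, so the last line has to be checked separately. The sample file also has an even number of lines, which the len(lines) % 2 == 0 test rejects, and ^[A-Z] rejects names that begin with a lowercase letter, such as theGkykyulobe.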
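
If the pattern was copied from a web page, the entities can be decoded programmatically instead of by hand. The following is a minimal sketch using `html.unescape` from the Python standard library; the variable names are illustrative and not part of the original script.

```python
import html
import re

# Pattern as it appears when copied from an HTML page (entities still encoded)
escaped = r"^.*?(?&lt;!\d)(?&lt;!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d).*$"

# html.unescape turns &lt; back into <, &#39; into ', &quot; into ", and so on
pattern = html.unescape(escaped)
number_re = re.compile(pattern)

print(pattern)
print(bool(number_re.match("2,006,910")))
```

Decoding the whole script text once, before saving it as a .py file, avoids having to hunt for individual entities.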

To fix these problems, the script can be modified as follows:

import os
import re

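NUMBER_RE = re.compile(r'^.*?(?<!\d)(?<!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d).*$')

def is_valid_line(line):
    # Letter lines may mix upper and lower case (e.g. "Botelt", "theGkykyulobe").
    # Assumption: they contain ASCII letters only, as in the sample files.
    return re.fullmatch(r'[A-Za-z]+', line) is not None

def is_valid_file(file_path):
    with open(file_path, 'r') as file:
        # Drop blank lines so trailing newlines do not shift the alternation
        lines = [line.strip() for line in file if line.strip()]

    # Letter/number pairs plus the final month/year pair give an even count
    if len(lines) < 2 or len(lines) % 2 != 0:
        return False

    # The last non-empty line must be a four-digit year
    if not re.fullmatch(r'\d{4}', lines[-1]):
        return False

    # The number pattern cannot match a bare year, so skip the last line here
    for i, line in enumerate(lines[:-1]):
        if i % 2 == 0:  # 1st, 3rd, 5th, ... line: letters
            if not is_valid_line(line):
                return False
        elif not NUMBER_RE.match(line):  # 2nd, 4th, ... line: number
            return False

    return True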

def find_invalid_files(directory_path):
    invalid_files = []
    for file_name in os.listdir(directory_path):
        file_path = os.path.join(directory_path, file_name)
        if file_name.endswith('.txt') and not is_valid_file(file_path):
            invalid_files.append(file_name)
    return invalid_files

if __name__ == "__main__":
    directory_path = r"E:\Desktop\social\Output_folder"
    invalid_files = find_invalid_files(directory_path)

    report_file = "invalid_files_report.txt"
    with open(report_file, "w") as f:
        if invalid_files:
            f.write("The following files do not follow the specified format:\n")
            for file_name in invalid_files:
                f.write(file_name + "\n")
        else:
            f.write("All files in the directory follow the specified format.\n")

    print("Report generated. Check 'invalid_files_report.txt' for details.")

With these changes, the script should check whether each file follows the required format according to your rules.
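
When a file is rejected, it helps to know which line broke the rules. The sketch below is one hypothetical way to report that; `first_problem` and its messages are illustrative names, not part of the original script. It assumes that letter lines contain only ASCII letters and that the year line follows a letter line, as in the sample files. Line numbers are 1-based and count non-empty lines only.

```python
import re

NUMBER_RE = re.compile(r'^.*?(?<!\d)(?<!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d).*$')
LETTERS_RE = re.compile(r'[A-Za-z]+')
YEAR_RE = re.compile(r'\d{4}')

def first_problem(text):
    """Return (line_number, reason) for the first rule violation, or None."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return (0, "file is empty")
    if not YEAR_RE.fullmatch(lines[-1]):
        return (len(lines), "last line is not a four-digit year")
    # Every line before the year must alternate: letters, number, letters, ...
    for number, line in enumerate(lines[:-1], start=1):
        if number % 2 == 1 and not LETTERS_RE.fullmatch(line):
            return (number, "expected a letters-only line")
        if number % 2 == 0 and not NUMBER_RE.match(line):
            return (number, "expected a comma-grouped number")
    if len(lines) % 2 != 0:
        return (len(lines), "year line should follow a letter line")
    return None

good = "Botelt\n2,006,910\nClasstertmates\n932,977\nNOVEMBER\n1997\n"
bad = "Ftyryriendttyysteyr\n71,734,802\n560,492,430\n51,682,046\nJUNE\n2009\n"

print(first_problem(good))  # None
print(first_problem(bad))   # (3, 'expected a letters-only line')
```

In the second sample the two consecutive number lines are caught because the third line is expected to be a letter line.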

Original question:

I have a large number of txt files in the E:\Desktop\social\Output_folder directory, and each file must have a format like the following list:

Botelt
2,006,910
Classtertmates
932,977
SiretexDettegrees
740,025
PlantrthyhetAll
410,810
theGkykyulobe
316,409
NOVEMBER
1997

This means that the files must have the following characteristics:

  1. Odd lines must contain only letters.
  2. Even lines must match only the following regex: ^.*?(?<!\d)(?<!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d).*
  3. The last non-empty line must contain only a 4-digit number like 2020 or 2014 (year format).
  4. Lines matching my regex cannot appear consecutively.
  5. Letter lines cannot appear consecutively.

Now I need a regex that finds files in the E:\Desktop\social\Output_folder directory that do not have the above characteristics. For example, the following list:

QrtQrt
316,935,269
Frtaceertbrtortok
220,138,444
Reertdertdertit
113,759,355
YourtretTrtuertbete
87,035,728
Tatjjuygguked
85,739,300
MyshtyhSpyrtyactye
81,000,349
Ftyryriendttyysteyr
71,734,802
560,492,430
51,682,046
Tutymrtybrtylr
51,245,350
Crtyltyatrysrtysmarytetys
41,314,645
Tjyozytonyje
38
VtyyjKyjontyjaktyje
29,011,910
JUNE
2009

If you look at the example above, 71,734,802, 560,492,430 and 51,682,046 appear consecutively.

I wrote the following Python script, which should check the files in my directory and find those with incorrect characteristics:

import os
import re

def is_valid_line(line, is_even):
    if is_even:
        return re.match(r'^.*?(?<!\d)(?<!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d).*$', line)
    else:
        return re.match(r'^[A-Z]', line)

def is_valid_file(file_path):
    with open(file_path, 'r') as file:
        lines = file.readlines()

        if len(lines) % 2 == 0:
            return False

        for i, line in enumerate(lines):
            is_even = i % 2 == 0
            if not is_valid_line(line.strip(), is_even):
                return False

        # Check if the last line is a four-digit number
        last_line = lines[-1].strip()
        if not re.match(r'^\d{4}$', last_line):
            return False

        return True

def find_invalid_files(directory_path):
    invalid_files = []
    for file_name in os.listdir(directory_path):
        if file_name.endswith('.txt'):
            file_path = os.path.join(directory_path, file_name)
            if not is_valid_file(file_path):
                invalid_files.append(file_name)
    return invalid_files

if __name__ == "__main__":
    directory_path = r"E:\Desktop\social\Output_folder"
    invalid_files = find_invalid_files(directory_path)

    report_file = "invalid_files_report.txt"
    with open(report_file, "w") as f:
        if invalid_files:
            f.write("The following files do not follow the specified format:\n")
            for file_name in invalid_files:
                f.write(file_name + "\n")
        else:
            f.write("All files in the directory follow the specified format.\n")

    print("Report generated. Check 'invalid_files_report.txt' for details.")

But my script is not working: it reports every file name.
Where is the problem in my script?

Answer 1

Score: 1

^.*?(?<!\d)(?<!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d).*

never matches a four-digit number(*), and thus it will always fail for the last line.

You need to avoid testing the last line with this pattern. For example, with

for i, line in enumerate(lines[:-1]):

(*) Found by trying it out. I can't parse that pattern well enough to explain why it doesn't work for a four-digit number.
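
A possible explanation, offered as a sketch rather than as the answerer's own analysis: the capturing group accepts only one to three digits before it requires a comma, the lookbehinds stop the match from starting in the middle of a digit run, and the lookahead stops it from ending in the middle of one. A bare run of four or more digits therefore has no valid start and end position. The checks below illustrate this; the sample strings are taken from the question, except for 12345, which is made up.

```python
import re

number_re = re.compile(r'^.*?(?<!\d)(?<!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d).*$')

samples = ["1997", "2009", "38", "932,977", "2,006,910", "12345"]

# True when the pattern matches the whole sample line, False otherwise
results = {s: bool(number_re.match(s)) for s in samples}

for s, ok in results.items():
    print(f"{s!r}: {'match' if ok else 'no match'}")
```

Only the comma-grouped numbers and the short number 38 match, which is why the year on the last line needs its own check.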

huangapple
  • Published on July 27, 2023 20:38:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76779821.html