在偶数行查找不存在特定正则表达式的文件。

huangapple go评论158阅读模式
英文:

Find files that don't exist a specific regex in even lines

问题

你的脚本问题可能出在正则表达式的匹配和判断条件上。以下是一些可能导致脚本不工作的问题:

  1. 正则表达式中的HTML实体:在你的脚本中,正则表达式中包含了HTML实体编码(如&lt;)。请确保将这些实体编码替换为它们的正常字符表示形式。例如,将(?&lt;!\d)(?&lt;!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d)改为(?<!\d)(?<!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d)

  2. 文件名检查:在find_invalid_files函数中,你正在检查文件名是否以.txt结尾,但实际上你应该检查文件的内容是否符合规定格式。要做到这一点,你需要读取文件的内容并检查它们,而不仅仅是文件名。

  3. 奇偶行检查:在is_valid_file函数中,你通过计算行数的奇偶性来检查行的格式。然而,这只是一种简单的方法,不能保证文件中的行交替出现。你需要根据实际的规则来检查行,而不是依赖于奇偶性。

为了解决这些问题,你可以修改你的脚本如下:

  1. import os
  2. import re
  3. def is_valid_line(line):
  4. return re.match(r'^[A-Z]+$', line) is not None
  5. def is_valid_file(file_path):
  6. with open(file_path, 'r') as file:
  7. lines = file.readlines()
  8. if len(lines) < 3:
  9. return False
  10. for i, line in enumerate(lines):
  11. if i % 2 == 0: # Even lines
  12. if not is_valid_line(line.strip()):
  13. return False
  14. else: # Odd lines
  15. if not re.match(r'^.*?(?<!\d)(?<!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d).*$', line.strip()):
  16. return False
  17. # Check if the last line is a four-digit number
  18. last_line = lines[-1].strip()
  19. if not re.match(r'^\d{4}$', last_line):
  20. return False
  21. return True
  22. def find_invalid_files(directory_path):
  23. invalid_files = []
  24. for file_name in os.listdir(directory_path):
  25. file_path = os.path.join(directory_path, file_name)
  26. if file_name.endswith('.txt') and not is_valid_file(file_path):
  27. invalid_files.append(file_name)
  28. return invalid_files
  29. if __name__ == "__main__":
  30. directory_path = r"E:\Desktop\social\Output_folder"
  31. invalid_files = find_invalid_files(directory_path)
  32. report_file = "invalid_files_report.txt"
  33. with open(report_file, "w") as f:
  34. if invalid_files:
  35. f.write("The following files do not follow the specified format:\n")
  36. for file_name in invalid_files:
  37. f.write(file_name + "\n")
  38. else:
  39. f.write("All files in the directory follow the specified format.\n")
  40. print("Report generated. Check 'invalid_files_report.txt' for details.")

这个修改后的脚本应该能够按照你的规则正确检查文件是否符合要求的格式。

英文:

I have high number of txt files in E:\Desktop\social\Output_folder directory and files must have a format like following list:

  1. Botelt
  2. 2,006,910
  3. Classtertmates
  4. 932,977
  5. SiretexDettegrees
  6. 740,025
  7. PlantrthyhetAll
  8. 410,810
  9. theGkykyulobe
  10. 316,409
  11. NOVEMBER
  12. 1997

This means that the files must have the following characteristics:

  1. Only odd lines must contain letters.
  2. even lines must contain only front regex: ^.*?(?&lt;!\d)(?&lt;!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d).*
  3. latest non-empty lines must contain only a 4 digit number like 2020 or 2014 (year format)
  4. Multiple number of my regex lines cannot be placed in consecutive.
  5. Multiple number of letter lines cannot be placed in consecutive.

Now I need a regex that find files in E:\Desktop\social\Output_folder directory that have not above characteristics. for example following list:

  1. QrtQrt
  2. 316,935,269
  3. Frtaceertbrtortok
  4. 220,138,444
  5. Reertdertdertit
  6. 113,759,355
  7. YourtretTrtuertbete
  8. 87,035,728
  9. Tatjjuygguked
  10. 85,739,300
  11. MyshtyhSpyrtyactye
  12. 81,000,349
  13. Ftyryriendttyysteyr
  14. 71,734,802
  15. 560,492,430
  16. 51,682,046
  17. Tutymrtybrtylr
  18. 51,245,350
  19. Crtyltyatrysrtysmarytetys
  20. 41,314,645
  21. Tjyozytonyje
  22. 38
  23. VtyyjKyjontyjaktyje
  24. 29,011,910
  25. JUNE
  26. 2009

If you look at the example above, 71,734,802 and 560,492,430 and 51,682,046 are in consecutive.

I wrote following python script that must check my directory files and find files with incorrect characteristics:

  1. import os
  2. import re
  3. def is_valid_line(line, is_even):
  4. if is_even:
  5. return re.match(r&#39;^.*?(?&lt;!\d)(?&lt;!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d).*$&#39;, line)
  6. else:
  7. return re.match(r&#39;^[A-Z]&#39;, line)
  8. def is_valid_file(file_path):
  9. with open(file_path, &#39;r&#39;) as file:
  10. lines = file.readlines()
  11. if len(lines) % 2 == 0:
  12. return False
  13. for i, line in enumerate(lines):
  14. is_even = i % 2 == 0
  15. if not is_valid_line(line.strip(), is_even):
  16. return False
  17. # Check if the last line is a four-digit number
  18. last_line = lines[-1].strip()
  19. if not re.match(r&#39;^\d{4}$&#39;, last_line):
  20. return False
  21. return True
  22. def find_invalid_files(directory_path):
  23. invalid_files = []
  24. for file_name in os.listdir(directory_path):
  25. if file_name.endswith(&#39;.txt&#39;):
  26. file_path = os.path.join(directory_path, file_name)
  27. if not is_valid_file(file_path):
  28. invalid_files.append(file_name)
  29. return invalid_files
  30. if __name__ == &quot;__main__&quot;:
  31. directory_path = r&quot;E:\Desktop\social\Output_folder&quot;
  32. invalid_files = find_invalid_files(directory_path)
  33. report_file = &quot;invalid_files_report.txt&quot;
  34. with open(report_file, &quot;w&quot;) as f:
  35. if invalid_files:
  36. f.write(&quot;The following files do not follow the specified format:\n&quot;)
  37. for file_name in invalid_files:
  38. f.write(file_name + &quot;\n&quot;)
  39. else:
  40. f.write(&quot;All files in the directory follow the specified format.\n&quot;)
  41. print(&quot;Report generated. Check &#39;invalid_files_report.txt&#39; for details.&quot;)

but my script not working and report me all files names.
where is my script problem?

答案1

得分: 1

  1. ^.*?(?&lt;!\d)(?&lt;!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d).*

不匹配四位数(*),因此它始终不会匹配最后一行。

你需要避免使用这个模式测试最后一行。例如,使用

  1. for i, line in enumerate(lines[:-1]):

(*) 尝试失败。我无法解析这个模式,无法解释为什么它不适用于四位数。

英文:
  1. ^.*?(?&lt;!\d)(?&lt;!\d,)(\d{1,3}(?:,\d{3})*)(?!,?\d).*

never matches a four-digit number(*), and thus it will always fail for the last line.

You need to avoid testing the last line with this pattern. For example, with

  1. for i, line in enumerate(lines[:-1]):

(*) from trying out. I can't parse that pattern well enough to explain why it doesn't work for a four-digit number.

huangapple
  • 本文由 发表于 2023年7月27日 20:38:54
  • 转载请务必保留本文链接:https://go.coder-hub.com/76779821.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定