How to replace specific spaces with newlines using a Python script?

Question

The problem lies in two places. First, in a raw string a reference to a capturing group is written with a single backslash (\1, \2); \\1 is an escaped backslash followed by the digit 1, not a group reference. Second, \( and \) match literal parentheses, so the patterns define no groups and never match text that contains no parentheses. Capturing groups are written with plain parentheses.

Here is the corrected code:

    import os
    import re

    # Define the directory path where the text files are located
    directory = r'E:\Desktop\Copy'

    # Define the regex patterns and their replacements
    patterns = [
        (r'(Netherlands) (United States)', r'\1\n\2'),
        (r'(United States) (Portugal)', r'\1\n\2'),
        # ... and the remaining patterns
    ]

    # Iterate over the files in the directory
    for filename in os.listdir(directory):
        if filename.endswith('.txt'):
            file_path = os.path.join(directory, filename)
            # Read the contents of the file
            with open(file_path, 'r', encoding='utf-8') as file:
                contents = file.read()
            # Apply the regex replacements
            for pattern, replacement in patterns:
                contents = re.sub(pattern, replacement, contents)
            # Write the modified contents back to the file
            with open(file_path, 'w', encoding='utf-8') as file:
                file.write(contents)

    print("Replacements completed.")

With plain parentheses in the patterns and single-backslash references in the replacements, the script should run and apply the intended replacements to the text files.

Original (English):

I have about 3,000 txt files in a folder, and I want to replace some of the spaces in them with newlines.
For example:

    Ottoman Empire Spain
    Japan Sweden
    United States Tuscany
    Brazil United States
    Mexico United States
    United States Brazil
    United States Mexico
    Romania Belgium
    Spain Sweden
    Spain United States
    Sweden Belgium
    Japan Kingdom of Italy

I want to convert the list above to the following:

    Ottoman Empire
    Spain
    Japan
    Sweden
    United States
    Tuscany
    Brazil
    United States
    Mexico
    United States
    United States
    Brazil
    United States
    Mexico
    Romania
    Belgium
    Spain
    Sweden
    Spain
    United States
    Sweden
    Belgium
    Japan
    Kingdom of Italy

I wrote the following script for it:

    import os
    import re
    # Define the directory path where the text files are located
    directory = r'E:\Desktop\Copy'
    # Define the regex patterns and their replacements
    patterns = [
        (r'\(Netherlands\) \(United States\)', r'\\1\n\\2'),
        (r'\(United States\) \(Portugal\)', r'\\1\n\\2'),
        (r'\(United States\) \(Greece\)', r'\\1\n\\2'),
        (r'\(United States\) \(Romania\)', r'\\1\n\\2'),
        (r'\(Baden\) \(Tuscany\)', r'\\1\n\\2'),
        (r'\(Portugal\) \(United States\)', r'\\1\n\\2'),
        (r'\(Netherlands\) \(Romania\)', r'\\1\n\\2'),
        (r'\(Ottoman Empire\) \(Spain\)', r'\\1\n\\2'),
        (r'\(Japan\) \(Sweden\)', r'\\1\n\\2'),
        (r'\(United States\) \(Tuscany\)', r'\\1\n\\2'),
        (r'\(Brazil\) \(United States\)', r'\\1\n\\2'),
        (r'\(Mexico\) \(United States\)', r'\\1\n\\2'),
        (r'\(United States\) \(Brazil\)', r'\\1\n\\2'),
        (r'\(United States\) \(Mexico\)', r'\\1\n\\2'),
        (r'\(Romania\) \(Belgium\)', r'\\1\n\\2'),
        (r'\(Spain\) \(Sweden\)', r'\\1\n\\2'),
        (r'\(Spain\) \(United States\)', r'\\1\n\\2'),
        (r'\(Sweden\) \(Belgium\)', r'\\1\n\\2'),
        (r'\(Japan\) \(Kingdom of Italy\)', r'\\1\n\\2'),
        (r'\(Sweden\) \(Bulgaria\)', r'\\1\n\\2'),
        (r'\(Romania\) \(Sweden\)', r'\\1\n\\2'),
        (r'\(Japan\) \(Ottoman Empire\)', r'\\1\n\\2'),
        (r'\(Spain\) \(Romania\)', r'\\1\n\\2'),
        (r'\(Japan\) \(Ethiopia\)', r'\\1\n\\2'),
        (r'\(Belgium\) \(Portugal\)', r'\\1\n\\2'),
        (r'\(Japan\) \(Republic of China\)', r'\\1\n\\2'),
        (r'\(Japan\) \(Romania\)', r'\\1\n\\2'),
        (r'\(Ethiopia\) \(Yugoslavia\)', r'\\1\n\\2'),
        (r'\(Republic of China\) \(Austria-Hungary\)', r'\\1\n\\2'),
        (r'\(Sweden\) \(Portugal\)', r'\\1\n\\2'),
        (r'\(Czechoslovakia\) \(Ethiopia\)', r'\\1\n\\2'),
        (r'\(Yugoslavia\) \(Czechoslovakia\)', r'\\1\n\\2'),
        (r'\(Kingdom of Italy\) \(Russian SFSR\)', r'\\1\n\\2'),
        (r'\(Spain\) \(Yugoslavia\)', r'\\1\n\\2'),
        (r'\(Ethiopia\) \(Belgium\)', r'\\1\n\\2'),
        (r'\(Weimar Republic\) \(Yugoslavia\)', r'\\1\n\\2'),
        (r'\(N. Germany\) \(Yugoslavia\)', r'\\1\n\\2'),
        (r'\(Czechoslovakia\) \(Yugoslavia\)', r'\\1\n\\2'),
        (r'\(Romania\) \(Yugoslavia\)', r'\\1\n\\2'),
        (r'\(Kingdom of Italy\) \(Soviet Union\)', r'\\1\n\\2'),
        (r'\(Brazil\) \(Iran\)', r'\\1\n\\2'),
        (r'\(Brazil\) \(Greece\)', r'\\1\n\\2'),
        (r'\(Greece\) \(Brazil\)', r'\\1\n\\2'),
        (r'\(Brazil\) \(Afghanistan\)', r'\\1\n\\2'),
        (r'\(Afghanistan\) \(Mexico\)', r'\\1\n\\2'),
        (r'\(Kingdom of Italy\) \(New Zealand\)', r'\\1\n\\2'),
        (r'\(Saxony\) \(Baden\)', r'\\1\n\\2'),
        (r'\(Poland\) \(Yugoslavia\)', r'\\1\n\\2'),
        (r'\(Poland\) \(Romania\)', r'\\1\n\\2'),
        (r'\(Poland\) \(North Korea\)', r'\\1\n\\2'),
        (r'\(Poland\) \(Brazil\)', r'\\1\n\\2'),
        (r'\(Pakistan\) \(Italy\)', r'\\1\n\\2'),
        (r'\(Czechoslovakia\) \(Romania\)', r'\\1\n\\2'),
        (r'\(Spain\) \(Poland\)', r'\\1\n\\2'),
        (r'\(India\) \(Yugoslavia\)', r'\\1\n\\2'),
        (r'\(Italy\) \(Poland\)', r'\\1\n\\2'),
        (r'\(Czechoslovakia\) \(North Korea\)', r'\\1\n\\2'),
        (r'\(Czechoslovakia\) \(Poland\)', r'\\1\n\\2'),
        (r'\(Italy\) \(North Korea\)', r'\\1\n\\2'),
        (r'\(Czechoslovakia\) \(Indonesia\)', r'\\1\n\\2'),
        (r'\(Spain\) \(North Korea\)', r'\\1\n\\2'),
        (r'\(Italy\) \(Yugoslavia\)', r'\\1\n\\2'),
        (r'Indonesia \(West Germany\)', r'\\1\n\\2'),
        (r'\(Portugal\) \(Denmark\)', r'\\1\n\\2'),
        (r'\(United States\) \(Baden\)', r'\\1\n\\2'),
        (r'\(Spain\) \(Italy\)', r'\\1\n\\2'),
        (r'\(Spain\) \(Indonesia\)', r'\\1\n\\2'),
        (r'\(West Germany\) \(Czechoslovakia\)', r'\\1\n\\2'),
        (r'\(Soviet Union\) \(United States\)', r'\\1\n\\2'),
        (r'\(West Germany\) \(Spain\)', r'\\1\n\\2'),
        (r'\(North Korea\) \(Yugoslavia\)', r'\\1\n\\2'),
        (r'\(K. Two Sicilies\) \(Netherlands\)', r'\\1\n\\2'),
        (r'\(Brazil\) \(Saxony\)', r'\\1\n\\2'),
        (r'\(Spain\) \(Vietnam\)', r'\\1\n\\2'),
        (r'\(North Korea\) \(Spain\)', r'\\1\n\\2'),
        (r'\(Mexico\) \(Brazil\)', r'\\1\n\\2'),
        (r'\(West Germany\) \(North Korea\)', r'\\1\n\\2'),
        (r'\(Brazil\) \(Egypt\)', r'\\1\n\\2'),
        (r'\(Pakistan\) \(Brazil\)', r'\\1\n\\2'),
        (r'\(West Germany\) \(Brazil\)', r'\\1\n\\2'),
        (r'\(Italy\) \(Brazil\)', r'\\1\n\\2'),
        (r'\(Brazil\) \(Spain\)', r'\\1\n\\2'),
        (r'\(Italy\) \(Taiwan\)', r'\\1\n\\2'),
        (r'\(Egypt\) \(Taiwan\)', r'\\1\n\\2'),
        (r'\(Egypt\) \(Brazil\)', r'\\1\n\\2'),
        (r'\(Pakistan\) \(Taiwan\)', r'\\1\n\\2'),
        (r'\(Portugal\) \(Belgium\)', r'\\1\n\\2'),
        (r'\(Iraq\) \(South Korea\)', r'\\1\n\\2'),
        (r'\(Pakistan\) \(Poland\)', r'\\1\n\\2'),
        (r'\(Pakistan\) \(Egypt\)', r'\\1\n\\2'),
        (r'\(Italy\) \(West Germany\)', r'\\1\n\\2'),
        (r'\(Portugal\) \(Sweden\)', r'\\1\n\\2'),
        (r'\(Iran\) \(France\)', r'\\1\n\\2'),
        (r'\(Italy\) \(Egypt\)', r'\\1\n\\2'),
        (r'\(Iran\) \(Egypt\)', r'\\1\n\\2'),
        (r'\(India\) \(North Korea\)', r'\\1\n\\2'),
        (r'\(Ukraine\) \(Egypt\)', r'\\1\n\\2'),
        (r'\(Iran\) \(Ukraine\)', r'\\1\n\\2'),
        (r'\(Taiwan\) \(Egypt\)', r'\\1\n\\2'),
        (r'\(Italy\) \(Iraq\)', r'\\1\n\\2'),
        (r'\(Iraq\) \(France\)', r'\\1\n\\2'),
        (r'\(Myanmar\) \(Taiwan\)', r'\\1\n\\2'),
        (r'\(Syria\) \(Taiwan\)', r'\\1\n\\2'),
        (r'\(Iraq\) \(Myanmar\)', r'\\1\n\\2'),
        (r'\(Indonesia\) \(Thailand\)', r'\\1\n\\2'),
        (r'\(Iraq\) \(Vietnam\)', r'\\1\n\\2'),
        (r'\(Brazil\) \(Indonesia\)', r'\\1\n\\2'),
        (r'\(Indonesia\) \(France\)', r'\\1\n\\2'),
        (r'\(Egypt\) \(Myanmar\)', r'\\1\n\\2'),
        (r'\(Saxony\) \(United States\)', r'\\1\n\\2'),
        (r'\(Saudi Arabia\) \(Thailand\)', r'\\1\n\\2'),
        (r'\(Iran\) \(South Korea\)', r'\\1\n\\2'),
        (r'\(K. Two Sicilies\) \(Denmark\)', r'\\1\n\\2'),
        (r'\(Kingdom of Hanover\) \(United States\)', r'\\1\n\\2'),
        (r'\(Netherlands\) \(Mexico\)', r'\\1\n\\2'),
        (r'\(Belgium\) \(K. Two Sicilies\)', r'\\1\n\\2'),
        (r'\(United States\) \(Kingdom of Hanover\)', r'\\1\n\\2'),
        (r'\(United States\) \(Denmark\)', r'\\1\n\\2'),
        (r'\(United States\) \(Belgium\)', r'\\1\n\\2'),
        (r'\(Portugal\) \(Kingdom of Hanover\)', r'\\1\n\\2'),
        (r'\(Brazil\) \(Mexico\)', r'\\1\n\\2'),
        (r'\(Saxony\) \(Denmark\)', r'\\1\n\\2'),
        (r'\(Portugal\) \(Mexico\)', r'\\1\n\\2'),
        (r'\(Brazil\) \(Kingdom of Hanover\)', r'\\1\n\\2'),
        (r'\(Haiti\) \(United States\)', r'\\1\n\\2'),
        (r'\(Spain\) \(Bavaria\)', r'\\1\n\\2'),
        (r'\(Denmark\) \(Mexico\)', r'\\1\n\\2'),
        (r'\(Denmark\) \(Saxony\)', r'\\1\n\\2'),
        (r'\(Denmark\) \(Paraguay\)', r'\\1\n\\2'),
        # Add more patterns here following the same format
    ]
    # Iterate over the files in the directory
    for filename in os.listdir(directory):
        if filename.endswith('.txt'):
            file_path = os.path.join(directory, filename)
            # Read the contents of the file
            with open(file_path, 'r', encoding='utf-8') as file:
                contents = file.read()
            # Apply the regex replacements
            for pattern, replacement in patterns:
                contents = re.sub(pattern, replacement, contents)
            # Write the modified contents back to the file
            with open(file_path, 'w', encoding='utf-8') as file:
                file.write(contents)
    print("Replacements completed.")

When I run this script, it prints Replacements completed., but when I check the txt files, no changes have been applied!

When I change \\1 and \\2 to \1 and \2 in my script, I get the following error:

    Traceback (most recent call last):
      File "E:\Desktop\scr\OCR\tetetert.py", line 152, in <module>
        contents = re.sub(pattern, replacement, contents)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\re\__init__.py", line 185, in sub
        return _compile(pattern, flags).sub(repl, string, count)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\re\__init__.py", line 317, in _subx
        template = _compile_repl(template, pattern)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\re\__init__.py", line 308, in _compile_repl
        return _parser.parse_template(repl, pattern)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\re\_parser.py", line 1072, in parse_template
        addgroup(int(this[1:]), len(this) - 1)
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\re\_parser.py", line 1008, in addgroup
        raise s.error("invalid group reference %d" % index, pos)
    re.error: invalid group reference 1 at position 1

Where is the problem?

Answer 1

Score: 1

The code has too many backslashes:

  • In a raw string, a group reference is written \1, not \\1.
  • Groups don't need a backslash: write (...). The escaped forms \( and \) match literal parentheses and create no group.
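
To make both points concrete, here is a minimal sketch on a single in-memory string taken from the question's sample pairs. The variable names are illustrative and not part of the original script. It checks that the escaped-parenthesis pattern has no groups and leaves the text untouched, that a single-backslash reference against that pattern raises the re.error shown in the question, and that plain parentheses with \1\n\2 produce the split.

```python
import re

text = "Netherlands United States"

# \( and \) match literal parentheses, so this pattern has no capturing
# groups and cannot match text that contains no parentheses.
broken = re.compile(r'\(Netherlands\) \(United States\)')
group_count = broken.groups

# In a raw string, \\1 is an escaped backslash followed by "1". The template
# is accepted, nothing matches, and the text comes back unchanged.
unchanged = broken.sub(r'\\1\n\\2', text)

# A real group reference is rejected because group 1 does not exist.
try:
    broken.sub(r'\1\n\2', text)
    error_message = None
except re.error as exc:
    error_message = str(exc)

# Plain parentheses create groups 1 and 2; \1\n\2 puts a newline between them.
fixed = re.compile(r'(Netherlands) (United States)')
result = fixed.sub(r'\1\n\2', text)

print(group_count)
print(unchanged == text)
print(error_message)
print(result)
```

This explains why the original script printed Replacements completed. without changing anything: no pattern ever matched, so every file was written back with its contents unchanged.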

Answer 2

Score: 1

Manually listing hundreds of replacement cases is not an efficient solution to this problem. Instead, match all country names with a single alternation (e.g. Netherlands|United States|...) and join the resulting list of matches with \n:

    import re

    countries = [
        'Netherlands',
        'United States',
        'Portugal',
        'Greece',
        # ... (remaining country names)
    ]

    def format_country_name_files(file):
        # Read the whole file, collect every country name in order of
        # appearance, and rewrite the file with one name per line.
        content = file.read()
        countries_escaped = (re.escape(country) for country in countries)
        matches = re.findall('|'.join(countries_escaped), content)
        file.seek(0)
        file.write('\n'.join(matches))
        file.truncate()

Try it:

    with open('.txt', 'r+') as file:
        format_country_name_files(file)

    # File content:
    '''
    Ottoman Empire
    Spain
    Japan
    Sweden
    United States
    ...
    '''
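
A variation on the same idea, offered only as a sketch: instead of collecting the matches and rejoining them, replace just the space that sits between two known country names. The function name split_country_pairs and the short country list below are assumptions made for illustration. Unlike the findall-and-join approach, this leaves any text that is not a listed country name in place.

```python
import re

def split_country_pairs(text, countries):
    """Replace the single space between two listed country names with a newline."""
    # Longest names first, so a name is never cut short by a shorter name
    # that happens to be its prefix.
    names = sorted(countries, key=len, reverse=True)
    alternation = '|'.join(re.escape(name) for name in names)
    # Group 1 is a country name; it must be followed by one space and then
    # (checked by the lookahead, not consumed) another country name.
    pattern = re.compile(rf'({alternation}) (?={alternation})')
    return pattern.sub(r'\1\n', text)

# Illustrative subset of the names from the question.
countries = [
    'Ottoman Empire', 'Spain', 'Japan', 'Kingdom of Italy',
    'Italy', 'United States', 'Tuscany',
]
sample = "Ottoman Empire Spain\nJapan Kingdom of Italy\nUnited States Tuscany"
converted = split_country_pairs(sample, countries)
print(converted)
```

Because the second name is only looked at and not consumed, the spaces inside multi-word names such as Ottoman Empire or Kingdom of Italy are left alone, and a line holding more than two names would be split at every boundary.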

huangapple
  • Posted on June 22, 2023, 13:57:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76528939.html