How to replace specific spaces with newlines using a Python script?

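The likely cause is the combination of escaped parentheses in the patterns and doubled backslashes in the replacements. In a Python regex, \( and \) match literal parentheses, which the text files do not contain, so the patterns never match and define no capture groups. In a raw replacement string, \\1 is a literal backslash followed by 1, not a reference to group 1; group references are written with a single backslash, as in r'\1\n\2'. So write the groups with plain parentheses and reference them with single backslashes.

Here is the corrected code:

import os
import re

# Define the directory path where the text files are located
directory = r'E:\Desktop\Copy'

# Define the regex patterns and their replacements
patterns = [
    (r'(Netherlands) (United States)', r'\1\n\2'),
    (r'(United States) (Portugal)', r'\1\n\2'),
    # ... remaining patterns in the same format
]

# Iterate over the files in the directory
for filename in os.listdir(directory):
    if filename.endswith('.txt'):
        file_path = os.path.join(directory, filename)

        # Read the contents of the file
        with open(file_path, 'r', encoding='utf-8') as file:
            contents = file.read()

        # Apply the regex replacements
        for pattern, replacement in patterns:
            contents = re.sub(pattern, replacement, contents)

        # Write the modified contents back to the file
        with open(file_path, 'w', encoding='utf-8') as file:
            file.write(contents)

print("Replacements completed.")

With the parentheses unescaped and the group references written as \1 and \2, the script should apply the replacements to the text files.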
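Because the script prints "Replacements completed." whether or not anything matched, a silent non-match is easy to miss. A minimal sketch, assuming a hypothetical helper named apply_patterns (not part of the original script) and a made-up sample string, uses re.subn to count how many substitutions were actually made:

```python
import re

def apply_patterns(text, patterns):
    """Apply each (pattern, replacement) pair and count the substitutions.

    apply_patterns is a hypothetical helper, not part of the original script.
    """
    total = 0
    for pattern, replacement in patterns:
        # re.subn returns a tuple: (new_text, number_of_substitutions)
        text, count = re.subn(pattern, replacement, text)
        total += count
    return text, total

patterns = [
    (r'(Ottoman Empire) (Spain)', r'\1\n\2'),
    (r'(Japan) (Sweden)', r'\1\n\2'),
]

# Made-up sample text standing in for the contents of one file
sample = 'Ottoman Empire Spain\nJapan Sweden\n'
result, replaced = apply_patterns(sample, patterns)
print(replaced)
print(result)
```

A count of 0 for a file that should have changed is a quick sign that the patterns do not match the text.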

Question

I have about 3,000 txt files in a folder, and I want to replace certain spaces in them with newlines.
For example:

Ottoman Empire Spain
Japan Sweden
United States Tuscany
Brazil United States
Mexico United States
United States Brazil
United States Mexico
Romania Belgium
Spain Sweden
Spain United States
Sweden Belgium
Japan Kingdom of Italy

I want to convert the list above to the following:

Ottoman Empire
Spain
Japan
Sweden
United States
Tuscany
Brazil
United States
Mexico
United States
United States
Brazil
United States
Mexico
Romania
Belgium
Spain
Sweden
Spain
United States
Sweden
Belgium
Japan
Kingdom of Italy

I wrote the following script for it:

import os
import re
# Define the directory path where the text files are located
directory = r'E:\Desktop\Copy'
# Define the regex patterns and their replacements
patterns = [
(r'\(Netherlands\) \(United States\)', r'\\1\n\\2'),
(r'\(United States\) \(Portugal\)', r'\\1\n\\2'),
(r'\(United States\) \(Greece\)', r'\\1\n\\2'),
(r'\(United States\) \(Romania\)', r'\\1\n\\2'),
(r'\(Baden\) \(Tuscany\)', r'\\1\n\\2'),
(r'\(Portugal\) \(United States\)', r'\\1\n\\2'),
(r'\(Netherlands\) \(Romania\)', r'\\1\n\\2'),
(r'\(Ottoman Empire\) \(Spain\)', r'\\1\n\\2'),
(r'\(Japan\) \(Sweden\)', r'\\1\n\\2'),
(r'\(United States\) \(Tuscany\)', r'\\1\n\\2'),
(r'\(Brazil\) \(United States\)', r'\\1\n\\2'),
(r'\(Mexico\) \(United States\)', r'\\1\n\\2'),
(r'\(United States\) \(Brazil\)', r'\\1\n\\2'),
(r'\(United States\) \(Mexico\)', r'\\1\n\\2'),
(r'\(Romania\) \(Belgium\)', r'\\1\n\\2'),
(r'\(Spain\) \(Sweden\)', r'\\1\n\\2'),
(r'\(Spain\) \(United States\)', r'\\1\n\\2'),
(r'\(Sweden\) \(Belgium\)', r'\\1\n\\2'),
(r'\(Japan\) \(Kingdom of Italy\)', r'\\1\n\\2'),
(r'\(Sweden\) \(Bulgaria\)', r'\\1\n\\2'),
(r'\(Romania\) \(Sweden\)', r'\\1\n\\2'),
(r'\(Japan\) \(Ottoman Empire\)', r'\\1\n\\2'),
(r'\(Spain\) \(Romania\)', r'\\1\n\\2'),
(r'\(Japan\) \(Ethiopia\)', r'\\1\n\\2'),
(r'\(Belgium\) \(Portugal\)', r'\\1\n\\2'),
(r'\(Japan\) \(Republic of China\)', r'\\1\n\\2'),
(r'\(Japan\) \(Romania\)', r'\\1\n\\2'),
(r'\(Ethiopia\) \(Yugoslavia\)', r'\\1\n\\2'),
(r'\(Republic of China\) \(Austria-Hungary\)', r'\\1\n\\2'),
(r'\(Sweden\) \(Portugal\)', r'\\1\n\\2'),
(r'\(Czechoslovakia\) \(Ethiopia\)', r'\\1\n\\2'),
(r'\(Yugoslavia\) \(Czechoslovakia\)', r'\\1\n\\2'),
(r'\(Kingdom of Italy\) \(Russian SFSR\)', r'\\1\n\\2'),
(r'\(Spain\) \(Yugoslavia\)', r'\\1\n\\2'),
(r'\(Ethiopia\) \(Belgium\)', r'\\1\n\\2'),
(r'\(Weimar Republic\) \(Yugoslavia\)', r'\\1\n\\2'),
(r'\(N. Germany\) \(Yugoslavia\)', r'\\1\n\\2'),
(r'\(Czechoslovakia\) \(Yugoslavia\)', r'\\1\n\\2'),
(r'\(Romania\) \(Yugoslavia\)', r'\\1\n\\2'),
(r'\(Kingdom of Italy\) \(Soviet Union\)', r'\\1\n\\2'),
(r'\(Brazil\) \(Iran\)', r'\\1\n\\2'),
(r'\(Brazil\) \(Greece\)', r'\\1\n\\2'),
(r'\(Greece\) \(Brazil\)', r'\\1\n\\2'),
(r'\(Brazil\) \(Afghanistan\)', r'\\1\n\\2'),
(r'\(Afghanistan\) \(Mexico\)', r'\\1\n\\2'),
(r'\(Kingdom of Italy\) \(New Zealand\)', r'\\1\n\\2'),
(r'\(Saxony\) \(Baden\)', r'\\1\n\\2'),
(r'\(Poland\) \(Yugoslavia\)', r'\\1\n\\2'),
(r'\(Poland\) \(Romania\)', r'\\1\n\\2'),
(r'\(Poland\) \(North Korea\)', r'\\1\n\\2'),
(r'\(Poland\) \(Brazil\)', r'\\1\n\\2'),
(r'\(Pakistan\) \(Italy\)', r'\\1\n\\2'),
(r'\(Czechoslovakia\) \(Romania\)', r'\\1\n\\2'),
(r'\(Spain\) \(Poland\)', r'\\1\n\\2'),
(r'\(India\) \(Yugoslavia\)', r'\\1\n\\2'),
(r'\(Italy\) \(Poland\)', r'\\1\n\\2'),
(r'\(Czechoslovakia\) \(North Korea\)', r'\\1\n\\2'),
(r'\(Czechoslovakia\) \(Poland\)', r'\\1\n\\2'),
(r'\(Italy\) \(North Korea\)', r'\\1\n\\2'),
(r'\(Czechoslovakia\) \(Indonesia\)', r'\\1\n\\2'),
(r'\(Spain\) \(North Korea\)', r'\\1\n\\2'),
(r'\(Italy\) \(Yugoslavia\)', r'\\1\n\\2'),
(r'Indonesia \(West Germany\)', r'\\1\n\\2'),
(r'\(Portugal\) \(Denmark\)', r'\\1\n\\2'),
(r'\(United States\) \(Baden\)', r'\\1\n\\2'),
(r'\(Spain\) \(Italy\)', r'\\1\n\\2'),
(r'\(Spain\) \(Indonesia\)', r'\\1\n\\2'),
(r'\(West Germany\) \(Czechoslovakia\)', r'\\1\n\\2'),
(r'\(Soviet Union\) \(United States\)', r'\\1\n\\2'),
(r'\(West Germany\) \(Spain\)', r'\\1\n\\2'),
(r'\(North Korea\) \(Yugoslavia\)', r'\\1\n\\2'),
(r'\(K. Two Sicilies\) \(Netherlands\)', r'\\1\n\\2'),
(r'\(Brazil\) \(Saxony\)', r'\\1\n\\2'),
(r'\(Spain\) \(Vietnam\)', r'\\1\n\\2'),
(r'\(North Korea\) \(Spain\)', r'\\1\n\\2'),
(r'\(Mexico\) \(Brazil\)', r'\\1\n\\2'),
(r'\(West Germany\) \(North Korea\)', r'\\1\n\\2'),
(r'\(Brazil\) \(Egypt\)', r'\\1\n\\2'),
(r'\(Pakistan\) \(Brazil\)', r'\\1\n\\2'),
(r'\(West Germany\) \(Brazil\)', r'\\1\n\\2'),
(r'\(Italy\) \(Brazil\)', r'\\1\n\\2'),
(r'\(Brazil\) \(Spain\)', r'\\1\n\\2'),
(r'\(Italy\) \(Taiwan\)', r'\\1\n\\2'),
(r'\(Egypt\) \(Taiwan\)', r'\\1\n\\2'),
(r'\(Egypt\) \(Brazil\)', r'\\1\n\\2'),
(r'\(Pakistan\) \(Taiwan\)', r'\\1\n\\2'),
(r'\(Portugal\) \(Belgium\)', r'\\1\n\\2'),
(r'\(Iraq\) \(South Korea\)', r'\\1\n\\2'),
(r'\(Pakistan\) \(Poland\)', r'\\1\n\\2'),
(r'\(Pakistan\) \(Egypt\)', r'\\1\n\\2'),
(r'\(Italy\) \(West Germany\)', r'\\1\n\\2'),
(r'\(Portugal\) \(Sweden\)', r'\\1\n\\2'),
(r'\(Iran\) \(France\)', r'\\1\n\\2'),
(r'\(Italy\) \(Egypt\)', r'\\1\n\\2'),
(r'\(Iran\) \(Egypt\)', r'\\1\n\\2'),
(r'\(India\) \(North Korea\)', r'\\1\n\\2'),
(r'\(Ukraine\) \(Egypt\)', r'\\1\n\\2'),
(r'\(Iran\) \(Ukraine\)', r'\\1\n\\2'),
(r'\(Taiwan\) \(Egypt\)', r'\\1\n\\2'),
(r'\(Italy\) \(Iraq\)', r'\\1\n\\2'),
(r'\(Iraq\) \(France\)', r'\\1\n\\2'),
(r'\(Myanmar\) \(Taiwan\)', r'\\1\n\\2'),
(r'\(Syria\) \(Taiwan\)', r'\\1\n\\2'),
(r'\(Iraq\) \(Myanmar\)', r'\\1\n\\2'),
(r'\(Indonesia\) \(Thailand\)', r'\\1\n\\2'),
(r'\(Iraq\) \(Vietnam\)', r'\\1\n\\2'),
(r'\(Brazil\) \(Indonesia\)', r'\\1\n\\2'),
(r'\(Indonesia\) \(France\)', r'\\1\n\\2'),
(r'\(Egypt\) \(Myanmar\)', r'\\1\n\\2'),
(r'\(Saxony\) \(United States\)', r'\\1\n\\2'),
(r'\(Saudi Arabia\) \(Thailand\)', r'\\1\n\\2'),
(r'\(Iran\) \(South Korea\)', r'\\1\n\\2'),
(r'\(K. Two Sicilies\) \(Denmark\)', r'\\1\n\\2'),
(r'\(Kingdom of Hanover\) \(United States\)', r'\\1\n\\2'),
(r'\(Netherlands\) \(Mexico\)', r'\\1\n\\2'),
(r'\(Belgium\) \(K. Two Sicilies\)', r'\\1\n\\2'),
(r'\(United States\) \(Kingdom of Hanover\)', r'\\1\n\\2'),
(r'\(United States\) \(Denmark\)', r'\\1\n\\2'),
(r'\(United States\) \(Belgium\)', r'\\1\n\\2'),
(r'\(Portugal\) \(Kingdom of Hanover\)', r'\\1\n\\2'),
(r'\(Brazil\) \(Mexico\)', r'\\1\n\\2'),
(r'\(Saxony\) \(Denmark\)', r'\\1\n\\2'),
(r'\(Portugal\) \(Mexico\)', r'\\1\n\\2'),
(r'\(Brazil\) \(Kingdom of Hanover\)', r'\\1\n\\2'),
(r'\(Haiti\) \(United States\)', r'\\1\n\\2'),
(r'\(Spain\) \(Bavaria\)', r'\\1\n\\2'),
(r'\(Denmark\) \(Mexico\)', r'\\1\n\\2'),
(r'\(Denmark\) \(Saxony\)', r'\\1\n\\2'),
(r'\(Denmark\) \(Paraguay\)', r'\\1\n\\2'),
# Add more patterns here following the same format
]
# Iterate over the files in the directory
for filename in os.listdir(directory):
    if filename.endswith('.txt'):
        file_path = os.path.join(directory, filename)
        # Read the contents of the file
        with open(file_path, 'r', encoding='utf-8') as file:
            contents = file.read()
        # Apply the regex replacements
        for pattern, replacement in patterns:
            contents = re.sub(pattern, replacement, contents)
        # Write the modified contents back to the file
        with open(file_path, 'w', encoding='utf-8') as file:
            file.write(contents)
print("Replacements completed.")

When I run this script it prints "Replacements completed.", but when I check the txt files, no changes have been applied.

When I change \\1 and \\2 to \1 and \2 in my script, I get the following error:

Traceback (most recent call last):
File "E:\Desktop\scr\OCR\tetetert.py", line 152, in <module>
contents = re.sub(pattern, replacement, contents)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\re\__init__.py", line 185, in sub
return _compile(pattern, flags).sub(repl, string, count)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\re\__init__.py", line 317, in _subx
template = _compile_repl(template, pattern)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\re\__init__.py", line 308, in _compile_repl
return _parser.parse_template(repl, pattern)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\re\_parser.py", line 1072, in parse_template
addgroup(int(this[1:]), len(this) - 1)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\re\_parser.py", line 1008, in addgroup
raise s.error("invalid group reference %d" % index, pos)
re.error: invalid group reference 1 at position 1

Where is the problem?

Answer 1

Score: 1

The code has an excess of backslashes:

  • With raw strings, a group reference is written \1, not \\1.
  • Groups don't need \ : write them as plain (...). Escaped as \(...\), they match literal parentheses and capture nothing.
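The effect of each backslash can be checked on a single string. The sketch below is illustrative only; the sample text and variable names are assumptions, not taken from the asker's files:

```python
import re

text = 'Japan Sweden'

# Escaped parentheses match literal "(" and ")", which the text does not
# contain, so nothing is replaced and the text comes back unchanged.
unchanged = re.sub(r'\(Japan\) \(Sweden\)', r'\\1\n\\2', text)

# With escaped parentheses the pattern has no capture groups, so the
# reference \1 in the replacement is rejected with re.error.
try:
    re.sub(r'\(Japan\) \(Sweden\)', r'\1\n\2', text)
    error = None
except re.error as exc:
    error = str(exc)

# Plain parentheses create groups 1 and 2; single backslashes reference them.
fixed = re.sub(r'(Japan) (Sweden)', r'\1\n\2', text)

print(repr(unchanged))  # 'Japan Sweden'
print(error)
print(repr(fixed))      # 'Japan\nSweden'
```

The three calls correspond to the original script, the asker's second attempt, and the corrected form.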

Answer 2

Score: 1

Manually listing hundreds of replacement cases is not an efficient way to solve this problem. Instead, match all country names with a single alternation (e.g. Netherlands|United States|...) and join the list of matches with \n:

```py
import re

countries = [
  'Netherlands',
  'United States',
  'Portugal',
  'Greece',
  # ...
]

def format_country_name_files(file):
  content = file.read()
  # Try longer names first so a short name cannot shadow a longer one that starts with it
  ordered = sorted(countries, key=len, reverse=True)
  countries_escaped = (re.escape(country) for country in ordered)
  matches = re.findall('|'.join(countries_escaped), content)

  # Overwrite the file with one match per line
  file.seek(0)
  file.write('\n'.join(matches))
  file.truncate()
```

Note that re.findall returns only the matches, so any text in the file that is not in countries is dropped.

Try it:

```py
with open('.txt', 'r+') as file:
  format_country_name_files(file)

# File content:
'''
Ottoman Empire
Spain
Japan
Sweden
United States
...
'''
```
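If the files may contain text other than the listed names, a variant of the same alternation idea can replace only the space between two known names and leave everything else untouched. This is a minimal sketch, not the answerer's code: split_country_pairs is a hypothetical helper, and the short countries list and sample string are assumptions for illustration.

```python
import re

def split_country_pairs(text, countries):
    """Put a newline between two adjacent known country names.

    split_country_pairs is a hypothetical helper; `countries` must list
    every name that can appear in the files.
    """
    # Longest names first so a longer name wins over a shorter one it starts with
    names = sorted(countries, key=len, reverse=True)
    alternation = '|'.join(re.escape(name) for name in names)
    # A known name, one space, then (as a lookahead) another known name.
    # The lookahead does not consume the second name, so it can start the next match.
    pattern = re.compile(rf'\b({alternation}) (?=(?:{alternation})\b)')
    return pattern.sub(r'\1\n', text)

countries = [
    'Ottoman Empire', 'Spain', 'Japan', 'Sweden',
    'United States', 'Tuscany', 'Kingdom of Italy',
]

sample = 'Ottoman Empire Spain\nJapan Sweden\nUnited States Tuscany\nJapan Kingdom of Italy'
print(split_country_pairs(sample, countries))
```

Spaces inside a multi-word name such as United States are kept, because only a space that sits between two complete names is replaced.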
