How can I add a new line based on a keyword in unstructured data in Python?
Question
I have some text like this:
Forbes_Middle_East: 309, Building 4, Emaar Business Park , Dubai , United
Arab Emirates
Emirates_Neon_Group: No address
International_Cricket_Council: No address
Tourism_Development_Authority: The Ras AI Khaimah Tourism Development Authority
was established in May 2011 under the Government of
Ras AI Khaimah. Its purpose is to develop and
promote the emirate's tourism offering and
infrastructure, both domestically and abroad.
Wikipedia Dubai , United
Arab Emirates
Allsopp_&_Allsopp: No address
Lamprell: No address
My aim is to add a blank line before every address entry, so that it looks like this:
Forbes_Middle_East: 309, Building 4, Emaar Business Park , Dubai , United
Arab Emirates

Emirates_Neon_Group: No address

International_Cricket_Council: No address

Tourism_Development_Authority: The Ras AI Khaimah Tourism Development Authority
was established in May 2011 under the Government of
Ras AI Khaimah. Its purpose is to develop and
promote the emirate's tourism offering and
infrastructure, both domestically and abroad.
Wikipedia Dubai , United
Arab Emirates

Allsopp_&_Allsopp: No address

Lamprell: No address
So the only indicator that a new address starts is the ":" character. The issue here is text wrapping.
This is what I tried:
with open('test.txt', 'r') as infile:
    data = infile.read()
final_list = []
for ind, val in enumerate(data.split('\n')):
    final_list.append(val)
    if val == ':':
        final_list.insert(-1, '\n')
My logic works most of the time, but it fails for strings that have ":" in the middle, and it also fails when the text wraps.
Can you suggest a better way to do this?
Answer 1
Score: 3
Use regex substitution with re.sub on the address title (recognized by the format <line break><Some_address_title>:).
import re
txt = '''your_input_text''' # assuming your text
new_text = re.sub(r'\n[^\s:]+:', r'\n\g<0>', txt)
print(new_text)
Output:
Forbes_Middle_East: 309, Building 4, Emaar Business Park , Dubai , United
Arab Emirates

Emirates_Neon_Group: No address

International_Cricket_Council: No address

Tourism_Development_Authority: The Ras AI Khaimah Tourism Development Authority
was established in May 2011 under the Government of
Ras AI Khaimah. Its purpose is to develop and
promote the emirate's tourism offering and
infrastructure, both domestically and abroad.
Wikipedia

Allsopp_&_Allsopp: No address

Lamprell: No address
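For reference, here is a minimal self-contained sketch of the same regex idea, run on a shortened sample. The function name separate_entries and the sample text are made up for illustration, and the sketch swaps the leading \n in the pattern for re.MULTILINE so that ^ anchors at the start of each line. It assumes a title is a single token with no whitespace, immediately followed by a colon.

```python
import re

# Assumption: a title is the first token on a line, contains no whitespace,
# and is immediately followed by a colon (e.g. "Forbes_Middle_East:").
TITLE = re.compile(r"^([^\s:]+:)", re.MULTILINE)


def separate_entries(text):
    """Insert a blank line before every line that starts with a title."""
    # Put a newline in front of each title, then drop the one that lands
    # before the very first entry.
    return TITLE.sub(r"\n\1", text).lstrip("\n")


sample = (
    "Forbes_Middle_East: 309, Building 4, Emaar Business Park , Dubai , United\n"
    "Arab Emirates\n"
    "Emirates_Neon_Group: No address\n"
    "Lamprell: No address\n"
)

print(separate_entries(sample))
```

Because the pattern only looks at the first token of a line, a colon later in the line (a time such as 10:30, for example) does not trigger a break. A wrapped continuation line that happens to begin with a token ending in a colon would still be treated as a new entry.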
Answer 2
Score: 1
Read the input file line by line. Split each line into whitespace-delimited tokens. If the first token ends with a colon, write a newline to the output file first, except for the very first line. Something like this:
INPUT = '/Volumes/G-Drive/input.txt'
OUTPUT = '/Volumes/G-Drive/output.txt'
with open(INPUT) as _input, open(OUTPUT, 'w') as _output:
    for i, line in enumerate(_input):
        if (tokens := line.split()) and tokens[0][-1] == ':' and i > 0:
            _output.write('\n')
        _output.write(line)
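The same line-by-line logic can be tried without touching the file system. The sketch below is only an illustration: the helper name add_blank_lines and the in-memory sample are hypothetical, and io.StringIO stands in for the real input and output files.

```python
import io


def add_blank_lines(src, dst):
    """Copy lines from src to dst, writing a blank line before each new entry."""
    first = True
    for line in src:
        tokens = line.split()
        # A new entry starts when the first token ends with a colon.
        # The very first line is skipped so the output has no leading blank line.
        if tokens and tokens[0].endswith(":") and not first:
            dst.write("\n")
        dst.write(line)
        first = False


# In-memory stand-ins for the input and output files (sample data is made up).
src = io.StringIO(
    "Forbes_Middle_East: 309, Building 4, Dubai , United\n"
    "Arab Emirates\n"
    "Lamprell: No address\n"
)
dst = io.StringIO()
add_blank_lines(src, dst)
print(dst.getvalue())
```

Since the function accepts any iterable of lines and any object with a write method, the file handles from open() can be passed in directly, and the input is processed one line at a time rather than loaded whole.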
Answer 3
Score: 0
A simpler solution would be:
# `data` is the file content read in the question (data = infile.read())
formatted_data = []
for val in data.splitlines():
    if ":" in val:
        val = "\n" + val
    formatted_data.append(val)
print("\n".join(formatted_data))