How can I add a new line based on a keyword in unstructured data with Python?

Question

I have some text like this:

Forbes_Middle_East: 309, Building 4, Emaar Business Park , Dubai , United
Arab Emirates
Emirates_Neon_Group:  No address
International_Cricket_Council:  No address
Tourism_Development_Authority: The Ras AI Khaimah Tourism Development Authority
was established in May 2011 under the Government of
Ras AI Khaimah. Its purpose is to develop and
promote the emirate's tourism offering and
infrastructure, both domestically and abroad.
Wikipedia Dubai , United
Arab Emirates
Allsopp_&_Allsopp:  No address
Lamprell:  No address

My aim is to add a blank line before every address, so that it looks like this:

Forbes_Middle_East: 309, Building 4, Emaar Business Park , Dubai , United
Arab Emirates

Emirates_Neon_Group:  No address

International_Cricket_Council:  No address

Tourism_Development_Authority: The Ras AI Khaimah Tourism Development Authority
was established in May 2011 under the Government of
Ras AI Khaimah. Its purpose is to develop and
promote the emirate's tourism offering and
infrastructure, both domestically and abroad.
Wikipedia Dubai , United
Arab Emirates

Allsopp_&_Allsopp:  No address

Lamprell:  No address

So the only indicator that a new address starts is the : after its title. The issue here is text wrapping.

This is what I have tried:

with open('test.txt', 'r') as infile:
    data = infile.read()
final_list = []
for ind, val in enumerate(data.split('\n')):
    final_list.append(val)
    if val == ':':
        final_list.insert(-1, '\n')

My logic works most of the time, but it fails when a string has a : in the middle, and it also fails when the text wraps.

Can you suggest a better way to do this?

Answer 1

Score: 3

Use regex substitution with re.sub on each address title (recognized by the format <line break><Some_address_title>:):

import re

txt = '''your_input_text'''  # placeholder for your input text
new_text = re.sub(r'\n[^\s:]+:', r'\n\g<0>', txt)
print(new_text)

Output:

Forbes_Middle_East: 309, Building 4, Emaar Business Park , Dubai , United
Arab Emirates

Emirates_Neon_Group:  No address

International_Cricket_Council:  No address

Tourism_Development_Authority: The Ras AI Khaimah Tourism Development Authority
was established in May 2011 under the Government of
Ras AI Khaimah. Its purpose is to develop and
promote the emirate's tourism offering and
infrastructure, both domestically and abroad.
Wikipedia

Allsopp_&_Allsopp:  No address

Lamprell:  No address
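
A variant of the substitution above, as a minimal sketch rather than the answer's own code: instead of consuming the line break, it uses a lookbehind and a lookahead so that a blank line is inserted only where a title follows a non-empty line. That keeps the first line free of a leading blank line and makes a second run a no-op. The sample text and the name add_blank_lines are assumptions made for illustration, and a wrapped line that happens to start with a word ending in : would still be treated as a title.

```python
import re

# Hypothetical sample; assumes titles contain no whitespace and end with ':'.
SAMPLE = (
    "Forbes_Middle_East: 309, Building 4, Emaar Business Park , Dubai , United\n"
    "Arab Emirates\n"
    "Emirates_Neon_Group:  No address\n"
    "Tourism_Development_Authority: Opening hours: 9 to 5\n"
    "Lamprell:  No address\n"
)

# Zero-width match at the start of a line that
#   - follows a non-empty line:        (?<=[^\n]\n)
#   - begins with a title token "x:":  (?=[^\s:]+:)
TITLE_START = re.compile(r"(?<=[^\n]\n)(?=[^\s:]+:)")


def add_blank_lines(text: str) -> str:
    """Insert one extra newline before each title line that lacks a blank line above it."""
    return TITLE_START.sub("\n", text)


if __name__ == "__main__":
    print(add_blank_lines(SAMPLE))
```

Colons in the middle of a line, such as the one in "Opening hours: 9 to 5", are ignored because the pattern can only match directly after a line break.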

Answer 2

Score: 1

Read the input file line by line. Split each line into whitespace-delimited tokens. If the first token ends with a colon, write a newline to the output file before the line, except for the first line. Something like this:

INPUT = '/Volumes/G-Drive/input.txt'
OUTPUT = '/Volumes/G-Drive/output.txt'

with open(INPUT) as _input, open(OUTPUT, 'w') as _output:
    for i, line in enumerate(_input):
        if (tokens := line.split()) and tokens[0][-1] == ':' and i > 0:
            _output.write('\n')
        _output.write(line)
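
The same line-by-line idea can be sketched as a generator over any iterable of lines, which makes it easy to try without touching files. The name insert_blank_lines and the sample lines are assumptions for illustration. Unlike the loop above, this sketch also skips the extra newline when the previous line is already blank.

```python
from typing import Iterable, Iterator


def insert_blank_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line, preceded by a blank line if its first token ends with ':'."""
    # Treat the start of the input as blank so no leading blank line is emitted.
    previous_blank = True
    for line in lines:
        tokens = line.split()
        if tokens and tokens[0].endswith(":") and not previous_blank:
            yield "\n"
        yield line
        previous_blank = not tokens


# Hypothetical sample lines, each ending with a newline as when iterating a file.
sample = [
    "Forbes_Middle_East: 309, Building 4 , Dubai , United\n",
    "Arab Emirates\n",
    "Emirates_Neon_Group:  No address\n",
    "Lamprell:  No address\n",
]
result = "".join(insert_blank_lines(sample))
print(result)
```

Because a file object is itself an iterable of lines, the generator could be passed an open input file and its output handed to writelines on the output file.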

Answer 3

Score: 0

The simpler solution would be:

# data is the file content read in the question
formatted_data = []
for val in data.splitlines():
    if ":" in val:
        val = "\n" + val
    formatted_data.append(val)

print("\n".join(formatted_data))
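
One caveat, shown as a small sketch: testing for ":" anywhere in the line also matches wrapped lines that merely contain a colon, and it puts a blank line before the very first line. The comparison below contrasts that test with a stricter one that looks only at the first whitespace-delimited token. The function names and the sample string are assumptions for illustration.

```python
def format_loose(data: str) -> str:
    """Prepend a newline to every line that contains ':' anywhere."""
    out = []
    for val in data.splitlines():
        if ":" in val:
            val = "\n" + val
        out.append(val)
    return "\n".join(out)


def format_strict(data: str) -> str:
    """Prepend a newline only when the first token ends with ':' (never on line 0)."""
    out = []
    for i, val in enumerate(data.splitlines()):
        first = val.split(maxsplit=1)[0] if val.strip() else ""
        if i > 0 and first.endswith(":"):
            val = "\n" + val
        out.append(val)
    return "\n".join(out)


# Hypothetical input: the second line is a wrapped line containing a colon.
data = "A_Title: see note: x\nwrapped at 10:30 today\nB_Title:  No address"

print(repr(format_loose(data)))
print(repr(format_strict(data)))
```

With this input, the loose version separates the wrapped line from its address, while the strict version inserts a single blank line before B_Title only.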

huangapple
  • Posted on July 3, 2023, 22:19:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76605626.html