July 3, 2023, 22:19:00

How can I add a new line based on a keyword in unstructured data with Python?

Question

I have some text like this:

Forbes_Middle_East: 309, Building 4, Emaar Business Park , Dubai , United
Arab Emirates
Emirates_Neon_Group:  No address
International_Cricket_Council:  No address
Tourism_Development_Authority: The Ras AI Khaimah Tourism Development Authority
was established in May 2011 under the Government of
Ras AI Khaimah. Its purpose is to develop and
promote the emirate's tourism offering and
infrastructure, both domestically and abroad.
Wikipedia Dubai , United
Arab Emirates
Allsopp_&_Allsopp:  No address
Lamprell:  No address

My aim is to add a blank line before every address, so that it looks like this:

Forbes_Middle_East: 309, Building 4, Emaar Business Park , Dubai , United
Arab Emirates

Emirates_Neon_Group:  No address

International_Cricket_Council:  No address

Tourism_Development_Authority: The Ras AI Khaimah Tourism Development Authority
was established in May 2011 under the Government of
Ras AI Khaimah. Its purpose is to develop and
promote the emirate's tourism offering and
infrastructure, both domestically and abroad.
Wikipedia Dubai , United
Arab Emirates

Allsopp_&_Allsopp:  No address

Lamprell:  No address

So the only indicator that a new address starts is the colon (:). The difficulty is the text wrapping.

This is what I'm trying:

with open('test.txt', 'r') as infile:
    data = infile.read()
final_list = []
for ind, val in enumerate(data.split('\n')):
    final_list.append(val)
    if val == ':':
        final_list.insert(-1, '\n')

My logic works most of the time, but it fails when a string has a : in the middle, and it also fails when the text wraps onto the next line.

Can you suggest a better way to do this?

Answer 1

Score: 3

Use regex substitution with re.sub on each address title (recognized by the format <line break><Some_address_title>:):

import re
txt = '''your_input_text'''  # assuming your text
new_text = re.sub(r'\n[^\s:]+:', r'\n\g<0>', txt)
print(new_text)

Output:

Forbes_Middle_East: 309, Building 4, Emaar Business Park , Dubai , United
Arab Emirates

Emirates_Neon_Group:  No address

International_Cricket_Council:  No address

Tourism_Development_Authority: The Ras AI Khaimah Tourism Development Authority
was established in May 2011 under the Government of
Ras AI Khaimah. Its purpose is to develop and
promote the emirate's tourism offering and
infrastructure, both domestically and abroad.
Wikipedia

Allsopp_&_Allsopp:  No address

Lamprell:  No address
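
As a variation on the same idea, here is a minimal sketch that uses a lookahead instead of \g<0>, so only the line break itself is replaced. The sample text and the function name add_blank_lines are made up for illustration; the second-to-last entry deliberately has a colon in the middle of a wrapped line. The lookbehind (?<!\n) is an added assumption that running the function twice should not stack extra blank lines.

```python
import re

# Hypothetical sample; "Its purpose: ..." has a colon in the middle of a wrapped line.
sample = (
    "Forbes_Middle_East: 309, Building 4, Emaar Business Park , Dubai , United\n"
    "Arab Emirates\n"
    "Emirates_Neon_Group:  No address\n"
    "Tourism_Development_Authority: The authority was established in\n"
    "May 2011. Its purpose: develop tourism.\n"
    "Lamprell:  No address\n"
)

# A line break that is not already preceded by another line break and is
# followed by a title: one run of non-space, non-colon characters, then ':'.
TITLE_BREAK = re.compile(r"(?<!\n)\n(?=[^\s:]+:)")

def add_blank_lines(text: str) -> str:
    """Insert one blank line before every address title except the first."""
    return TITLE_BREAK.sub("\n\n", text)

print(add_blank_lines(sample), end="")
```

A wrapped line is left alone as long as its first word is not immediately followed by a colon. A continuation line that happens to start with something like "Note:" would still be treated as a title, so this is only as reliable as the title format.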

Answer 2

Score: 1

Read the input file line by line. Split each line into whitespace-delimited tokens. If the first token ends with a colon, write a newline to the output file before the line, except for the first line. Something like this:

INPUT = '/Volumes/G-Drive/input.txt'
OUTPUT = '/Volumes/G-Drive/output.txt'
with open(INPUT) as _input, open(OUTPUT, 'w') as _output:
    for i, line in enumerate(_input):
        if (tokens := line.split()) and tokens[0][-1] == ':' and i > 0:
            _output.write('\n')
        _output.write(line)
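
The same first-token check can also be tried without touching the file system. The following is a rough sketch, not part of the answer above: the generator name add_blank_lines and the A_Corp / B_Corp sample entries are hypothetical, and io.StringIO stands in for the opened input file.

```python
import io

def add_blank_lines(lines):
    """Yield each line, preceded by a blank line when its first token ends with ':'."""
    for i, line in enumerate(lines):
        tokens = line.split()
        # Skip the very first line so the output does not start with a blank line.
        if i > 0 and tokens and tokens[0].endswith(":"):
            yield "\n"
        yield line

# Hypothetical sample standing in for the input file.
source = io.StringIO("A_Corp: 1 Main St ,\nDubai\nB_Corp:  No address\n")
output = "".join(add_blank_lines(source))
print(output, end="")
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the StringIO, and the result written out with writelines.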

Answer 3

Score: 0

The simpler solution would be:

formatted_data = []
# data is the file content read in the question
for val in data.splitlines():
    # note: this matches a colon anywhere in the line, not only in the title
    if ":" in val:
        val = "\n" + val
    formatted_data.append(val)
print("\n".join(formatted_data))
