# Python - turns string into dictionary where keys are subheadings and values are links
## Question
```python
import re

# Pull out everything between "CAPITAL WORDS:" and the next "----MORE CAPITAL WORDS:" marker.
pattern = r"CAPITAL WORDS:(.*?)(?:\n----MORE CAPITAL WORDS:|$)"
matches = re.search(pattern, descriptions[0], re.DOTALL)
lines = matches.group(1).strip().split("\n")

link_dict = {}
current_subheading = None
for line in lines:
    if line.startswith("----"):
        # A "----" line starts a new subheading; strip the dashes and open a fresh list.
        current_subheading = line.replace("----", "").strip()
        link_dict[current_subheading] = []
    elif current_subheading:
        # Any other line is collected under the most recent subheading.
        link_dict[current_subheading].append(line.strip())
```
In the middle of some text I have the following.
```
Some random text before.

----CAPITAL WORDS:
first subheading
https://link1
https://link2

second subheading
https://link3

third subheading
https://link4
https://link5
https://link6
https://link7

----MORE CAPITAL WORDS:
Some random text after.
```
I would like to extract the string between `----CAPITAL WORDS:` and `----MORE CAPITAL WORDS:` and store it in a dictionary as follows:
```python
{
    'first subheading': ["https://link1", "https://link2"],
    'second subheading': ["https://link3"],
    'third subheading': ["https://link4", "https://link5", "https://link6", "https://link7"]
}
```
### Attempt
```python
import re

pattern = r"CAPITAL WORDS:(.*?)(?:\n----MORE CAPITAL WORDS:|$)"
matches = re.search(pattern, descriptions[0], re.DOTALL)
lines = matches.group(1).strip().split("\n")
link_dict = {}
for line in lines:
    if line:
        pass  # unsure how to continue
```
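For reference, here is what the extraction step of the attempt yields: the flat list of lines that the answers below start from. This is a minimal, self-contained sketch in which `text` is a hypothetical stand-in for `descriptions[0]`, and it assumes the blank lines between groups shown in the sample above.

```python
import re

# Hypothetical stand-in for descriptions[0]; the blank lines between groups matter.
text = ("Some random text before.\n\n"
        "----CAPITAL WORDS:\n"
        "first subheading\nhttps://link1\nhttps://link2\n\n"
        "second subheading\nhttps://link3\n\n"
        "third subheading\nhttps://link4\nhttps://link5\nhttps://link6\nhttps://link7\n\n"
        "----MORE CAPITAL WORDS:\n"
        "Some random text after.")

pattern = r"CAPITAL WORDS:(.*?)(?:\n----MORE CAPITAL WORDS:|$)"
lines = re.search(pattern, text, re.DOTALL).group(1).strip().split("\n")
print(lines)
# ['first subheading', 'https://link1', 'https://link2', '',
#  'second subheading', 'https://link3', '',
#  'third subheading', 'https://link4', 'https://link5', 'https://link6', 'https://link7']
```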
## Answer 1

**Score**: 3
Given that you have already processed `lines` to be:

```python
['first subheading', 'https://link1', 'https://link2', '',
 'second subheading', 'https://link3', '',
 'third subheading', 'https://link4', 'https://link5',
 'https://link6', 'https://link7']
```

you can do this concisely using [`itertools.groupby`](https://docs.python.org/3/library/itertools.html#itertools.groupby):

```python
from itertools import groupby

{next(g): [*g] for k, g in groupby(lines, key=bool) if k}
# {'first subheading': ['https://link1', 'https://link2'],
#  'second subheading': ['https://link3'],
#  'third subheading': ['https://link4', 'https://link5', 'https://link6', 'https://link7']}
```
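The `key=bool` is what does the splitting here: each run of consecutive non-empty lines becomes a single truthy group, the `''` separators fall into falsy groups that `if k` discards, and `next(g)` then peels the subheading off the front of each kept group. A quick demonstration (my own sketch, reusing the `lines` list above):

```python
from itertools import groupby

lines = ['first subheading', 'https://link1', 'https://link2', '',
         'second subheading', 'https://link3', '',
         'third subheading', 'https://link4', 'https://link5',
         'https://link6', 'https://link7']

# Show the raw grouping before the dict comprehension collapses it.
for k, g in groupby(lines, key=bool):
    print(k, list(g))
# True ['first subheading', 'https://link1', 'https://link2']
# False ['']
# True ['second subheading', 'https://link3']
# False ['']
# True ['third subheading', 'https://link4', 'https://link5', 'https://link6', 'https://link7']
```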
## Answer 2

**Score**: 1
```python
example = """Some random text before.
----CAPITAL WORDS:
first subheading
https://link1
https://link2
second subheading
https://link3
third subheading
https://link4
https://link5
https://link6
https://link7
----MORE CAPITAL WORDS:
Some random text after."""

out = {}
current_heading = None
for line in example.splitlines():
    if line.startswith('----'):
        pass                                  # ignore the marker lines
    elif line.startswith('http'):
        out[current_heading].append(line)     # a link: attach it to the current subheading
    elif line.islower():
        current_heading = line                # an all-lowercase line starts a new subheading
        out[current_heading] = []
```
Output:

```python
{'first subheading': ['https://link1', 'https://link2'],
 'second subheading': ['https://link3'],
 'third subheading': ['https://link4', 'https://link5', 'https://link6', 'https://link7']}
```
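The `line.islower()` check is what identifies the subheadings here; it works on this sample because the subheadings are entirely lowercase while the surrounding prose contains capitals and the marker lines are caught by the `----` branch first. A variation of the same loop (my own hedged sketch, not part of the answer) instead treats every non-blank, non-link line as a heading, and therefore expects to be fed only the block between the two markers, e.g. the `lines` list from the question's extraction:

```python
def parse_block(lines):
    """Turn the already-extracted block into {subheading: [links]}."""
    out, current = {}, None
    for line in lines:
        if not line:                                  # skip the blank separator lines
            continue
        if line.startswith(("http://", "https://")):  # a link line
            if current is not None:                   # ignore links before any heading
                out[current].append(line)
        else:                                         # anything else starts a new subheading
            current = line
            out[current] = []
    return out

lines = ['first subheading', 'https://link1', 'https://link2', '',
         'second subheading', 'https://link3', '',
         'third subheading', 'https://link4', 'https://link5',
         'https://link6', 'https://link7']
print(parse_block(lines))
# {'first subheading': ['https://link1', 'https://link2'],
#  'second subheading': ['https://link3'],
#  'third subheading': ['https://link4', 'https://link5', 'https://link6', 'https://link7']}
```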
## Answer 3

**Score**: 1
To match `----CAPITAL WORDS:` and process all following parts without crossing either `----CAPITAL WORDS:` or `----MORE CAPITAL WORDS:`, you could make use of the [PyPi regex module](https://pypi.org/project/regex/) and `captures()` for repeated capture groups.

```
(?:^----CAPITAL WORDS:\n|\G)(?!(----(?:MORE )?CAPITAL WORDS:))(?P<sub>\S.*)(?:\n(?!(?1))(?P<val>\S.*))+\s*
```
The pattern matches:

- `(?:` Non capture group
  - `^----CAPITAL WORDS:\n` Match literally from the start of the string, followed by a newline
  - `|` Or
  - `\G` Assert the current position at the end of the previous match
- `)` Close the non capture group
- `(?!` Negative lookahead, assert that what is directly to the right is not
  - `(----(?:MORE )?CAPITAL WORDS:)` Capture in group 1, matching `CAPITAL WORDS:` with an optional leading `MORE ` part
- `)` Close the negative lookahead
- `(?P<sub>\S.*)` Capture group `sub`, match the single-line subheading (starting with at least a single non-whitespace char to prevent matching empty lines)
- `(?:` Non capture group to repeat as a whole part
  - `\n` Match a newline
  - `(?!(?1))` Assert that the pattern of group 1 is not directly to the right
  - `(?P<val>\S.*)` Capture in group `val` the single-line values like `"https://link1"`
- `)+` Close the non capture group and repeat it 1+ times
- `\s*` Match optional whitespace chars
See a [regex demo](https://regex101.com/r/piFIiu/1) and a [Python demo](https://tio.run/##jZJdT8IwFIbv@ytqr1rEIeJHsqgLUWJIQAhgvGBoytaxRtYtbSEY9bfPw8cFCk651z7vefvmnJO92zhVtTyXSZZqi7WYiAVCGbdWaIVvsCbUc19O4NzVu81BvYWfO737vuurT/@BUe@ILhlo2p1eAzPvp4qBonttZuNbv@@U4AJ1yxqvuiZzPl0TduybEkHIwJeU9NNEYM1VmCbYioXFYxGlWji@8hVBeOuQfcl@SSKpjcUQIhY8lGqyI4itzYxbqUyleqsW0rM9CYwIUhUe6l/b42BjqQ82OC@kF4X0spBe/dHe1WT/6fHOxHgEC@QQhlDCbRCL5WBXy@VEUoUSIN0sWRmb8ga1n1qDZqv52GAoDCxUfHwhGDx@LeMES4WFmiVCcyvoxpS5qxggHiZOwDM708JQAq0kbHg6GoHF9jusGyTKtFSWQg3L828).
```python
import regex

pattern = r"(?:^----CAPITAL WORDS:\n|\G)(?!(----(?:MORE )?CAPITAL WORDS:))(?P<sub>\S.*)(?:\n(?!(?1))(?P<val>\S.*))+\s*"

s = ("Some random text before.\n\n"
     "----CAPITAL WORDS:\n"
     "first subheading\n"
     "https://link1\n"
     "https://link2\n\n"
     "second subheading\n"
     "https://link3\n\n"
     "third subheading\n"
     "https://link4\n"
     "https://link5\n"
     "https://link6\n"
     "https://link7\n\n"
     "----MORE CAPITAL WORDS:\n"
     "Some random text after.")

matches = regex.finditer(pattern, s, regex.MULTILINE)

dct = {}
for m in matches:
    # One match per subheading group: "sub" holds the heading, "val" every link line in it.
    dct[m.captures("sub")[0]] = m.captures("val")

print(dct)
```
Output:

```python
{
 'first subheading': ['https://link1', 'https://link2'],
 'second subheading': ['https://link3'],
 'third subheading': ['https://link4', 'https://link5', 'https://link6', 'https://link7']
}
```
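The standard-library `re` module has no equivalent of `captures()` for repeated groups, so reproducing this with plain `re` usually means extracting the block first and then splitting it on the blank lines. A hedged sketch of that alternative (my own, reusing the same sample string):

```python
import re

s = ("Some random text before.\n\n"
     "----CAPITAL WORDS:\n"
     "first subheading\nhttps://link1\nhttps://link2\n\n"
     "second subheading\nhttps://link3\n\n"
     "third subheading\nhttps://link4\nhttps://link5\nhttps://link6\nhttps://link7\n\n"
     "----MORE CAPITAL WORDS:\n"
     "Some random text after.")

# Grab everything between the two marker lines.
block = re.search(r"----CAPITAL WORDS:\n(.*?)\n----MORE CAPITAL WORDS:", s, re.DOTALL).group(1)

dct = {}
for chunk in block.split("\n\n"):             # blank lines separate the subheading groups
    head, *links = chunk.strip().split("\n")  # first line is the subheading, the rest are links
    dct[head] = links

print(dct)
# {'first subheading': ['https://link1', 'https://link2'],
#  'second subheading': ['https://link3'],
#  'third subheading': ['https://link4', 'https://link5', 'https://link6', 'https://link7']}
```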