Python - turn a string into a dictionary where keys are subheadings and values are links

Question

In the middle of some text I have the following.

Some random text before.

----CAPITAL WORDS:
first subheading
https://link1
https://link2

second subheading
https://link3

third subheading
https://link4
https://link5
https://link6
https://link7

----MORE CAPITAL WORDS:
Some random text after.

I would like to extract the text between ----CAPITAL WORDS: and ----MORE CAPITAL WORDS: and store it in a dictionary as follows:

{
    'first subheading': ["https://link1", "https://link2"],
    'second subheading': ["https://link3"],
    'third subheading': ["https://link4", "https://link5", "https://link6", "https://link7"]
}

Attempt

import re

# descriptions[0] is assumed to hold the full text shown above
pattern = r"CAPITAL WORDS:(.*?)(?:\n----MORE CAPITAL WORDS:|$)"
matches = re.search(pattern, descriptions[0], re.DOTALL)
lines = matches.group(1).strip().split("\n")

link_dict = {}
for line in lines:
    if line:
        pass  # unsure how to continue

Answer 1

Score: 3

Given that you have already processed `lines` to be:

['first subheading', 'https://link1', 'https://link2', '',
 'second subheading', 'https://link3', '',
 'third subheading', 'https://link4', 'https://link5',
 'https://link6', 'https://link7']

You can do this concisely using itertools.groupby (https://docs.python.org/3/library/itertools.html#itertools.groupby):

from itertools import groupby

{next(g): [*g] for k, g in groupby(lines, key=bool) if k}
# {'first subheading':  ['https://link1', 'https://link2'], 
#  'second subheading': ['https://link3'], 
#  'third subheading':  ['https://link4', 'https://link5', 'https://link6', 'https://link7']}
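
The one-liner relies on two details that are easy to miss. `groupby(lines, key=bool)` splits the list into alternating runs of non-empty and empty strings, and `next(g)` consumes the first item of each run (the subheading) before `[*g]` collects the rest. It also depends on the key expression being evaluated before the value expression, which dict comprehensions only guarantee from Python 3.8 onward. A minimal sketch of the same idea as an explicit loop, assuming the `lines` list shown above (the function name `group_links` is made up for illustration):

```python
from itertools import groupby


def group_links(lines):
    """Map each subheading to the links that follow it, splitting on empty lines."""
    result = {}
    for non_empty, run in groupby(lines, key=bool):
        if not non_empty:
            continue  # a run of empty strings only separates the blocks
        heading, *links = run  # the first line of a block is the subheading
        result[heading] = links
    return result


lines = ['first subheading', 'https://link1', 'https://link2', '',
         'second subheading', 'https://link3', '',
         'third subheading', 'https://link4', 'https://link5',
         'https://link6', 'https://link7']

link_dict = group_links(lines)
print(link_dict)
```

Because the unpacking happens in a separate statement, this version does not depend on the evaluation order inside a dict comprehension.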




Answer 2

Score: 1

example = """Some random text before.

----CAPITAL WORDS:
first subheading
https://link1
https://link2

second subheading
https://link3

third subheading
https://link4
https://link5
https://link6
https://link7

----MORE CAPITAL WORDS:
Some random text after."""


out = {}
current_heading = None
for line in example.splitlines():
    if line.startswith('----'):
        pass  # marker lines carry no data
    elif line.startswith('http'):
        out[current_heading].append(line)  # link belongs to the latest subheading
    elif line.islower():
        # an all-lowercase line is treated as a subheading
        current_heading = line
        out[current_heading] = []

Output:

{'first subheading': ['https://link1', 'https://link2'],
 'second subheading': ['https://link3'],
 'third subheading': ['https://link4', 'https://link5', 'https://link6', 'https://link7']}
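
The loop above decides what each line is from its content. A line counts as a subheading only if `str.islower()` is true, and the text before and after the markers is skipped only because it happens to contain an uppercase letter. A minimal sketch of a variant that tracks the markers instead, assuming they always appear on their own lines (the function name `parse_links` and its `start`/`end` parameters are made up for illustration, not part of the answer):

```python
def parse_links(text, start="----CAPITAL WORDS:", end="----MORE CAPITAL WORDS:"):
    """Collect {subheading: [links]} from the lines between the two markers."""
    out = {}
    current_heading = None
    inside = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == start:
            inside = True  # start collecting after the opening marker
        elif line == end:
            break  # everything after the closing marker is ignored
        elif not inside or not line:
            continue  # skip text before the section and blank separator lines
        elif line.startswith("http"):
            if current_heading is not None:
                out[current_heading].append(line)
        else:
            current_heading = line  # any other line starts a new subheading
            out[current_heading] = []
    return out


example = """Some random text before.

----CAPITAL WORDS:
first subheading
https://link1
https://link2

second subheading
https://link3

----MORE CAPITAL WORDS:
Some random text after."""

link_dict = parse_links(example)
print(link_dict)
```

With this approach a subheading such as `First Subheading` is still picked up, and a link that appears before any subheading is ignored rather than raising a `KeyError`.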

Answer 3

Score: 1

To match ----CAPITAL WORDS: and process all the following parts without crossing either ----CAPITAL WORDS: or ----MORE CAPITAL WORDS:, you could use the PyPI regex module (https://pypi.org/project/regex/) and captures() for repeated capture groups.

(?:^----CAPITAL WORDS:\n|\G)(?!(----(?:MORE )?CAPITAL WORDS:))(?P<sub>\S.*)(?:\n(?!(?1))(?P<val>\S.*))+\s*

The pattern matches:

  • (?: Non-capturing group
    • ^----CAPITAL WORDS:\n Match ----CAPITAL WORDS: literally at the start of a line (regex.MULTILINE is set), followed by a newline
    • | Or
    • \G Assert the current position at the end of the previous match
  • ) Close the non-capturing group
  • (?! Negative lookahead, assert that what is directly to the right is not
    • (----(?:MORE )?CAPITAL WORDS:) Capture in group 1, matching ----CAPITAL WORDS: with an optional leading MORE part
  • ) Close the negative lookahead
  • (?P<sub>\S.*) Capture group sub, match the single-line subheading (starting with at least one non-whitespace character to prevent matching empty lines)
  • (?: Non-capturing group to repeat as a whole
    • \n Match a newline
    • (?!(?1)) Assert that the pattern of group 1 is not directly to the right
    • (?P<val>\S.*) Capture group val, match a single-line value such as "https://link1"
  • )+ Close the non-capturing group and repeat it 1 or more times
  • \s* Match optional whitespace characters

See a regex demo (https://regex101.com/r/piFIiu/1) and a Python demo (https://tio.run/##jZJdT8IwFIbv@ytqr1rEIeJHsqgLUWJIQAhgvGBoytaxRtYtbSEY9bfPw8cFCk561z7vefvmnJO92zhVtTyXSZZqi7WYiAVCGbdWaIVvsCbUc19O4NzVu81BvYWfO737vuurT/@BUe@ILhlo2p1eAzPvp4qBonttZuNbv@@U4AJ1yxqvuiZzPl0TduybEkHIwJeU9NNEYM1VmCbYioXFYxGlWji@8hVBeOuQfcl@SSKpjcUQIhY8lGqyI4itzYxbqUyleqsW0rM9CYwIUhUe6l/b42BjqQ82OC@kF4X0spBe/dHe1WT/6fHOxHgEC@QQhlDCbRCL5WBXy@VEUoUSIN0sWRmb8ga1n1qDZqv52GAoDCxUfHwhGDx@LeMES4WFmiVCcyvoxpS5qxggHiZOwDM708JQAq0kbHg6GoHF9jusGyTKtFSWQg3L828).

import regex

pattern = r"(?:^----CAPITAL WORDS:\n|\G)(?!(----(?:MORE )?CAPITAL WORDS:))(?P<sub>\S.*)(?:\n(?!(?1))(?P<val>\S.*))+\s*"

s = ("Some random text before.\n\n"
     "----CAPITAL WORDS:\n"
     "first subheading\n"
     "https://link1\n"
     "https://link2\n\n"
     "second subheading\n"
     "https://link3\n\n"
     "third subheading\n"
     "https://link4\n"
     "https://link5\n"
     "https://link6\n"
     "https://link7\n\n"
     "----MORE CAPITAL WORDS:\n"
     "Some random text after.")

matches = regex.finditer(pattern, s, regex.MULTILINE)
dct = {}
for m in matches:
    # captures() returns every value a repeated group matched, not just the last one
    dct[m.captures("sub")[0]] = m.captures("val")
print(dct)

Output

{
'first subheading': ['https://link1', 'https://link2'],
'second subheading': ['https://link3'],
'third subheading': ['https://link4', 'https://link5', 'https://link6', 'https://link7']
}
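
The `regex` module is a third-party package, and `\G`, `(?1)` and `captures()` are not available in the standard library's `re`. If installing it is not an option, a two-step approach gives the same result for this input: isolate the section first, then match each block. A minimal sketch, assuming blocks are separated by empty lines as in the example (the names `SECTION`, `BLOCK` and `parse_section` are made up for illustration):

```python
import re

# Text after "----CAPITAL WORDS:" up to the next marker line or the end of the string.
SECTION = re.compile(
    r"^----CAPITAL WORDS:\n(.*?)(?=^----(?:MORE )?CAPITAL WORDS:|\Z)",
    re.MULTILINE | re.DOTALL,
)
# One subheading line followed by one or more non-empty value lines.
BLOCK = re.compile(r"^(?P<sub>\S.*)\n(?P<vals>(?:\S.*(?:\n|\Z))+)", re.MULTILINE)


def parse_section(text):
    """Return {subheading: [values]} for the first marked section, or {} if none."""
    section = SECTION.search(text)
    if section is None:
        return {}
    return {
        block["sub"]: block["vals"].splitlines()
        for block in BLOCK.finditer(section.group(1))
    }


s = ("Some random text before.\n\n"
     "----CAPITAL WORDS:\n"
     "first subheading\n"
     "https://link1\n"
     "https://link2\n\n"
     "second subheading\n"
     "https://link3\n\n"
     "----MORE CAPITAL WORDS:\n"
     "Some random text after.")

dct = parse_section(s)
print(dct)
```

Like the pattern above, `BLOCK` requires at least one value line, so a subheading with no links under it is skipped.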

huangapple
  • Posted on August 10, 2023, 22:09:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76876526.html