2023-08-10 22:09:41

Python: turn a string into a dictionary where keys are subheadings and values are links

Question

In the middle of some text I have the following.

Some random text before.

----CAPITAL WORDS:
first subheading
https://link1
https://link2

second subheading
https://link3

third subheading
https://link4
https://link5
https://link6
https://link7

----MORE CAPITAL WORDS:
Some random text after.

I would like to extract the text between `----CAPITAL WORDS:` and `----MORE CAPITAL WORDS:` and store it in a dictionary as follows:

{
    'first subheading': ["https://link1", "https://link2"],
    'second subheading': ["https://link3"],
    'third subheading': ["https://link4", "https://link5", "https://link6", "https://link7"]
}

Attempt

import re

pattern = r"CAPITAL WORDS:(.*?)(?:\n----MORE CAPITAL WORDS:|$)"
matches = re.search(pattern, descriptions[0], re.DOTALL)
lines = matches.group(1).strip().split("\n")

link_dict = {}
for line in lines:
    if line:
        pass  # unsure how to continue

Answer 1

Score: 3

Given that you have already processed `lines` to be:

    ['first subheading', 'https://link1', 'https://link2', '',
     'second subheading', 'https://link3', '',
     'third subheading', 'https://link4', 'https://link5',
     'https://link6', 'https://link7']

You can do this concisely using [`itertools.groupby`][0]:

    from itertools import groupby

    {next(g): [*g] for k, g in groupby(lines, key=bool) if k}
    # {'first subheading':  ['https://link1', 'https://link2'],
    #  'second subheading': ['https://link3'],
    #  'third subheading':  ['https://link4', 'https://link5', 'https://link6', 'https://link7']}

[0]: https://docs.python.org/3/library/itertools.html#itertools.groupby
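To see why the comprehension works: `groupby(lines, key=bool)` splits the list into alternating runs of non-empty strings (key `True`) and empty strings (key `False`). For each non-empty run, `next(g)` consumes the first item as the subheading and `[*g]` collects the remaining items as links. This depends on the key expression being evaluated before the value expression, which is the defined order from Python 3.8 onward. The sketch below combines the asker's regex extraction with the comprehension; the `text` sample and the `parse_links` name are illustrative assumptions, not part of the answer.

```python
import re
from itertools import groupby

# Hypothetical sample input mirroring the layout shown in the question.
text = """Some random text before.

----CAPITAL WORDS:
first subheading
https://link1
https://link2

second subheading
https://link3

third subheading
https://link4
https://link5
https://link6
https://link7

----MORE CAPITAL WORDS:
Some random text after."""


def parse_links(text):
    """Return {subheading: [links]} for the block between the two markers."""
    pattern = r"----CAPITAL WORDS:(.*?)(?:\n----MORE CAPITAL WORDS:|$)"
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        return {}  # start marker not found
    # Strip each line so whitespace-only lines also act as blank separators.
    lines = [line.strip() for line in match.group(1).strip().split("\n")]
    # key=bool yields alternating runs: non-empty lines (True), blank lines (False).
    # next(g) takes the heading first; [*g] then collects what is left (Python 3.8+).
    return {next(g): [*g] for k, g in groupby(lines, key=bool) if k}


link_dict = parse_links(text)
print(link_dict)
```

Returning an empty dictionary when the start marker is missing is a design choice made here for illustration; raising an exception may suit stricter callers better.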



Answer 2

Score: 1


example = """Some random text before.

----CAPITAL WORDS:
first subheading
https://link1
https://link2

second subheading
https://link3

third subheading
https://link4
https://link5
https://link6
https://link7

----MORE CAPITAL WORDS:
Some random text after."""


out = {}
current_heading = None
for line in example.splitlines():
    if line.startswith('----'):
        # Marker lines such as "----CAPITAL WORDS:" carry no data.
        pass
    elif line.startswith('http'):
        # A link belongs to the most recently seen subheading.
        out[current_heading].append(line)
    elif line.islower():
        # An all-lowercase line is treated as a subheading. Blank lines and
        # the surrounding prose (which contains a capital letter) are skipped.
        current_heading = line
        out[current_heading] = []

Output:

{'first subheading': ['https://link1', 'https://link2'],
'second subheading': ['https://link3'],
'third subheading': ['https://link4', 'https://link5', 'https://link6', 'https://link7']}
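The loop above relies on two conventions of the sample input: links start with `http` and subheadings are entirely lowercase. As a rough alternative under different assumptions, the sketch below keys off blank lines instead and only reads the lines between the two markers. The function name `parse_section`, its parameters, and the `sample` text are made up for illustration.

```python
def parse_section(text, start="----CAPITAL WORDS:", end="----MORE CAPITAL WORDS:"):
    """Group lines between the start and end markers into {heading: [values]}.

    Assumes groups are separated by blank lines and that the first line of
    each group is its heading.
    """
    out = {}
    inside = False          # True once the start marker has been seen
    current_heading = None  # None means the next non-blank line is a heading
    for raw in text.splitlines():
        line = raw.strip()
        if line == start:
            inside = True
            current_heading = None
        elif line == end:
            break
        elif not inside:
            continue
        elif not line:
            current_heading = None  # a blank line closes the current group
        elif current_heading is None:
            current_heading = line
            out[current_heading] = []
        else:
            out[current_heading].append(line)
    return out


sample = """intro

----CAPITAL WORDS:
First Subheading
https://link1
ftp://link2

second subheading
https://link3

----MORE CAPITAL WORDS:
outro"""

print(parse_section(sample))
# {'First Subheading': ['https://link1', 'ftp://link2'], 'second subheading': ['https://link3']}
```

The trade-off is that this version depends on blank-line separators, whereas the original loop tolerates missing blank lines as long as headings are lowercase and links start with `http`.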

Answer 3

Score: 1

To match `----CAPITAL WORDS:` and process all following parts without crossing either `----CAPITAL WORDS:` or `----MORE CAPITAL WORDS:`, you could make use of the [PyPI regex module](https://pypi.org/project/regex/) and `captures()` for repeated capture groups.

(?:^----CAPITAL WORDS:\n|\G)(?!(----(?:MORE )?CAPITAL WORDS:))(?P<sub>\S.*)(?:\n(?!(?1))(?P<val>\S.*))+\s*

The pattern matches:

(?: Non-capture group
- ^----CAPITAL WORDS:\n Match the literal marker at the start of a line (the code below passes regex.MULTILINE), followed by a newline
- | Or
- \G Assert the current position at the end of the previous match
) Close the non-capture group
(?! Negative lookahead, assert that what is directly to the right is not
- (----(?:MORE )?CAPITAL WORDS:) Capture in group 1, matching ----CAPITAL WORDS: with an optional MORE part before CAPITAL
) Close the negative lookahead
(?P<sub>\S.*) Capture group sub, match the single-line subheading (starting with at least one non-whitespace char to prevent matching empty lines)
(?: Non-capture group to repeat as a whole part
- \n Match a newline
- (?!(?1)) Assert that the pattern of group 1 is not directly to the right
- (?P<val>\S.*) Capture group val, match a single-line value such as "https://link1"
)+ Close the non-capture group and repeat it 1+ times
\s* Match optional whitespace chars

See a [regex demo](https://regex101.com/r/piFIiu/1) and a [Python demo](https://tio.run/##jZJdT8IwFIbv@ytqr1rEIeJHsqgLUWJIQAhgvGBoytaxRtYtbSEY9bfPw8cFCk561z7vefvmnJO92zhVtTyXSZZqi7WYiAVCGbdWaIVvsCbUc19O4NzVu81BvYWfO737vuurT/@BUe@ILhlo2p1eAzPvp4qBonttZuNbv@@U4AJ1yxqvuiZzPl0TduybEkHIwJeU9NNEYM1VmCbYioXFYxGlWji@8hVBeOuQfcl@SSKpjcUQIhY8lGqyI4itzYxbqUyleqsW0rM9CYwIUhUe6l/b42BjqQ82OC@kF4X0spBe/dHe1WT/6fHOxHgEC@QQhlDCbRCL5WBXy@VEUoUSIN0sWRmb8ga1n1qDZqv52GAoDCxUfHwhGDx@LeMES4WFmiVCcyvoxpS5qxggHiZOwDM708JQAq0kbHg6GoHF9jusGyTKtFSWQg3L828).

import regex

pattern = r"(?:^----CAPITAL WORDS:\n|\G)(?!(----(?:MORE )?CAPITAL WORDS:))(?P<sub>\S.*)(?:\n(?!(?1))(?P<val>\S.*))+\s*"

s = ("Some random text before.\n\n"
     "----CAPITAL WORDS:\n"
     "first subheading\n"
     "https://link1\n"
     "https://link2\n\n"
     "second subheading\n"
     "https://link3\n\n"
     "third subheading\n"
     "https://link4\n"
     "https://link5\n"
     "https://link6\n"
     "https://link7\n\n"
     "----MORE CAPITAL WORDS:\n"
     "Some random text after.")

matches = regex.finditer(pattern, s, regex.MULTILINE)
dct = {}
for m in matches:
    # captures() returns every value a repeated group matched, not just the last one.
    dct[m.captures("sub")[0]] = m.captures("val")
print(dct)

Output

{
'first subheading': ['https://link1', 'https://link2'],
'second subheading': ['https://link3'],
'third subheading': ['https://link4', 'https://link5', 'https://link6', 'https://link7']
}
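The `regex` module is a third-party package, and the standard-library `re` module does not support `\G`, subroutine calls such as `(?1)`, or `captures()`. If adding a dependency is not an option, a two-step approach with `re` produces the same dictionary for the sample input. This is a sketch of an alternative technique, not the answer's single-pattern method, and `parse_with_re` is a made-up name.

```python
import re


def parse_with_re(s):
    """Stdlib-only sketch: isolate the section, then split it on blank lines."""
    # Step 1: grab everything between the two marker lines.
    section = re.search(
        r"^----CAPITAL WORDS:\n(.*?)^----MORE CAPITAL WORDS:",
        s,
        re.MULTILINE | re.DOTALL,
    )
    if section is None:
        return {}
    result = {}
    # Step 2: blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", section.group(1).strip()):
        if not block.strip():
            continue  # nothing between the markers
        # The first line of a block is the subheading, the rest are its values.
        heading, *values = block.splitlines()
        result[heading] = values
    return result


s = ("Some random text before.\n\n"
     "----CAPITAL WORDS:\n"
     "first subheading\n"
     "https://link1\n"
     "https://link2\n\n"
     "second subheading\n"
     "https://link3\n\n"
     "third subheading\n"
     "https://link4\n"
     "https://link5\n"
     "https://link6\n"
     "https://link7\n\n"
     "----MORE CAPITAL WORDS:\n"
     "Some random text after.")

result = parse_with_re(s)
print(result)
```

One behavioral difference: this sketch accepts a subheading with no values (it maps to an empty list), whereas the `regex` pattern above requires at least one value line per subheading.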
