Python - turn a string into a dictionary where keys are subheadings and values are links

Question


In the middle of some text I have the following.

    Some random text before.
    ----CAPITAL WORDS:
    first subheading
    https://link1
    https://link2
    second subheading
    https://link3
    third subheading
    https://link4
    https://link5
    https://link6
    https://link7
    ----MORE CAPITAL WORDS:
    Some random text after.

I would like to extract the text between `----CAPITAL WORDS:` and `----MORE CAPITAL WORDS:` and store it in a dictionary as follows:

    {
        'first subheading': ["https://link1", "https://link2"],
        'second subheading': ["https://link3"],
        'third subheading': ["https://link4", "https://link5", "https://link6", "https://link7"]
    }

Attempt

    import re

    # descriptions[0] is the input text
    pattern = r"CAPITAL WORDS:(.*?)(?:\n----MORE CAPITAL WORDS:|$)"
    matches = re.search(pattern, descriptions[0], re.DOTALL)
    lines = matches.group(1).strip().split("\n")
    link_dict = {}
    for line in lines:
        if line:
            pass  # unsure how to continue

Answer 1

Score: 3

Given that you have already processed `lines` to be:

    ['first subheading', 'https://link1', 'https://link2', '',
     'second subheading', 'https://link3', '',
     'third subheading', 'https://link4', 'https://link5',
     'https://link6', 'https://link7']

You can do this concisely using [`itertools.groupby`](https://docs.python.org/3/library/itertools.html#itertools.groupby):

    from itertools import groupby

    # key=bool splits the list into runs of non-empty and empty strings;
    # in each non-empty run the first item is the subheading, the rest are links.
    # Relies on the key being evaluated before the value (Python 3.8+).
    {next(g): [*g] for k, g in groupby(lines, key=bool) if k}
    # {'first subheading': ['https://link1', 'https://link2'],
    #  'second subheading': ['https://link3'],
    #  'third subheading': ['https://link4', 'https://link5', 'https://link6', 'https://link7']}
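For completeness, here is a minimal end-to-end sketch that combines the extraction step from the question with the `groupby` idea. It assumes the groups are separated by blank lines (as the `''` entries in `lines` suggest); the function name `section_to_dict` and the shortened sample text are made up for illustration. An explicit loop is used instead of the comprehension so the result does not depend on the key/value evaluation order.

```python
import re
from itertools import groupby

# Shortened sample input; blank lines between groups are an assumption.
text = """Some random text before.

----CAPITAL WORDS:
first subheading
https://link1
https://link2

second subheading
https://link3

----MORE CAPITAL WORDS:
Some random text after."""


def section_to_dict(text):
    """Map each subheading in the marked section to its list of links."""
    match = re.search(
        r"----CAPITAL WORDS:\n(.*?)(?:\n----MORE CAPITAL WORDS:|$)",
        text,
        re.DOTALL,
    )
    if match is None:
        return {}
    lines = [line.strip() for line in match.group(1).strip().split("\n")]
    result = {}
    # Runs of non-empty lines form one group: heading first, links after.
    for non_empty, group in groupby(lines, key=bool):
        if non_empty:
            heading, *links = group
            result[heading] = links
    return result


result = section_to_dict(text)
print(result)
```

If the input has no blank lines between groups, this approach puts everything under the first subheading, so a line-by-line classification may be the better fit.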
Answer 2

Score: 1
    example = """Some random text before.
    ----CAPITAL WORDS:
    first subheading
    https://link1
    https://link2
    second subheading
    https://link3
    third subheading
    https://link4
    https://link5
    https://link6
    https://link7
    ----MORE CAPITAL WORDS:
    Some random text after."""

    out = {}
    current_heading = None
    for line in example.splitlines():
        if line.startswith('----'):
            # marker lines carry no data
            pass
        elif line.startswith('http'):
            out[current_heading].append(line)
        elif line.islower():
            # an all-lowercase line is treated as a subheading
            current_heading = line
            out[current_heading] = []
    print(out)

Output:

    {'first subheading': ['https://link1', 'https://link2'],
     'second subheading': ['https://link3'],
     'third subheading': ['https://link4', 'https://link5', 'https://link6', 'https://link7']}
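Note that the loop above scans the whole text and relies on `islower()` to spot subheadings, so a heading containing a capital letter would be skipped. A possible variation, sketched here under the assumption that every link starts with `http://` or `https://` and every other non-empty line in the section is a subheading, first cuts out the section with `str.partition`. The function name `parse_links` and its parameters are hypothetical.

```python
def parse_links(text, start="----CAPITAL WORDS:", end="----MORE CAPITAL WORDS:"):
    """Collect links under each subheading found between the two markers."""
    _, found, rest = text.partition(start)
    if not found:
        return {}
    # Everything before the end marker (or the whole rest if it is missing).
    section, _, _ = rest.partition(end)

    out = {}
    current = None
    for raw in section.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(("http://", "https://")):
            # Ignore links that appear before any subheading.
            if current is not None:
                out[current].append(line)
        else:
            current = line
            out[current] = []
    return out


sample = (
    "Some random text before.\n"
    "----CAPITAL WORDS:\n"
    "first subheading\n"
    "https://link1\n"
    "https://link2\n"
    "Second Subheading\n"
    "https://link3\n"
    "----MORE CAPITAL WORDS:\n"
    "Some random text after."
)
print(parse_links(sample))
```

Restricting the scan to the marked section also keeps the surrounding "random text" from being misread as a subheading.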

Answer 3

Score: 1

To match `----CAPITAL WORDS:` and process all following parts without crossing either `----CAPITAL WORDS:` or `----MORE CAPITAL WORDS:`, you could use the [PyPI regex module](https://pypi.org/project/regex/) and `captures()` for repeated capture groups.

    (?:^----CAPITAL WORDS:\n|\G)(?!(----(?:MORE )?CAPITAL WORDS:))(?P<sub>\S.*)(?:\n(?!(?1))(?P<val>\S.*))+\s*

The pattern matches:

  • (?: Non-capture group
    • ^----CAPITAL WORDS:\n Match `----CAPITAL WORDS:` literally at the start of a line (the code below uses regex.MULTILINE), followed by a newline
    • | Or
    • \G Assert the current position at the end of the previous match
  • ) Close the non-capture group
  • (?! Negative lookahead, assert that what is directly to the right is not
    • (----(?:MORE )?CAPITAL WORDS:) Capture in group 1, matching `----CAPITAL WORDS:` with an optional `MORE ` part before `CAPITAL`
  • ) Close the negative lookahead
  • (?P<sub>\S.*) Capture group sub, match the single-line subheading (starting with at least one non-whitespace char to prevent matching empty lines)
  • (?: Non-capture group, repeated as a whole
    • \n Match a newline
    • (?!(?1)) Assert that the pattern of group 1 is not directly to the right
    • (?P<val>\S.*) Capture group val, match a single-line value like "https://link1"
  • )+ Close the non-capture group and repeat it 1+ times
  • \s* Match optional whitespace chars

See a [regex demo](https://regex101.com/r/piFIiu/1) and a [Python demo](https://tio.run/##jZJdT8IwFIbv@ytqr1rEIeJHsqgLUWJIQAhgvGBoytaxRtYtbSEY9bfPw8cFCk561z7vefvmnJO92zhVtTyXSZZqi7WYiAVCGbdWaIVvsCbUc19O4NzVu81BvYWfO737vuurT/@BUe@ILhlo2p1eAzPvp4qBonttZuNbv@@U4AJ1yxqvuiZzPl0TduybEkHIwJeU9NNEYM1VmCbYioXFYxGlWji@8hVBeOuQfcl@SSKpjcUQIhY8lGqyI4itzYxbqUyleqsW0rM9CYwIUhUe6l/b42BjqQ82OC@kF4X0spBe/dHe1WT/6fHOxHgEC@QQhlDCbRCL5WBXy@VEUoUSIN0sWRmb8ga1n1qDZqv52GAoDCxUfHwhGDx@LeMES4WFmiVCcyvoxpS5qxggHiZOwDM708JQAq0kbHg6GoHF9jusGyTKtFSWQg3L828).

    import regex  # third-party module: pip install regex

    pattern = r"(?:^----CAPITAL WORDS:\n|\G)(?!(----(?:MORE )?CAPITAL WORDS:))(?P<sub>\S.*)(?:\n(?!(?1))(?P<val>\S.*))+\s*"

    s = ("Some random text before.\n\n"
         "----CAPITAL WORDS:\n"
         "first subheading\n"
         "https://link1\n"
         "https://link2\n\n"
         "second subheading\n"
         "https://link3\n\n"
         "third subheading\n"
         "https://link4\n"
         "https://link5\n"
         "https://link6\n"
         "https://link7\n\n"
         "----MORE CAPITAL WORDS:\n"
         "Some random text after.")

    matches = regex.finditer(pattern, s, regex.MULTILINE)
    dct = {}
    for m in matches:
        # captures() returns every value a repeated group matched, not just the last
        dct[m.captures("sub")[0]] = m.captures("val")
    print(dct)

Output

    {
        'first subheading': ['https://link1', 'https://link2'],
        'second subheading': ['https://link3'],
        'third subheading': ['https://link4', 'https://link5', 'https://link6', 'https://link7']
    }
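The reason for reaching for `captures()` is that the standard library's `re` module keeps only the last value of a repeated capture group. A minimal stdlib-only sketch of that behavior and of a common two-step workaround follows; the sample string `block` and the simplified pattern are made up for illustration and are not the pattern from this answer.

```python
import re

# One subheading followed by its links (illustrative sample).
block = "third subheading\nhttps://link4\nhttps://link5\nhttps://link6\n"

# With the stdlib `re` module, a repeated group remembers only its
# final repetition.
m = re.match(r"(?P<sub>\S.*)\n(?:(?P<val>\S.*)\n)+", block)
print(m.group("sub"))  # third subheading
print(m.group("val"))  # https://link6

# Workaround: match the whole block once, then collect its lines
# in a second pass.
heading, *links = re.findall(r"^\S.*$", m.group(0), re.MULTILINE)
print(heading)  # third subheading
print(links)    # ['https://link4', 'https://link5', 'https://link6']
```

This may be enough when installing a third-party package is not an option, at the cost of a second pass over each matched block.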

huangapple
  • Posted on August 10, 2023 at 22:09:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76876526.html