Issues creating a regex to extract code from Markdown

Question

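I'm trying to extract code from a string of Markdown and I'm very close. My code is:

````python
import re

string = """
Lorem ipsum
```python
print('foo```bar```foo')
print('foo```bar```foo')
```
Lorem ipsum
"""

pattern = r'```(?:\w+\n)?(.*?)(?!.*```)'
result = re.search(pattern, string, re.DOTALL).group(1)
print(result)
````

And the result of this is:

```
print('foo```bar```foo')
print('foo```bar```foo')
`
```

You'll notice the only problem I have is the extra backtick at the end of that code block. I can't figure out what's matching it or how to remove it, but I'm certain it has something to do with the negative lookahead I'm using.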

<details>
<summary>English original:</summary>

I'm trying to extract code from a string of Markdown and I'm very close. My code is:

````python
import re

string = """
Lorem ipsum
```python
print('foo```bar```foo')
print('foo```bar```foo')
```
Lorem ipsum
"""

pattern = r'```(?:\w+\n)?(.*?)(?!.*```)'
result = re.search(pattern, string, re.DOTALL).group(1)
print(result)
````

And the result of this is:

```
print('foo```bar```foo')
print('foo```bar```foo')
`
```

You'll notice the only problem I have is the extra backtick at the end of that code block. I can't figure out what's matching that or how to remove it but I'm certain it has something to do with the negative lookahead I'm using.

</details>


# Answer 1
**Score**: 1


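The negative lookahead ``(?!.*```)`` fails wherever three backticks still occur later in the string, so the lazy group keeps growing until that is no longer true. That first happens just after the first `` ` `` on the line following the second ``print('foo```bar```foo')``: only two backticks of the closing fence remain, so `` .*``` `` no longer matches and the match ends there, with that one backtick already captured. See for example [this demo][1]. Note that this method doesn't work at all when there is more than one code block.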
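
The behaviour described above can be checked directly. The following is a minimal sketch, assuming a shortened one-line version of the question's sample; the names `text`, `lookahead_succeeds`, and `two_blocks` are illustrative and do not appear in the original post.

```python
import re

# Shortened version of the sample from the question (one print line only).
text = (
    "Lorem ipsum\n"
    "```python\n"
    "print('foo```bar```foo')\n"
    "```\n"
    "Lorem ipsum\n"
)

# The lookahead on its own: it succeeds only where no "```" occurs later on.
lookahead = re.compile(r"(?!.*```)", re.DOTALL)

def lookahead_succeeds(pos):
    # Pattern.match(text, pos) attempts a match starting exactly at index pos.
    return lookahead.match(text, pos) is not None

fence = text.rfind("```")  # index where the closing fence starts

print(lookahead_succeeds(fence))      # False: the full fence is still ahead
print(lookahead_succeeds(fence + 1))  # True: only two backticks remain

# The pattern from the question therefore captures one backtick too many.
original = re.compile(r"```(?:\w+\n)?(.*?)(?!.*```)", re.DOTALL)
captured = original.search(text).group(1)
print(repr(captured))

# With two blocks, the lazy group runs on to the last fence in the string.
two_blocks = text + text
captured_two = original.search(two_blocks).group(1)
print("Lorem ipsum" in captured_two)  # True: prose between blocks is captured
```

The last check illustrates why the lookahead approach breaks down with several blocks: the only position with no fence ahead of it is inside the final fence of the whole string.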

Probably the safest approach is to rely on the `` ``` `` being the first thing on the line. Then you can match up to the next occurrence of `` ``` `` at the start of a line instead:

```regex
^```(?:\w+)?\s*\n(.*?)(?=^```)```
```
[Demo on regex101][2]

In Python:
````python
import re

string = """
Lorem ipsum
```python
print('foo```bar```foo')
print('foo```bar```foo')
```
Lorem ipsum
```python
print('foo```bar```foo')
print('foo```bar```foo')
```
Lorem ipsum
```
print('foo```bar```foo')
print('foo```bar```foo')
```
"""

# re.MULTILINE lets ^ match at the start of every line;
# re.DOTALL lets . match newlines so a block can span several lines.
pattern = r'^```(?:\w+)?\s*\n(.*?)(?=^```)```'
result = re.findall(pattern, string, re.DOTALL | re.MULTILINE)
print(*result, sep='\n')
````

Output:

```
print('foo```bar```foo')
print('foo```bar```foo')

print('foo```bar```foo')
print('foo```bar```foo')

print('foo```bar```foo')
print('foo```bar```foo')
```

  [1]: https://regex101.com/r/UMHjzn/1
  [2]: https://regex101.com/r/8iN6FJ/1
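
For reuse, the line-anchored pattern from the answer could be wrapped in a small helper. This is only a sketch under stated assumptions: the function name `extract_code_blocks`, the named groups `lang` and `code`, and the `sample` text are additions for illustration, not part of the original answer.

```python
import re

# Same pattern as in the answer, with named groups added for readability.
FENCE_RE = re.compile(
    r"^```(?P<lang>\w+)?\s*\n(?P<code>.*?)(?=^```)```",
    re.DOTALL | re.MULTILINE,
)

def extract_code_blocks(markdown):
    """Return (language, code) pairs; language is '' when the fence has none."""
    return [
        (m.group("lang") or "", m.group("code"))
        for m in FENCE_RE.finditer(markdown)
    ]

sample = (
    "Lorem ipsum\n"
    "```python\n"
    "print('foo```bar```foo')\n"
    "```\n"
    "Lorem ipsum\n"
    "```\n"
    "echo hi\n"
    "```\n"
)

blocks = extract_code_blocks(sample)
for lang, code in blocks:
    print(repr(lang), repr(code))

# Without re.MULTILINE, ^ only matches at the very start of the string,
# so the same pattern finds nothing in this sample.
without_multiline = re.findall(FENCE_RE.pattern, sample, re.DOTALL)
print(without_multiline)
```

Two limitations are worth keeping in mind. `\w+` only matches info strings made of word characters, so a fence such as `` ```c++ `` would not be recognised as an opener. The pattern also assumes every fence uses exactly three backticks at the start of a line.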

<details>
<summary>English original:</summary>

The first character which doesn't match `` .*``` `` (and hence terminates the match) is the `` ` `` at the start of the line after ``print('foo```bar```foo')``. See for example [this demo][1]. You'll note that this method doesn't work at all when there is more than one code block.

Probably the safest approach is to rely on the `` ``` `` being the first thing on the line. Then you can match up to the next occurrence of `` ``` `` at the start of a line instead:

```regex
^```(?:\w+)?\s*\n(.*?)(?=^```)```
```
[Demo on regex101][2]

In python:
````python
import re

string = """
Lorem ipsum
```python
print('foo```bar```foo')
print('foo```bar```foo')
```
Lorem ipsum
```python
print('foo```bar```foo')
print('foo```bar```foo')
```
Lorem ipsum
```
print('foo```bar```foo')
print('foo```bar```foo')
```
"""

pattern = r'^```(?:\w+)?\s*\n(.*?)(?=^```)```'
result = re.findall(pattern, string, re.DOTALL | re.MULTILINE)
print(*[r for r in result], sep='\n')
````

Output:

```
print('foo```bar```foo')
print('foo```bar```foo')

print('foo```bar```foo')
print('foo```bar```foo')

print('foo```bar```foo')
print('foo```bar```foo')
```

</details>

huangapple
  • Published on 2023-05-17 16:18:16
  • Link to this article: https://go.coder-hub.com/76269934.html