How to match text in quotation marks between p tags (regex) - Calibre Search and Replace

Question

I need to do some formatting on the text below, and to do so I need to match only the text between quotes inside p tags (<p> and </p>).

This text below is an example:

<div class="vung_doc" id="vung_doc">
<p>Volume 1: The Mysterious Driver
</p>
<p>He picked up the pistol from the pool of blood and pointed it at the
person coming towards him, screaming, "I'll kill you!"
</p>
<p>No matter how many times he pressed the trigger, the rounds didn't budge.
The approaching figure mockingly spoke, "Haha, what a scene! The Great Detective Song Lang,
actually killing his superior and partner with his very own hands! I can't wait to see the
headlines in the newspapers tomorrow!"
</p>

I need to match only "I'll kill you!" and "Haha, what a scene! The Great Detective Song Lang, actually killing his superior and partner with his very own hands! I can't wait to see the headlines in the newspapers tomorrow!"

But most of the regexes I tried matched either all the text between quotes, "(.*?)", or all the text between the p tags, <p>(.|\n)*?<\/p>, or something in between.

I use Calibre's search and replace, so I can use only a single-line regex. I use RegExr to test the expressions.
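
The over-matching described in the question can be reproduced outside Calibre. The following is a minimal sketch using Python's `re` module on a shortened copy of the sample; the variable names (`sample`, `quote_hits`, `para_hits`) are illustrative and not part of the original question, and Calibre's own regex engine may differ in details.

```python
import re

# Shortened copy of the sample from the question (illustrative only).
sample = (
    '<div class="vung_doc" id="vung_doc">\n'
    '<p>Volume 1: The Mysterious Driver\n'
    '</p>\n'
    '<p>He screamed, "I\'ll kill you!"\n'
    '</p>\n'
)

# "Everything between quotes" also picks up the attribute values of the div.
quote_hits = re.findall(r'"(.*?)"', sample, flags=re.DOTALL)

# "Everything between p tags" returns whole paragraphs, quoted or not.
para_hits = re.findall(r'<p>(.*?)</p>', sample, flags=re.DOTALL)

print(quote_hits)  # ['vung_doc', 'vung_doc', "I'll kill you!"]
print(para_hits)
```

Neither pattern alone knows whether a quotation mark sits inside a p element or inside a tag attribute, which is why both return more than the two quoted sentences.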

Answer 1

Score: 1

Don't use regex to parse HTML

You cannot, and must not, parse structured text like XML/HTML with tools designed to process raw text lines. If you need to process XML/HTML, use an XML/HTML parser. The great majority of languages have built-in support for parsing XML, and there are dedicated tools like xidel, xmlstarlet, or xmllint if you need a quick one-off from a command-line shell.

With xidel:

xidel -e '//p/extract(text(),"&quot;(.+)&quot;",1,"s")[.]' file

Credits to Reino.

With xidel and grep:

xidel -e '//p' file | grep -oP '"\K[^"]+'

Output

I'll kill you!
Haha, what a scene! The Great Detective Song Lang, actually killing his superior and partner with his very own hands! I can't wait to see the headlines in the newspapers tomorrow!

Here, the grep regex is applied only to the text extracted from the p elements.
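
The same two-step idea (let a parser isolate the p text, then apply a regex only to that text) can be sketched with Python's standard library. This is a hedged illustration, not the answer's method: the class `ParagraphText` and the function `quoted_in_paragraphs` are hypothetical names introduced here, and the sketch assumes straight double quotes and collapses line breaks inside each quotation.

```python
import re
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text content of each <p> element (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs defaults to True, so &quot; and &#39; are decoded.
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def _flush(self):
        # Store the text gathered for the paragraph that is ending.
        if self._in_p:
            self.paragraphs.append("".join(self._chunks))
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # a new <p> implicitly ends an unclosed one
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def quoted_in_paragraphs(html):
    """Return the double-quoted passages found inside <p> elements."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    quotes = []
    for text in parser.paragraphs:
        # The regex only ever sees paragraph text, never tag attributes.
        for quote in re.findall(r'"([^"]+)"', text):
            quotes.append(" ".join(quote.split()))  # collapse line breaks
    return quotes
```

Calling `quoted_in_paragraphs` on the sample from the question should skip the `vung_doc` attribute values, because attribute text never reaches `handle_data`.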

huangapple
  • Posted on May 7, 2023, 05:44:56
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76191296.html