Replace multiple p tags containing line breaks or non-breaking spaces with a single line break using HTML Agility Pack

Question

How can I remove runs of empty p tags, p tags containing only a non-breaking space, or p tags containing only a line break, and replace each run with a single p tag containing a line break? I assume something like HTML Agility Pack is a better solution than regex, but I am open to suggestions.

For example the following HTML:

<p>Test</p><p> </p><p> </p><p></p><p></p><p> </p><p>Test 2</p>

Or the following more complex example:

<p>Test</p><p> </p><p><br/></p><p><p></p><br data-mce-bogus="1"></p><p></p><p>Test 2</p>

Would get replaced with the following:

<p>Test</p><p><br></p><p>Test 2</p>

So effectively anything that could cause multiple line breaks in the HTML code would get replaced with just a single line break.

The HTML can be added and edited from multiple sources (e.g. a web application, an iOS app, an Android app) and through multiple rich text editor types, so the way line breaks have been added is not necessarily consistent. That is why I need to find the different kinds of line break and replace them with a single <p><br></p>.

Answer 1

Score: 0

With a little help from ChatGPT I have come up with the following code:

using System;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlCleaner
{
    // Wrapper added so the snippet compiles; the class and method names are placeholders.
    public static string CollapseEmptyParagraphs(string value)
    {
        // Load the HTML document
        var doc = new HtmlDocument();
        doc.LoadHtml(value);

        // Select all the p tags (SelectNodes returns null when nothing matches)
        var pTags = doc.DocumentNode.SelectNodes("//p");

        // If no p tags were found, return the input unchanged
        if (pTags == null || pTags.Count == 0)
            return value;

        // Iterate the p tags
        for (int i = 0; i < pTags.Count; i++)
        {
            var p = pTags[i];

            // The current p tag counts as a blank line if it
            bool isBlank =
                p.InnerHtml.Trim() == "&nbsp;" ||          // contains only a &nbsp;
                String.IsNullOrWhiteSpace(p.InnerHtml) ||  // or is empty or whitespace
                (p.ChildNodes.Any(x => x.Name == "br") &&   // or contains only "br" tags
                 p.ChildNodes.Where(x => x.Name != "br")    // (and possibly whitespace either side)
                     .All(x => x.InnerHtml.Trim() == "&nbsp;" || String.IsNullOrWhiteSpace(x.InnerHtml)));
            if (!isBlank)
                continue;

            // Change to a break
            p.InnerHtml = "<br>";

            // If the previous p tag (in document order) is also a break, remove the current one.
            // Remove() detaches the node from its actual parent; doc.DocumentNode.RemoveChild(...)
            // throws when the p tag is not a direct child of the document node.
            if (i > 0 && pTags[i - 1].InnerHtml == "<br>")
                p.Remove();
        }

        // Return the modified html
        return doc.DocumentNode.OuterHtml;
    }
}
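
One limitation of the loop above is that it compares each p tag with the previous p tag in document order, so two blank paragraphs separated by other content (a div, a table) would still be merged. Below is a rough sketch of a variant that only collapses blank paragraphs that are adjacent siblings. The names `BlankParagraphCollapser`, `IsBlank`, `PreviousSignificantSibling` and `Collapse` are made up for illustration, and the definition of "blank" (no visible text, only `br` or nested `p` elements inside) is an assumption about what the question intends. How HTML Agility Pack parses nested p tags may vary, so treat this as a starting point rather than a tested solution.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class BlankParagraphCollapser
{
    // Assumption: a paragraph is "blank" when it has no visible text and its
    // only element descendants are <br> or nested <p> tags.
    private static bool IsBlank(HtmlNode p)
    {
        // DeEntitize turns &nbsp; into U+00A0, which .NET treats as whitespace.
        string text = HtmlEntity.DeEntitize(p.InnerText);
        return string.IsNullOrWhiteSpace(text)
            && p.Descendants().All(n => n.NodeType != HtmlNodeType.Element
                                        || n.Name == "br" || n.Name == "p");
    }

    // Nearest previous sibling that is not a whitespace-only text node.
    private static HtmlNode PreviousSignificantSibling(HtmlNode node)
    {
        HtmlNode prev = node.PreviousSibling;
        while (prev != null && prev.NodeType == HtmlNodeType.Text
               && string.IsNullOrWhiteSpace(prev.InnerText))
        {
            prev = prev.PreviousSibling;
        }
        return prev;
    }

    public static string Collapse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var pTags = doc.DocumentNode.SelectNodes("//p");
        if (pTags == null)
            return html;

        // Classify every paragraph before changing the tree, and keep only the
        // outermost blank ones so nested blank paragraphs are handled once.
        var blank = new HashSet<HtmlNode>(pTags.Where(IsBlank));
        var outermost = pTags
            .Where(p => blank.Contains(p) && !p.Ancestors("p").Any(blank.Contains))
            .ToList();

        // Blank paragraphs that have been normalised to <p><br></p> and kept.
        var kept = new HashSet<HtmlNode>();

        foreach (HtmlNode p in outermost)
        {
            HtmlNode prev = PreviousSignificantSibling(p);
            if (prev != null && kept.Contains(prev))
            {
                // Directly follows a kept blank paragraph: drop this one.
                p.Remove();
            }
            else
            {
                // First blank paragraph of a run: normalise it.
                p.InnerHtml = "<br>";
                kept.Add(p);
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

It would be called as `BlankParagraphCollapser.Collapse(value)`. Classifying the paragraphs up front keeps the adjacency check independent of edits made earlier in the loop.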

huangapple
  • Posted on 2023-02-10 05:56:18
  • When reposting, please keep this link: https://go.coder-hub.com/75404836.html