April 20, 2023, 06:08:09

Why is DOMDocument converting both html quote-entities to actual quotes?

Question

I've been at this for half a day, so now it's time to ask for help.

What I'd like is for DOMDocument to leave existing entities and utf-8 characters alone. I'm now thinking this is not possible using only DOMDocument.

$html =
'<!doctype html>
<html lang="en">
    <head>
        <meta charset="utf-8">
    </head>
    <body>
        <p>&#39; &quot; & &lt; © 庭</p>
    </body>
</html>';

Then I run:

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR);
echo $dom->saveHTML();

And get entity output:

input: &#39; &quot; & &lt; © 庭
output: ' " &amp; &lt; &copy; &#24237;

Why is DOMDocument converting &#39; and &quot; to actual quote marks? The only thing it didn't touch was &lt;.

Pretty sure the copyright symbol is being converted because DOMDocument doesn't think the input html is utf-8, but I'm utterly confused why it's converting the quotes back to non-entities.

I thought the mb_convert_encoding trick would fix the utf-8 issue, but it hasn't.

Neither has the $dom->loadHTML('<?xml encoding="utf-8" ?>'.$html); trick.

Answer 1

Score: 1

I tested about a dozen HTML parsers written in PHP, and the only one that worked as expected was HTML5DOMDocument, recommended in this Stack Overflow answer.

require 'vendor/autoload.php';
$dom = new IvoPetkov\HTML5DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR);
echo $dom->saveHTML();

Result:

input: &#39; &quot; &lt; © 庭 &nbsp; &
output: &#39; &quot; &lt; © 庭 &nbsp; &amp;

Answer 2

Score: 0

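You need to pass a specific node to the saveHTML() method (the snippet here passes the document itself). With a node argument it takes a minimalist approach to encoding entities: it still encodes those that are necessary, but it won't try to encode every entity it can. I don't think there's a way to prevent all entity encoding from happening.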

$html = $dom->saveHTML($dom);
// ' " &amp; &lt; © 庭
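
To compare the two call styles side by side, here is a minimal sketch. It assumes the same input as the question; passing `$dom->documentElement` is my own choice of "a specific element" and is not taken from the answer, and the variable names are illustrative. Exact serialization can differ between PHP and libxml2 versions, so no expected output is shown.

```php
<?php
// Sample input mirroring the question's paragraph (assumed for illustration).
$html = '<!doctype html>
<html lang="en">
    <head>
        <meta charset="utf-8">
    </head>
    <body>
        <p>&#39; &quot; & &lt; © 庭</p>
    </body>
</html>';

$dom = new DOMDocument();
// LIBXML_NOERROR suppresses parser error reports such as the one for the bare "&".
$dom->loadHTML($html, LIBXML_NOERROR);

// No argument: serializes the whole document, doctype included.
$wholeDocument = $dom->saveHTML();

// Node argument: serializes only that node and its descendants.
// documentElement is the <html> element (hypothetical choice of node).
$fromRootElement = $dom->saveHTML($dom->documentElement);

echo $wholeDocument, "\n---\n", $fromRootElement, "\n";
```

Because the doctype is a sibling of the `<html>` element rather than a child, the second call does not emit it; prepend it yourself if the output needs to be a complete document.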
