Why is DOMDocument converting both HTML quote entities to actual quotes?

Question

I've been at this for half a day, so now it's time to ask for help.

What I'd like is for DOMDocument to leave existing entities and utf-8 characters alone. I'm now thinking this is not possible using only DOMDocument.

$html =
'<!doctype html>
<html lang="en">
    <head>
        <meta charset="utf-8">
    </head>
    <body>
        <p>' " & < © 庭</p>
    </body>
</html>';

Then I run:

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR);

echo $dom->saveHTML();

And get entity output:

input: ' " & < © 庭
output: ' " & < © 庭

Why is DOMDocument converting the single-quote and double-quote entities to actual quote marks? The only entity it left untouched was &lt;.

Pretty sure the copyright symbol is being converted because DOMDocument doesn't think the input html is utf-8, but I'm utterly confused why it's converting the quotes back to non-entities.

I thought the mb_convert_encoding trick would fix the utf-8 issue, but it hasn't.

Neither has the $dom->loadHTML('<?xml encoding="utf-8" ?>'.$html); trick.
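The quote behaviour described above can be reproduced in a minimal form. The sketch below is only an illustration: the helper name firstParagraphText and the sample markup (including the choice of &#39; and &quot;) are assumptions, not taken from the original post. It shows that the parser stores decoded characters in the text node, so the original entity spelling is already gone before saveHTML() runs.

```php
<?php
// Hypothetical helper for illustration: parse a snippet and return the
// text content of its first <p> element.
function firstParagraphText(string $html): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_NOERROR);
    return $dom->getElementsByTagName('p')->item(0)->textContent;
}

// Same paragraph, once with quote entities and once with literal quotes.
$withEntities = firstParagraphText('<p>&#39; &quot; &amp; &lt;</p>');
$withLiterals = firstParagraphText('<p>\' " &amp; &lt;</p>');

var_dump($withEntities);                    // string(7) "' " & <"
var_dump($withEntities === $withLiterals);  // bool(true)
```

Because both spellings produce the same text node, the serializer has no record of which quotes were written as entities. It only escapes what it has to when writing the markup back out.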

Answer 1

Score: 1

I tested about a dozen HTML parsers written in PHP, and the only one that worked as expected was HTML5DOMDocument, recommended in this Stack Overflow answer.

require 'vendor/autoload.php';

$dom = new IvoPetkov\HTML5DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR);

echo $dom->saveHTML();

Result:

input: ' " < © 庭   &
output: ' " < © 庭   &

Answer 2

Score: 0

You need to pass a node to the saveHTML() method (here, the document itself). With a node argument it takes a minimalist approach to encoding entities. It will still encode those that are necessary. I don't think there's a way to prevent all entity encoding from happening, but it won't try to encode every character it can.

$html = $dom->saveHTML($dom);
// ' " &amp; &lt; © 庭
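As a rough illustration of the difference, the sketch below serializes the same parsed document both ways. The sample markup is an assumption made for this example. It uses &copy; rather than a literal © so that the result does not depend on how the parser guesses the input encoding.

```php
<?php
// Hypothetical sample markup for illustration only.
$html = '<!doctype html><html><body><p>&quot; &amp; &lt; &copy;</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR);
$p = $dom->getElementsByTagName('p')->item(0);

// Whole-document serialization: per the question, non-ASCII characters
// such as the copyright sign tend to come back as entities here.
echo $dom->saveHTML(), "\n";

// Node serialization: characters that are significant in markup (& and <)
// are still escaped, while the copyright sign is written as a UTF-8 character.
$fragment = $dom->saveHTML($p);
echo $fragment, "\n";
```

Neither call restores &quot;, because the double quote does not need escaping inside element text.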

huangapple
  • Posted on April 20, 2023, 06:08:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76059153.html