Why is DOMDocument converting both html quote-entities to actual quotes?
Question
I've been at this for half a day, so now it's time to ask for help.
What I'd like is for DOMDocument to leave existing entities and utf-8 characters alone. I'm now thinking this is not possible using only DOMDocument.
$html =
'<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
</head>
<body>
<p>&#39; &quot; & &lt; © 庭</p>
</body>
</html>';
Then I run:
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR);
echo $dom->saveHTML();
And get entity output:
input: &#39; &quot; & &lt; © 庭
output: ' " &amp; &lt; &copy; &#24237;
Why is DOMDocument converting &#39; and &quot; to actual quote marks? The only thing it didn't touch was &lt;.
I'm pretty sure the copyright symbol is being converted because DOMDocument doesn't think the input HTML is UTF-8, but I'm utterly confused about why it's converting the quotes back to non-entities.
I thought the mb_convert_encoding trick would fix the UTF-8 issue, but it hasn't. Neither has the $dom->loadHTML('<?xml encoding="utf-8" ?>'.$html); trick.
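One way to see where the conversion happens is to look at the parsed text node rather than the serialized output. The following is a minimal sketch, not part of the original post; it uses a shortened, ASCII-only version of the sample markup so that character encoding does not come into play.

```php
<?php
// Minimal sketch: inspect the parsed text node instead of the serialized HTML.
$dom = new DOMDocument();
$dom->loadHTML(
    '<!doctype html><html><head><meta charset="utf-8"></head>'
    . '<body><p>&#39; &quot; &lt;</p></body></html>',
    LIBXML_NOERROR
);

// By this point the parser has resolved every character reference,
// so the DOM holds plain characters and no record of how they were written.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;

var_dump($text);
```

If the text node already holds literal quote characters, the original spelling (&#39; or &quot;) is gone by the time saveHTML() runs, and the serializer only re-escapes what it has to, such as & and <.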
Answer 1
Score: 1
I tested about a dozen HTML parsers written in PHP, and the only one that worked as expected was HTML5DOMDocument, recommended in this Stack Overflow answer.
require 'vendor/autoload.php';
$dom = new IvoPetkov\HTML5DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR);
echo $dom->saveHTML();
Result:
input: &#39; &quot; &lt; © 庭 &nbsp; &
output: &#39; &quot; &lt; © 庭 &nbsp; &amp;
Answer 2
Score: 0
You need to pass a specific node to the saveHTML() method. This makes it take a minimalist approach to encoding entities: it still encodes the ones that are necessary. I don't think there's a way to prevent all entity encoding, but it won't try to encode every character it can.
$html = $dom->saveHTML($dom);
// ' " &amp; &lt; © 庭
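A self-contained sketch of this approach follows. It is an illustration rather than the answerer's exact code: it passes $dom->documentElement (an assumption; the answer passes $dom itself), which also means the doctype is not part of the output. Whether © and 庭 come back unescaped may depend on the PHP and libxml2 versions, so no expected output is shown.

```php
<?php
// Sample markup modeled on the question; the bare "&" is intentionally left unescaped.
$html = '<!doctype html><html><head><meta charset="utf-8"></head>'
      . '<body><p>&#39; &quot; & &lt; © 庭</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR);

// Assumption: serialize from the root element instead of the whole document.
// Passing a node asks saveHTML() to dump just that subtree.
$out = $dom->saveHTML($dom->documentElement);

echo $out, PHP_EOL;
```

In this sketch the quotes are expected to stay literal, while & and < are still written as &amp; and &lt; because they would otherwise be ambiguous in HTML text.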