
Why is DOMDocument converting both html quote-entities to actual quotes?



I've been at this for half a day, so now it's time to ask for help.

What I'd like is for DOMDocument to leave existing entities and utf-8 characters alone. I'm now thinking this is not possible using only DOMDocument.

    $html =
    '<!doctype html>
    <html lang="en">
    <head>
    <meta charset="utf-8">
    </head>
    <body>
    <p>&apos; &quot; & &lt; © 庭</p>
    </body>
    </html>';

Then I run:

    $dom = new DOMDocument();
    // LIBXML_NOERROR suppresses libxml's parse error reports.
    $dom->loadHTML($html, LIBXML_NOERROR);
    echo $dom->saveHTML();

And get entity output:

    input:  &apos; &quot; & &lt; ©
    output: ' " &amp; &lt; &copy; &#24237;

Why is DOMDocument converting &apos; and &quot; to actual quote marks? The only thing it didn't touch was &lt;.

Pretty sure the copyright symbol is being converted because DOMDocument doesn't think the input html is utf-8, but I'm utterly confused why it's converting the quotes back to non-entities.

I thought the mb_convert_encoding trick would fix the utf-8 issue, but it hasn't.

Neither has the $dom->loadHTML('<?xml encoding="utf-8" ?>'.$html); trick.
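
For reference, here is a minimal sketch of the XML-prolog trick, assuming a shortened version of the input above (the sample string and `$text` are illustrative, and `&#39;` stands in for the apostrophe entity). It also checks what the parsed text node holds: the parser resolves entity references into plain characters, so by the time saveHTML() runs the DOM no longer records whether a quote was written as an entity.

```php
<?php
// Hypothetical, shortened sample; save this file as UTF-8.
$html = '<!doctype html><html lang="en"><head><meta charset="utf-8"></head>'
      . '<body><p>&#39; &quot; &amp; &lt; © 庭</p></body></html>';

$dom = new DOMDocument();
// The prolog only hints at the input encoding; it does not change entity handling.
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $html, LIBXML_NOERROR);

// Entity references are resolved during parsing, so the text node
// stores the characters themselves, not the way they were written.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
var_dump($text);
```

Because the text node only contains characters, the serializer decides on its own which of them to write back as entities.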


Score: 1

I tested about a dozen HTML parsers written in PHP, and the only one that worked as expected was HTML5DOMDocument, recommended in this Stack Overflow answer.

    require 'vendor/autoload.php';
    $dom = new IvoPetkov\HTML5DOMDocument();
    $dom->loadHTML($html, LIBXML_NOERROR);
    echo $dom->saveHTML();


    input:  ' " < ©   &
    output: ' " < ©   &


Score: 0

You need to pass a specific node to the saveHTML() method. That makes it take a minimalist approach to encoding entities. It will still encode those that are necessary. I don't think there's a way to prevent all entity encoding from happening, but it won't try to encode every entity it can.

    $html = $dom->saveHTML($dom);
    // ' " &amp; &lt; © 庭
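
To make the difference concrete, here is a minimal sketch that serializes the same document both ways. The sample string and the variable names are my own assumptions, and it passes the `<p>` element rather than the document itself. How plain saveHTML() writes the non-ASCII characters can depend on the libxml2 version, so the sketch prints both results instead of promising exact output.

```php
<?php
// Hypothetical sample mirroring the question's input; save this file as UTF-8.
// &#39; is the numeric form of the apostrophe entity.
$html = '<!doctype html><html lang="en"><head><meta charset="utf-8"></head>'
      . '<body><p>&#39; &quot; &amp; &lt; © 庭</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR);

// Whole document, no node argument: non-ASCII characters may come back as entities.
$full = $dom->saveHTML();

// Only the <p> element, passed as the node argument: minimal escaping.
$p = $dom->getElementsByTagName('p')->item(0);
$fragment = $dom->saveHTML($p);

echo $full, "\n", $fragment, "\n";
```

In both results the quotes stay literal while & and < are still escaped, since those two have to be escaped in text content.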

  • Published on April 20, 2023, 06:08:09
