Is there a way to use PHP's DOMDocument to parse HTML containing JavaScript which itself contains HTML strings?

Question

I have an HTML string containing a <script> tag. The script creates a shadow DOM element via window.customElements.define(...), and the element's constructor assigns an innerHTML value that defines the custom element's HTML as a string.

This is valid HTML, and I'm attempting to process it using PHP's DOMDocument. However, DOMDocument appears to be confused by the content of the innerHTML string and starts treating its content as nodes it needs to process.

Is there any way to work around this so that it no longer confuses DOMDocument?

The pertinent part of the HTML looks somewhat like this:

<script>
class ExampleElement extends HTMLElement {
   constructor() {
      super();
      this.attachShadow({ mode: 'open' })
          .innerHTML = '<label>this is what confuses DOMDocument</label>'
  }
}
window.customElements.define('example-element', ExampleElement);
</script>

This is then processed in PHP like this:

$doc = new DOMDocument();
$doc->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

libxml then generates an error about the </label> not matching: "Unexpected end tag : label in Entity"

Obviously I can either
- break up the innerHTML string using string concatenation, so that DOMDocument no longer identifies the <label> and </label> as tags
or
- build the element's content via document.createElement(...) etc.

However, since this is valid HTML, it would be useful to know whether it can be parsed as it stands.
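
For illustration, the string-concatenation workaround might look like the sketch below. It assumes that only the "</" sequence has to be kept out of the script source, which has not been checked against every libxml version; `markup` is a name chosen for this example only.

```javascript
// Split the literal between "<" and "/" so the character sequence "</"
// never appears in the script's source text. The value assigned at
// runtime is unchanged by the concatenation.
const markup = '<label>this is what confuses DOMDocument<' + '/label>';

// In the custom element this value would be assigned to
// this.attachShadow({ mode: 'open' }).innerHTML; here it is just printed.
console.log(markup);
```

The cost is readability: every closing tag in every HTML string inside the script has to be split this way.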

Answer 1

Score: 1

Per: https://bugs.php.net/bug.php?id=80095

> libxml uses HTML 4 rules which say that </ is an ending tag. Even if the tag doesn't match the last opening tag. To avoid this problem, write the ending tags in your script as "<\/".

So change </label> to <\/label>.

It will parse cleanly, and JavaScript should interpret \/ as a literal / in the string.
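
As a quick check of the JavaScript side, the sketch below compares the two spellings and shows one way the rewrite could be applied to a script's source text. `escapeClosingTags` is a hypothetical helper written for this example, not part of any library. A blanket replacement like this assumes the script contains no raw strings (for example `String.raw` templates), where the added backslash would be preserved.

```javascript
// "\/" is an identity escape in ordinary string literals, so both
// spellings evaluate to the same string.
const plain = '</label>';
const escaped = '<\/label>';
console.log(escaped === plain); // true

// Hypothetical helper: rewrite every "</" in a script's source text to
// "<\/" so that a parser following HTML 4 rules sees no end tag in it.
function escapeClosingTags(scriptSource) {
  return scriptSource.replace(/<\//g, '<\\/');
}

const source = ".innerHTML = '<label>this is what confuses DOMDocument</label>'";
console.log(escapeClosingTags(source));
```

Editing the script by hand as the answer suggests is the simplest route. A helper along these lines would only matter if the script source is generated or cannot be edited directly.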

Answer 2

Score: -2

You can use the following code to parse HTML containing JavaScript with PHP's DOMDocument. It makes libxml collect the parse errors internally instead of emitting warnings; it does not change how the script content is parsed.

```php
$html = '<script>
class ExampleElement extends HTMLElement {
   constructor() {
      super();
      this.attachShadow({ mode: \'open\' })
          .innerHTML = \'<label>this is what confuses DOMDocument</label>\'
  }
}
window.customElements.define(\'example-element\', ExampleElement);
</script>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // Collect libxml errors internally instead of emitting warnings
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();             // Discard the collected errors
libxml_use_internal_errors(false); // Turn internal error collection back off

// Continue processing the DOMDocument object as required
```

huangapple
  • Posted on 2023-06-15 19:35:40
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76482066.html