X-ray. How to parse non-nested structure into array of objects?

Question

I'm trying to collect data with x-ray from a page that is structured like this:

    <h1>Page title</h1>
    <article>
      <h2 id="first">Title 1</h2>
      <h3>Subtitle 1</h3>
      <ul>
        <li>Element 1
        <li>Element 2
        <li>Element 3
      </ul>
      <h2 id="second">Title 2</h2>
      <h3>Subtitle 2</h3>
      <h2 id="third">Title 3</h2>
      <h3>Subtitle 3</h3>
      <ul>
        <li>Element 1
        <li>Element 2
        <li>Element 3
      </ul>
    </article>

The article is split into sections by <h2> elements. Each section contains a subtitle and may contain a list of items. My goal is to get an object with this structure:

    type Result = {
      pageTitle: string,
      sections: Array<{ subtitle?: string, elements?: string[] }>,
    }

From that example structure I expect this output:

    {
      pageTitle: "Page title",
      sections: [
        {
          subtitle: "Subtitle 1",
          elements: ["Element 1", "Element 2", "Element 3"]
        },
        {
          subtitle: "Subtitle 2",
          elements: [] // or any falsy value
        },
        {
          subtitle: "Subtitle 3",
          elements: ["Element 1", "Element 2", "Element 3"]
        }
      ]
    }

I've tried:

    xray(url, {
      pageTitle: "h1 | trim", // where trim is a user-defined filter
      sections: xray("article", [{
        subtitle: "h3",
        elements: ['h3 ~ ul li']
      }])
    })

But I've figured out that this doesn't work as expected: there is only one <article> tag on the page, and [] tells xray to iterate over whatever the selector (article in my case) returns.

I've also tried:

    xray(url, {
      pageTitle: "h1 | trim", // where trim is a user-defined filter
      sections: xray("h2", [{
        subtitle: "h3",
        elements: ['h3 ~ ul li']
      }])
    })

This returns 0 results, probably because xray("h2", /* other code */) scopes the selection to the h2 elements and nothing else, and my h2 elements don't contain any nested elements.

So is there a way to get an array of objects from a non-nested HTML structure?

Answer 1

Score: 1

The x-ray library doesn't provide an easy way to capture sibling elements within its selector syntax. It primarily works with parent-child relationships, and the structure you're trying to scrape doesn't fit that pattern.

Ideally, a library like Puppeteer would be more suitable, but x-ray combined with jsdom can handle it too.

The solution is to pre-process the HTML so that each section is wrapped in its own container, then scrape the new structure with x-ray.

Steps:

  1. Load the HTML into jsdom.
  2. Iterate over each <h2> element and collect its <h3> and <ul> siblings until the next <h2>.
  3. Wrap each group in a new <div>.
  4. Pass the resulting HTML to x-ray.

Code:

    const { JSDOM } = require("jsdom");
    const Xray = require("x-ray");

    // "h1 | trim" needs a trim filter registered on the x-ray instance.
    const xray = Xray({
      filters: {
        trim: (value) => (typeof value === "string" ? value.trim() : value),
      },
    });

    const html = "..."; // Your HTML here
    const dom = new JSDOM(html);
    const document = dom.window.document;

    document.querySelectorAll("h2").forEach((h2) => {
      // Wrap each <h2> and its following siblings in a container <div>.
      // The "section" class is an arbitrary name, used so that other
      // <div> elements on the page are not matched later.
      const wrapper = document.createElement("div");
      wrapper.className = "section";
      h2.parentNode.insertBefore(wrapper, h2);

      let sibling = h2;
      do {
        // Read the next sibling before moving the node: once it is inside
        // the wrapper, the node no longer has a next sibling.
        const next = sibling.nextElementSibling;
        wrapper.appendChild(sibling);
        sibling = next;
      } while (sibling && sibling.tagName !== "H2");
    });

    const processedHtml = document.body.innerHTML;

    xray(processedHtml, {
      pageTitle: "h1 | trim",
      sections: xray("div.section", [{
        subtitle: "h3",
        elements: ["ul li"]
      }])
    })((err, result) => {
      if (err) throw err;
      console.log(result);
    });

This relies heavily on DOM manipulation, which may not be ideal.
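
If changing the DOM is a concern, the same grouping can probably be done by walking the siblings after each <h2> and building the section objects directly. The following is only a minimal sketch, not part of the original answer: `collectSections` is a hypothetical helper name, and it assumes each node exposes `tagName`, `textContent`, `nextElementSibling`, and `querySelectorAll`, as jsdom elements do. The stand-in nodes at the bottom are made-up test data so the sketch runs without jsdom.

```javascript
// Build section objects by reading the siblings that follow each <h2>,
// without moving or wrapping any nodes.
function collectSections(h2Elements) {
  const sections = [];
  for (const h2 of h2Elements) {
    const section = { subtitle: undefined, elements: [] };
    // Visit following siblings until the next <h2> or the end of the parent.
    for (
      let node = h2.nextElementSibling;
      node && node.tagName !== "H2";
      node = node.nextElementSibling
    ) {
      if (node.tagName === "H3") {
        section.subtitle = node.textContent.trim();
      } else if (node.tagName === "UL") {
        for (const li of node.querySelectorAll("li")) {
          section.elements.push(li.textContent.trim());
        }
      }
    }
    sections.push(section);
  }
  return sections;
}

// Stand-in nodes (hypothetical test data) that mimic the few DOM
// properties used above, so the sketch can be tried without jsdom.
const makeNode = (tagName, textContent, items = []) => ({
  tagName,
  textContent,
  nextElementSibling: null,
  querySelectorAll: () => items.map((text) => ({ textContent: text })),
});

const demoNodes = [
  makeNode("H2", "Title 1"),
  makeNode("H3", "Subtitle 1"),
  makeNode("UL", "", ["Element 1", "Element 2"]),
  makeNode("H2", "Title 2"),
  makeNode("H3", "Subtitle 2"),
];
// Link each stand-in node to the one that follows it.
demoNodes.forEach((node, i) => {
  node.nextElementSibling = demoNodes[i + 1] || null;
});

const demoSections = collectSections(
  demoNodes.filter((node) => node.tagName === "H2")
);
console.log(JSON.stringify(demoSections, null, 2));
```

With jsdom, the helper would presumably be called as collectSections(document.querySelectorAll("article > h2")), and the page title read separately from the <h1>. This skips x-ray for the sections entirely, which is a different trade-off from the wrapping approach above.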

Answer 2

Score: 0

A potential approach, which may need some additional code around it:

    xray(url, {
      pageTitle: "h1 | trim",
      sections: xray("h2", [{
        subtitle: "+ h3",
        elements: ['+ h3 + ul li']
      }])
    });

huangapple
  • Published on July 31, 2023, 22:51:13
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76804784.html