X-ray. How to parse non-nested structure into array of objects?

Question

I'm trying to collect data with x-ray from a page that is structured like this:

<h1>Page title</h1>
<article>
  <h2 id="first">Title 1</h2>
  <h3>Subtitle 1</h3>
  <ul>
    <li>Element 1
    <li>Element 2
    <li>Element 3
  </ul>
  <h2 id="second">Title 2</h2>
  <h3>Subtitle 2</h3>
  <h2 id="third">Title 3</h2>
  <h3>Subtitle 3</h3>
  <ul>
    <li>Element 1
    <li>Element 2
    <li>Element 3
  </ul>
</article>

The article is split into sections by <h2> elements. Each section contains a subtitle and may contain a list of items. My goal is to get an object with this structure:

type Result = { 
  pageTitle: string,
  sections: [{ subtitle?: string, elements?: string[] }],
}

From that example structure I expect this output:

{
  pageTitle: "Page title",
  sections: [
    {
      subtitle: "Subtitle 1",
      elements: ["Element 1", "Element 2", "Element 3"]
    },
    {
      subtitle: "Subtitle 2",
      elements: [] // or any falsy value
    },
    {
      subtitle: "Subtitle 3",
      elements: ["Element 1", "Element 2", "Element 3"]
    }
  ]
}

I've tried:

xray(url, {
  pageTitle: "h1 | trim", // where trim is a filter I have defined
  sections: xray("article", [{
    subtitle: "h3",
    elements: ['h3 ~ ul li']
  }])
})

But I've figured out that this doesn't work as expected: there is only one <article> tag on the page, and [] tells xray to iterate over whatever the selector (article in my case) returns.

I've also tried:

xray(url, {
  pageTitle: "h1 | trim", // where trim is a filter I have defined
  sections: xray("h2", [{
    subtitle: "h3",
    elements: ['h3 ~ ul li']
  }])
})

This returns 0 results, probably because xray("h2", /* other code */) scopes the selection to the h2 elements only, and my h2s don't contain any nested elements.

So is there a way to get an array of objects from a non-nested HTML structure?

Answer 1

Score: 1

The x-ray library doesn't provide an easy way to capture sibling elements within its selector syntax. It works primarily with parent-child relationships, and the structure you're trying to scrape doesn't fit that pattern.

Ideally, a library like Puppeteer would be more suitable, but x-ray combined with jsdom can handle it too.

The solution is to pre-process the HTML so that each section is wrapped in its own container, then scrape that new structure with x-ray.

Steps:

  1. Load the HTML into jsdom.
  2. Iterate over each <h2> element and collect its <h3> and <ul> siblings until the next <h2>.
  3. Wrap each group in a new <div>.
  4. Pass the resulting HTML to x-ray.

Code:

    const { JSDOM } = require("jsdom");
    // Register the `trim` filter that the selectors below rely on.
    const xray = require("x-ray")({
      filters: { trim: (value) => (typeof value === "string" ? value.trim() : value) },
    });

    const html = "..."; // Your HTML here

    const dom = new JSDOM(html);
    const document = dom.window.document;

    document.querySelectorAll("h2").forEach((h2) => {
      // Put a wrapper in front of the <h2>, then move the <h2> and its
      // following siblings into it until the next <h2> is reached.
      const wrapper = document.createElement("div");
      wrapper.className = "section";
      h2.parentNode.insertBefore(wrapper, h2);

      let node = h2;
      do {
        // Read the next sibling before moving the node: once the node has
        // been appended to the wrapper, its nextElementSibling is null.
        const next = node.nextElementSibling;
        wrapper.appendChild(node);
        node = next;
      } while (node && node.tagName !== "H2");
    });

    const processedHtml = document.body.innerHTML;

    xray(processedHtml, {
      pageTitle: "h1 | trim",
      sections: xray("div.section", [{
        subtitle: "h3",
        elements: ["ul li"]
      }])
    })((err, result) => {
      if (err) throw err;
      console.log(result);
    });

It makes extensive use of DOM manipulation, which may not be ideal.
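If rewriting the DOM is a concern, one option is to read the sections straight from the parsed tree and skip the wrapping step. The following is only a minimal sketch: `collectSections` is a name made up for illustration, and it assumes the `<h2>`, `<h3>` and `<ul>` elements are direct children of the `<article>`, as in the question's sample markup.

```javascript
// Sketch: group the flat children of <article> into sections without
// modifying the tree. `article` is assumed to be a DOM element (for
// example from jsdom) or any object that exposes `children`, `tagName`
// and `textContent`.
function collectSections(article) {
  const sections = [];
  let current = null;

  for (const node of Array.from(article.children)) {
    if (node.tagName === "H2") {
      // Every <h2> opens a new section.
      current = { subtitle: undefined, elements: [] };
      sections.push(current);
    } else if (current === null) {
      // Ignore anything that comes before the first <h2>.
      continue;
    } else if (node.tagName === "H3") {
      current.subtitle = node.textContent.trim();
    } else if (node.tagName === "UL") {
      for (const item of Array.from(node.children)) {
        if (item.tagName === "LI") {
          current.elements.push(item.textContent.trim());
        }
      }
    }
  }

  return sections;
}
```

With jsdom this could be called as `collectSections(new JSDOM(html).window.document.querySelector("article"))`. The page title would still have to be read separately, for instance from the `<h1>`.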

Answer 2

Score: 0

A potential approach is to use sibling combinators in the selectors. I would also suggest adding some more code around it to help with the approach:

xray(url, {
  pageTitle: "h1 | trim",
  sections: xray("h2", [{
    subtitle: "+ h3",
    elements: ['+ h3 + ul li']
  }])
});
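
As one example of such supporting code, a small normalizing step could bring whatever the scraper returns into the shape the question asks for. This is only a sketch: `normalizeResult` is a hypothetical helper, and it assumes the raw result looks like `{ pageTitle, sections: [{ subtitle, elements }] }` with any field possibly missing or padded with whitespace.

```javascript
// Sketch: trim strings and default missing lists to [] so that sections
// without a <ul> still come back with an `elements` array.
// `raw` is assumed to be the object handed to the x-ray callback.
function normalizeResult(raw) {
  const clean = (value) => (typeof value === "string" ? value.trim() : "");
  const sections = Array.isArray(raw.sections) ? raw.sections : [];

  return {
    pageTitle: clean(raw.pageTitle),
    sections: sections.map((section) => ({
      subtitle: clean(section.subtitle),
      elements: (Array.isArray(section.elements) ? section.elements : [])
        .map((text) => clean(text))
        .filter((text) => text.length > 0),
    })),
  };
}
```

It would be called inside the callback, for example `console.log(normalizeResult(result))`.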

Posted by huangapple on July 31, 2023, 22:51:13
When reposting, please keep the link to this article: https://go.coder-hub.com/76804784.html