2023-07-31 22:51:13

X-ray. How to parse non-nested structure into array of objects?

Question

I'm trying to collect data with x-ray from a page that is structured like this:

<h1>Page title</h1>
<article>
  <h2 id="first">Title 1</h2>
  <h3>Subtitle 1</h3>
  <ul>
    <li>Element 1
    <li>Element 2
    <li>Element 3
  </ul>
  <h2 id="second">Title 2</h2>
  <h3>Subtitle 2</h3>
  <h2 id="third">Title 3</h2>
  <h3>Subtitle 3</h3>
  <ul>
    <li>Element 1
    <li>Element 2
    <li>Element 3
  </ul>
</article>

The article is split into sections by <h2> elements. Each section contains a subtitle and may contain a list of items. My goal is to get an object with this structure:

type Result = { 
  pageTitle: string,
  sections: [{ subtitle?: string, elements?: string[] }],
}

From that example structure I expect output:

{
  pageTitle: "Page title",
  sections: [
    {
      subtitle: "Subtitle 1",
      elements: ["Element 1", "Element 2", "Element 3"]
    },
    {
      subtitle: "Subtitle 2",
      elements: [] // or any falsy value
    },
    {
      subtitle: "Subtitle 3",
      elements: ["Element 1", "Element 2", "Element 3"]
    }
  ]
}

I've tried:

xray(url, {
  pageTitle: "h1 | trim", // where trim is a user-defined filter
  sections: xray("article", [{
    subtitle: "h3",
    elements: ['h3 ~ ul li']
  }])
})

But I figured out that this doesn't work as expected: there is only one article tag on the page, and [] tells x-ray to iterate over whatever the selector (article in my case) returns.

I've also tried:

xray(url, {
  pageTitle: "h1 | trim", // where trim is a user-defined filter
  sections: xray("h2", [{
    subtitle: "h3",
    elements: ['h3 ~ ul li']
  }])
})

This returns 0 results, probably because xray("h2", /* other code */) scopes the selection to the h2 elements themselves, and my h2 elements don't contain any nested elements.

So is there a way to get an array of objects from a non-nested HTML structure?

Answer 1

Score: 1

The x-ray library doesn't provide an easy way to capture sibling elements within its syntax. It primarily works with parent-child relationships, and the structure you're trying to scrape doesn't conform to that pattern.

Ideally, a library like Puppeteer would be more suitable, but x-ray combined with jsdom can handle it too.

The solution is to pre-process the HTML to encapsulate each section within a separate container, then scrape that new structure with x-ray.

Steps:

1. Load the HTML into jsdom.
2. Iterate over each <h2> element and collect the <h3> and <ul> siblings that follow it, up to the next <h2>.
3. Wrap each group in a new <div>.
4. Pass the resulting HTML to x-ray.

Code:

    const { JSDOM } = require("jsdom");
    // "trim" is not built in; register it as a filter so "h1 | trim" works.
    const xray = require("x-ray")({
      filters: {
        trim: (value) => (typeof value === "string" ? value.trim() : value),
      },
    });

    const html = "..."; // Your HTML here

    const dom = new JSDOM(html);
    const document = dom.window.document;

    // Wrap each <h2> and the siblings that follow it (up to the next <h2>) in a <div>.
    document.querySelectorAll("h2").forEach((h2) => {
      const section = document.createElement("div");
      section.className = "section";
      h2.parentNode.insertBefore(section, h2);

      let node = h2;
      while (node) {
        // Read the next sibling before moving the node: once the node is inside
        // the wrapper, nextElementSibling no longer points at its old neighbor.
        const next = node.nextElementSibling;
        section.appendChild(node);
        node = next && next.tagName !== "H2" ? next : null;
      }
    });

    const processedHtml = document.body.innerHTML;

    xray(processedHtml, {
      pageTitle: "h1 | trim",
      sections: xray("div.section", [{
        subtitle: "h3",
        elements: ["ul li"]
      }])
    })((err, result) => {
      if (err) throw err;
      console.log(result);
    });

This approach relies heavily on DOM manipulation, which may not be ideal.
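
If mutating the DOM is a concern, the grouping can also be done directly on the list of sibling elements, without wrapper <div>s and without a second x-ray pass. The following is only a minimal sketch under a few assumptions: groupSections is a hypothetical helper (it is not part of x-ray or jsdom), and it assumes every element exposes tagName, textContent, and querySelectorAll, as DOM elements do.

```javascript
// Hypothetical helper (not part of x-ray or jsdom): groups a flat list of
// sibling elements into sections, starting a new section at every <h2>.
function groupSections(elements) {
  const sections = [];
  let current = null;

  for (const el of elements) {
    if (el.tagName === "H2") {
      // Each <h2> opens a new section. The list starts out empty, so a
      // section without a <ul> still ends up with an empty array.
      current = { subtitle: undefined, elements: [] };
      sections.push(current);
    } else if (current && el.tagName === "H3") {
      current.subtitle = el.textContent.trim();
    } else if (current && el.tagName === "UL") {
      for (const li of el.querySelectorAll("li")) {
        current.elements.push(li.textContent.trim());
      }
    }
    // Anything before the first <h2>, and any other tag, is ignored.
  }

  return sections;
}
```

With jsdom, the input for the page in the question would presumably be Array.from(document.querySelector("article").children), and pageTitle can be read separately from the <h1>. The trade-off is that the extraction logic now lives in plain JavaScript rather than in x-ray selectors.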

Answer 2

Score: 0

A potential approach (I would also suggest adding some more code to support it):

xray(url, {
  pageTitle: "h1 | trim",
  sections: xray("h2", [{
    subtitle: "+ h3",
    elements: ['+ h3 + ul li']
  }])
});
