X-ray. How to parse non-nested structure into array of objects?
Question
I'm trying to collect data with x-ray from a page that is structured like this:
<h1>Page title</h1>
<article>
  <h2 id="first">Title 1</h2>
  <h3>Subtitle 1</h3>
  <ul>
    <li>Element 1
    <li>Element 2
    <li>Element 3
  </ul>
  <h2 id="second">Title 2</h2>
  <h3>Subtitle 2</h3>
  <h2 id="third">Title 3</h2>
  <h3>Subtitle 3</h3>
  <ul>
    <li>Element 1
    <li>Element 2
    <li>Element 3
  </ul>
</article>
The article is split into sections by <h2> elements. Each section contains a subtitle and may contain a list of items. My goal is to get an object with this structure:
type Result = {
  pageTitle: string,
  sections: [{ subtitle?: string, elements?: string[] }],
}
From the example structure above, I expect this output:
{
  pageTitle: "Page title",
  sections: [
    {
      subtitle: "Subtitle 1",
      elements: ["Element 1", "Element 2", "Element 3"]
    },
    {
      subtitle: "Subtitle 2",
      elements: [] // or any falsy value
    },
    {
      subtitle: "Subtitle 3",
      elements: ["Element 1", "Element 2", "Element 3"]
    }
  ]
}
I've tried:
xray(url, {
  pageTitle: "h1 | trim", // where trim is a defined filter
  sections: xray("article", [{
    subtitle: "h3",
    elements: ['h3 ~ ul li']
  }])
})
But I've figured out that this doesn't work as expected: there is only one article tag on the page, and [] indicates that x-ray will iterate over whatever the selector (article in my case) returns.
I've also tried:
xray(url, {
  pageTitle: "h1 | trim", // where trim is a defined filter
  sections: xray("h2", [{
    subtitle: "h3",
    elements: ['h3 ~ ul li']
  }])
})
This returns 0 results, probably because xray("h2", /* other code */) scopes the selection to the h2 elements only, and my h2 elements don't contain any nested elements.
So is there a way to get an array of objects from a non-nested HTML structure?
Answer 1
Score: 1
The x-ray library doesn't provide an easy way to capture sibling elements within its selector syntax. It primarily works with parent-child relationships, and the structure you're trying to scrape doesn't conform to that pattern.
Ideally, a library like Puppeteer would be more suitable, but x-ray combined with jsdom can handle it too.
The solution is to pre-process the HTML so that each section is wrapped in its own container, then scrape that new structure with x-ray.
Steps:
- Load the HTML into jsdom.
- Iterate over each <h2> element and collect the <h3> and <ul> siblings that follow it, up to the next <h2>.
- Wrap each group in a new <div>.
- Pass the processed HTML to x-ray.
Code:
const { JSDOM } = require("jsdom");
const Xray = require("x-ray");

// Register the "trim" filter that the "h1 | trim" selector relies on.
const xray = Xray({
  filters: {
    trim: (value) => (typeof value === "string" ? value.trim() : value),
  },
});

const html = ""; // Your HTML here

const dom = new JSDOM(html);
const document = dom.window.document;

// Wrap each <h2> and the siblings that follow it (up to the next <h2>) in a <div>.
document.querySelectorAll("h2").forEach((h2) => {
  const wrapper = document.createElement("div");
  h2.parentNode.insertBefore(wrapper, h2);

  let node = h2;
  do {
    // Read the next sibling before moving the node: once the node is inside
    // the wrapper, it no longer has a next sibling.
    const next = node.nextElementSibling;
    wrapper.appendChild(node);
    node = next;
  } while (node && node.tagName !== "H2");
});

const processedHtml = document.body.innerHTML;

xray(processedHtml, {
  pageTitle: "h1 | trim",
  sections: xray("div", [{
    subtitle: "h3",
    elements: ["ul li"],
  }]),
})((err, result) => {
  if (err) throw err;
  console.log(result);
});
This approach makes extensive use of DOM manipulation, which may not be ideal.
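The grouping step itself does not depend on the DOM. Here is a minimal sketch of that step on its own, assuming the article's children have already been flattened into { tag, text } records. That record shape and the groupSections helper are hypothetical and are not part of x-ray or jsdom:

```javascript
// Hypothetical input shape: the <article>'s elements flattened in document
// order into { tag, text } records (each <li> appears as its own record).
function groupSections(nodes) {
  const sections = [];
  let current = null;

  for (const node of nodes) {
    if (node.tag === "h2") {
      // Every <h2> starts a new section.
      current = { subtitle: undefined, elements: [] };
      sections.push(current);
    } else if (current === null) {
      // Ignore anything that appears before the first <h2>.
      continue;
    } else if (node.tag === "h3") {
      current.subtitle = node.text.trim();
    } else if (node.tag === "li") {
      current.elements.push(node.text.trim());
    }
  }

  return sections;
}

// Records mirroring the example page from the question.
const nodes = [
  { tag: "h2", text: "Title 1" },
  { tag: "h3", text: "Subtitle 1" },
  { tag: "li", text: "Element 1\n" },
  { tag: "li", text: "Element 2\n" },
  { tag: "li", text: "Element 3\n" },
  { tag: "h2", text: "Title 2" },
  { tag: "h3", text: "Subtitle 2" },
  { tag: "h2", text: "Title 3" },
  { tag: "h3", text: "Subtitle 3" },
  { tag: "li", text: "Element 1\n" },
  { tag: "li", text: "Element 2\n" },
  { tag: "li", text: "Element 3\n" },
];

const sections = groupSections(nodes);
console.log(JSON.stringify(sections, null, 2));
```

A section without a list ends up with an empty elements array, which matches the "any falsy value or empty array" expectation in the question.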
Answer 2
Score: 0
A potential approach is to scope on h2 and use adjacent-sibling selectors relative to it. I would also suggest adding some more code around it to support the approach:
xray(url, {
  pageTitle: "h1 | trim",
  sections: xray("h2", [{
    subtitle: "+ h3",
    elements: ['+ h3 + ul li']
  }])
});