How does one extract text content from HTML that contains spaces but neither tabs nor line breaks?
Question
How can I find and select, in any HTML, only the text that contains spaces but no tabs or line breaks, without selecting the tags themselves?
I managed to do the opposite, but not what is described above. Sample input:
<html>
<body>
<h1> text1</h1>
<p>text2</p>
text14
<p> text3 </p>
text2
</body>
</html>
This is what I got:
<[^>]+>(.+?)<\/[^>]+>
Answer 1
Score: 1
Assuming you want
["text1", "text2", "text3"]
and want to ignore the text nodes that contain tabs or newlines,
you can use DOMParser's parseFromString together with createNodeIterator
and do this:
<!-- begin snippet: js hide: false console: true babel: false -->
<!-- language: lang-js -->
const htmlStr = `<html>
<body>
<h1> text1</h1>
<p>text2</p>
text14 is ignored due to newlines
<p> text3 </p>
text2
</body>
</html>`
const parser = new DOMParser();
const dom = parser.parseFromString(htmlStr, "text/html");

// Visit text nodes only; element nodes (the tags) are never returned
const nodeIterator = document.createNodeIterator(dom.documentElement, NodeFilter.SHOW_TEXT);
const textArr = [];
let currentNode;
while ((currentNode = nodeIterator.nextNode())) {
  const text = currentNode.textContent;
  const textHasTabsOrNewlines = /[\t\n]/.test(text);
  console.log("text:>", text, "<", textHasTabsOrNewlines); // debug output
  const textOnly = text.trim();
  // Keep non-empty text that contains neither a tab nor a newline
  if (textOnly !== "" && !textHasTabsOrNewlines) textArr.push(textOnly);
}
console.log(textArr);
<!-- end snippet -->
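The per-node check from the loop above does not depend on the DOM, so it can be pulled out into a plain function and tried outside the browser. This is only a sketch: the name collectSingleLineTexts is hypothetical and not part of the answer, and the input array stands in for the textContent values of the text nodes in the sample.

```javascript
// Hypothetical helper (not from the original answer): applies the same
// check as the loop above to plain strings instead of DOM text nodes.
function collectSingleLineTexts(texts) {
  const result = [];
  for (const text of texts) {
    const hasTabsOrNewlines = /[\t\n]/.test(text);
    const trimmed = text.trim();
    // Keep non-empty text that contains neither a tab nor a newline
    if (trimmed !== "" && !hasTabsOrNewlines) result.push(trimmed);
  }
  return result;
}

// Assumed stand-ins for the text nodes' textContent in the sample HTML
const sampleTexts = [" text1", "text2", "\ntext14\n", " text3 ", "\ntext2\n\n"];
console.log(collectSingleLineTexts(sampleTexts));
```

Note that this check also keeps "text2" from the paragraph, which contains no space at all; it only excludes text with tabs or newlines.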
Answer 2
Score: 1
The requirements, in the OP's own words:
> "how to find and select in any html only text with spaces but without tabs and line breaks and not select the tags themselves"
The approach needs to combine several techniques, because neither a purely regex-based approach nor a plain DOMParser and NodeIterator approach returns the OP's expected result on its own.
A NodeIterator instance with an additional filter, which in turn uses two regex-based tests, does the job:
<!-- begin snippet: js hide: false console: true babel: false -->
<!-- language: lang-js -->
const code =
`<html>
<body>
<h1>foo</h1> <!-- no pick ... not a single white space at all -->
<p> bar </p> <!-- pick... ... simple spaces only -->
baz <!-- no pick ... leading tab and new line -->
<p>bizz</p> <!-- no pick ... not a single white space at all -->
buzz <!-- no pick ... leading simple spaces and new line -->
<p>booz </p> <!-- pick... ... simple spaces only -->
</body>
</html>`;
const dom = (new DOMParser)
  .parseFromString(code, 'text/html');

const textNodeIterator =
  document.createNodeIterator(
    dom.documentElement,
    NodeFilter.SHOW_TEXT,
    node => (
      (node.textContent.trim() !== '') &&    // - content other than just white space(s)
      (/\s+/).test(node.textContent) &&      // - content with any kind of white space
      !(/[\t\n]+/).test(node.textContent)    // - content without tabs and new lines
    )
      ? NodeFilter.FILTER_ACCEPT
      : NodeFilter.FILTER_REJECT
  );
const textContentList = [];
let textNode;

while (textNode = textNodeIterator.nextNode()) {
  textContentList.push(textNode.textContent);
}
console.log({ textContentList });
<!-- end snippet -->
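One caveat with the filter above: \s matches every kind of whitespace, so text containing a carriage return or a non-breaking space would still pass as long as it has no tab or newline. If "simple spaces only" is meant strictly (an assumption about the OP's intent), a tighter predicate could look like the sketch below. The name hasOnlySimpleSpaces is hypothetical and not part of the answer.

```javascript
// Hypothetical stricter predicate (not from the original answer).
// Accepts text that has visible content, contains at least one plain
// space (U+0020), and contains no other whitespace character.
function hasOnlySimpleSpaces(text) {
  return (
    text.trim() !== "" &&   // content other than just whitespace
    text.includes(" ") &&   // at least one plain space
    !/[^\S ]/.test(text)    // no whitespace other than a plain space
  );
}

// Assumed stand-ins for textContent values
const samples = ["foo", " bar ", "\nbaz ", "booz ", "a\rb c"];
console.log(samples.filter(hasOnlySimpleSpaces));
```

The character class [^\S ] reads as "not a non-whitespace character and not a space", which leaves exactly the whitespace characters other than the plain space. In the browser, the predicate could replace the three conditions in the NodeIterator filter by calling it with node.textContent.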