June 27, 2023, 20:53:43

How does one extract text content from HTML code when the text contains whitespace but neither tabs nor line breaks?

Question


How can I find and select, in any HTML, only the text that contains spaces but no tabs or line breaks, without selecting the tags themselves?

I managed to do the opposite, but not what I describe above. Sample input:

<html>
<body>
<h1> text1</h1>
<p>text2</p>
text14
<p>   text3   </p>
text2
</body>
</html>

This is what I got:

<[^>]+>(.+?)<\/[^>]+>

Answer 1

Score: 1


Assuming you want

["text1", "text2", "text3"]

and want to ignore the text nodes that contain tabs or newlines, you can use DOMParser's parseFromString together with createNodeIterator, like this:

const htmlStr = `<html>
    <body>
      <h1> text1</h1>
      <p>text2</p>
      text14 is ignored due to newlines
      <p> text3 </p>
      text2
    </body>
    </html>`;
const parser = new DOMParser();
const dom = parser.parseFromString(htmlStr, "text/html");
// Iterate over the text nodes only; element and comment nodes are skipped.
const nodeIterator = document.createNodeIterator(dom.documentElement, NodeFilter.SHOW_TEXT);
const textArr = [];
let currentNode;
while ((currentNode = nodeIterator.nextNode())) {
  const text = currentNode.textContent;
  const textHasTabsOrNewlines = text.match(/[\t\n]/);
  console.log("text:>", text, "<", textHasTabsOrNewlines);
  const textOnly = text.trim();
  // Keep non-empty text that contains neither a tab nor a newline.
  if (textOnly !== "" && !textHasTabsOrNewlines) textArr.push(textOnly);
}
console.log(textArr);
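
The per-node check used in the loop above can also be pulled out into a plain function, which makes it possible to try the rule without a browser or a DOM. This is only a sketch: the name `extractCleanText` is hypothetical and not part of the answer, the sample strings imitate the text nodes of the example HTML, and rejecting `\r` as well is an added assumption for Windows-style line endings.

```javascript
// Hypothetical helper (the name is an assumption, not from the answer).
// Returns the trimmed text if it should be kept, otherwise null.
function extractCleanText(text) {
  // Reject text containing a tab or a line break.
  // Including \r is an assumption that goes beyond the original check.
  if (/[\t\n\r]/.test(text)) return null;
  const trimmed = text.trim();
  // Reject text that consists of white space only.
  return trimmed === "" ? null : trimmed;
}

// Strings that imitate the text nodes of the example document.
const samples = [" text1", "text2", "\n      text14\n      ", " text3 ", "   "];
const kept = samples.map(extractCleanText).filter((t) => t !== null);
console.log(kept); // [ 'text1', 'text2', 'text3' ]
```

Keeping the rule in one function means the same check could be reused inside the `while` loop, or passed around as a filter, without repeating the regex.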

Answer 2

Score: 1


The requirements, in the OP's own words:

> "how to find and select in any html only text with spaces but without tabs and line breaks and not select the tags themselves"

The approach needs to combine several techniques, because neither a purely regex-based approach nor a plain DOMParser and NodeIterator approach returns the OP's expected result on its own.

A NodeIterator instance created with an additional filter function, where the filter applies two regex-based tests, does the job:

const code =
`<html>
  <body>
    <h1>foo</h1>  <!-- no pick ... not a single white space at all -->
    <p>  bar </p> <!-- pick ... simple spaces only -->
	baz           <!-- no pick ... leading tab and new line -->
    <p>bizz</p>   <!-- no pick ... not a single white space at all -->
    buzz          <!-- no pick ... leading simple spaces and new line -->
    <p>booz  </p> <!-- pick ... simple spaces only -->
  </body>
</html>`;
const dom = (new DOMParser)
  .parseFromString(code, 'text/html');
const textNodeIterator =
  document.createNodeIterator(
    dom.documentElement,
    NodeFilter.SHOW_TEXT,
    node => (
      (node.textContent.trim() !== '') && // - content other than just white space(s)
      (/\s+/).test(node.textContent) &&   // - content with any kind of white space
      !(/[\t\n]+/).test(node.textContent) // - content without tabs and new lines
    )
    ? NodeFilter.FILTER_ACCEPT
    : NodeFilter.FILTER_REJECT
  );
const textContentList = [];
let textNode;
while (textNode = textNodeIterator.nextNode()) {
  textContentList.push(textNode.textContent)
}
console.log({ textContentList });
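
The three conditions inside the filter can be looked at in isolation as a plain predicate on strings. The sketch below is an illustration only: `hasOnlySimpleSpaces` is a hypothetical name, and the candidate strings are hand-written stand-ins for some of the text nodes in the example markup, not values read from a real DOM.

```javascript
// Hypothetical predicate (the name is an assumption) that mirrors the
// three checks used in the NodeIterator filter above.
function hasOnlySimpleSpaces(text) {
  return (
    text.trim() !== "" &&  // content other than just white space
    /\s/.test(text) &&     // at least one white-space character
    !/[\t\n]/.test(text)   // but neither a tab nor a new line
  );
}

// Hand-written stand-ins for text nodes of the example markup.
const candidates = ["foo", "  bar ", "\n\tbaz           ", "bizz", "booz  "];
const picked = candidates.filter(hasOnlySimpleSpaces);
console.log(picked); // [ '  bar ', 'booz  ' ]
```

Note that `\s` also matches characters such as `\r` or a non-breaking space, so text whose only white space is one of those would pass this check; whether that is acceptable depends on the input.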
