How does one extract text content that contains spaces, but neither tabs nor line breaks, from HTML code?

Question

How can I find and select, in any HTML, only the text that contains spaces but no tabs or line breaks, without selecting the tags themselves?

I managed the opposite case, but not the case described above.

<html>
<body>
<h1> text1</h1>
<p>text2</p>
text14
<p>   text3   </p>
text2
</body>
</html>

This is what I got:

<[^>]+>(.+?)<\/[^>]+>
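For reference, here is a small sketch of what that pattern captures on the sample above. The variable names (`sampleHtml`, `tagPairPattern`, `captures`) are made up for illustration, and the `g` flag is an assumption, since the question does not say how the pattern was applied.

```javascript
// Sample markup from the question, as a plain string (no DOM needed here).
const sampleHtml = [
  "<html>",
  "<body>",
  "<h1> text1</h1>",
  "<p>text2</p>",
  "text14",
  "<p>   text3   </p>",
  "text2",
  "</body>",
  "</html>",
].join("\n");

// The pattern from the question; the global flag is an assumption.
const tagPairPattern = /<[^>]+>(.+?)<\/[^>]+>/g;

// Collect the first capture group of every match.
const captures = [...sampleHtml.matchAll(tagPairPattern)].map((m) => m[1]);

console.log(captures); // [" text1", "text2", "   text3   "]
```

The bare `text14` and `text2` lines are skipped only because `.` does not match a line break, so the pattern cannot bridge from a tag to text on the next line. The pattern would still capture an element whose text contains a tab, because `.` does match a tab, and the captures are returned untrimmed.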

Answer 1

Score: 1

Assuming you want

["text1", "text2", "text3"]

and want to ignore text nodes that contain tabs or newlines, you can use parseFromString and createNodeIterator like this:

const htmlStr = `<html>
    <body>
      <h1> text1</h1>
      <p>text2</p>
      text14 is ignored due to newlines
      <p> text3 </p>
      text2
    </body>
    </html>`;

// Parse the string into a separate document
const parser = new DOMParser();
const dom = parser.parseFromString(htmlStr, "text/html");

// Walk text nodes only, so the tags themselves are never selected
let currentNode,
  nodeIterator = document.createNodeIterator(dom.documentElement, NodeFilter.SHOW_TEXT);

const textArr = [];
while ((currentNode = nodeIterator.nextNode())) {
  const text = currentNode.textContent;
  // null when the node contains neither a tab nor a newline
  const textHasTabsOrNewlines = text.match(/[\t\n]/);
  console.log("text:>", currentNode.textContent, "<", textHasTabsOrNewlines);
  const textOnly = text.trim();
  // Keep non-blank nodes that are free of tabs and newlines
  if (textOnly !== "" && !textHasTabsOrNewlines) textArr.push(textOnly);
}
console.log(textArr);
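The check inside the loop can be pulled out into a small helper so that it is easy to try without a browser. The sketch below is one way to do that. The function name `pickSingleLineText` is hypothetical, the list of raw text node contents is written out by hand as an approximation of what a parser produces for the sample, and treating carriage returns (`\r`) as line breaks is an added assumption that the answer's `/[\t\n]/` test does not make.

```javascript
// Returns the trimmed text when the raw text node content is non-blank and
// contains no tab, line feed, or carriage return; otherwise returns null.
// Rejecting "\r" as well is an assumption added for Windows line endings.
function pickSingleLineText(rawText) {
  if (/[\t\n\r]/.test(rawText)) return null;
  const trimmed = rawText.trim();
  return trimmed === "" ? null : trimmed;
}

// Hand-written approximation of the text node contents for the sample,
// so the sketch runs outside a browser.
const rawTextNodes = [
  "\n      ",
  " text1",
  "\n      ",
  "text2",
  "\n      text14 is ignored due to newlines\n      ",
  " text3 ",
  "\n      text2\n    \n    ",
];

const picked = rawTextNodes
  .map(pickSingleLineText)
  .filter((text) => text !== null);

console.log(picked); // ["text1", "text2", "text3"]
```

With such a helper, the loop body from the answer would reduce to calling `pickSingleLineText(currentNode.textContent)` and pushing every non-null result.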

Answer 2

Score: 1

The requirement, in the OP's own words:

> "how to find and select in any html only text with spaces but without tabs and line breaks and not select the tags themselves"

The approach needs more than one part, because neither a pure regex-based approach nor a plain DOMParser and NodeIterator based approach returns the OP's expected result.

A NodeIterator instance with an additional filter, where the filter applies two regex-based tests, does the job:

const code =
`<html>
  <body>
    <h1>foo</h1>  <!-- no pick ... not a single white space at all -->
    <p>  bar </p> <!-- pick ... simple spaces only -->
\tbaz           <!-- no pick ... leading tab and new line -->
    <p>bizz</p>   <!-- no pick ... not a single white space at all -->
    buzz          <!-- no pick ... leading simple spaces and new line -->
    <p>booz  </p> <!-- pick ... simple spaces only -->
  </body>
</html>`;

const dom = new DOMParser()
  .parseFromString(code, 'text/html');

const textNodeIterator =
  document.createNodeIterator(
    dom.documentElement,
    NodeFilter.SHOW_TEXT,
    node => (
      (node.textContent.trim() !== '') && // - content other than just white space(s)
      (/\s+/).test(node.textContent) &&   // - content with any kind of white space
      !(/[\t\n]+/).test(node.textContent) // - content without tabs and new lines
    )
    ? NodeFilter.FILTER_ACCEPT
    : NodeFilter.FILTER_REJECT
  );

const textContentList = [];
let textNode;

while ((textNode = textNodeIterator.nextNode())) {
  textContentList.push(textNode.textContent);
}
console.log({ textContentList });
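The three tests in the filter can also be written as a standalone predicate, which makes them easy to check against plain strings. The sketch below is a stricter variant, not the answer's exact logic: the name `hasOnlySimpleSpaces` is hypothetical, and it assumes that "spaces" means the plain space character (U+0020) only, so any other whitespace (carriage return, non-breaking space, and so on) disqualifies the text.

```javascript
// True when the text has visible content, at least one plain space (U+0020),
// and no other whitespace character (tab, line break, non-breaking space, ...).
// This is stricter than the answer's filter, which only rules out \t and \n.
function hasOnlySimpleSpaces(text) {
  const hasContent = text.trim() !== "";
  const hasPlainSpace = text.includes(" ");
  // [^\S ] matches any whitespace character except the plain space.
  const hasOtherWhitespace = /[^\S ]/.test(text);
  return hasContent && hasPlainSpace && !hasOtherWhitespace;
}

const samples = ["foo", "  bar ", "\n\tbaz   ", "booz  ", "a\r\nb c"];
const results = samples.map(hasOnlySimpleSpaces);

console.log(results); // [false, true, false, true, false]
```

In a browser, the predicate could be used as the iterator's filter, for example `node => hasOnlySimpleSpaces(node.textContent) ? NodeFilter.FILTER_ACCEPT : NodeFilter.FILTER_REJECT`. For a NodeIterator, `FILTER_REJECT` behaves the same as `FILTER_SKIP`, so either constant works there.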

huangapple
  • Posted on June 27, 2023 at 20:53:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76565071.html