Create an array of objects separating what is text and what is markup

Question

Starting from some HTML, I have to build an array of objects that separates what is text from what is markup, like this:

[
	{"text": "A "},
	{"markup": "<b>"},
	{"text": "test"},
	{"markup": "</b>"}
]

The HTML code that I'm using is this one:

<h2 id="mcetoc_1h1m1ll27l">Lorem ipsum dolor sit amet, consectetur adipiscing elit.</h2>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris at tincidunt lectus.<a href="https://www.sadasdas.es" aria-invalid="true">tr</a><a title="titulo" href="https://www.sadasdas.es" aria-invalid="true">adsf afjdasi k</a><a title="titlee" href="https://www.sadasdas.es" aria-invalid="true">asdsssssssssssss</a><a href="https://www.sadasdas.es" aria-invalid="true">s</a></p>
<p><a href="https://www.sadasdas.es" aria-invalid="true">Lorem Ipsum</a></p>

To avoid using regular expressions, I first create an array with all the nodes and then loop over it, checking which nodes are elements and which are text nodes.

Currently I'm stuck on closing tags when an element node has child text nodes followed by element nodes (and I'm not sure whether I'm overcomplicating things):

<p>Lorem ipsum dolor sit...<a href="..." aria-invalid="true">tr</a><a title="..." href="..." aria-invalid="true">...</a><a title="..." href="..." aria-invalid="true">...</a><a href="..." aria-invalid="true">...</a></p>

For this paragraph, my output looks like this:

{markup: '<p>'}
{text: 'Lorem ipsum dolor sit...'}
{markup: '</p>'}
{markup: &#39;&lt;a&gt;&#39;}
//...

As you can see, the closing tag appears right after the text node, before the child elements. I've managed it for element nodes followed by other element nodes, but this case still escapes me.

This is what I've done so far (codepen):

const obj = {
	annotation: []
};

const nodelist = (() => {
	const res = [];
	const tw = document.createTreeWalker(document.body);

	while (tw.nextNode()) {
		res.push(tw.currentNode)
	}

	return res;
})();

console.log(nodelist);

const nodeHasParents = (node) => node.parentNode.nodeName !== 'BODY';
const isTextNode = (node) => node.nodeType === Node.TEXT_NODE;
const isElementNode = (node) => node.nodeType === Node.ELEMENT_NODE;

const GetNextNodeElements = (i) => {
	let n = i + 1;
	let res = [];

	while (nodelist[n] && isElementNode(nodelist[n])) {
		res.push(nodelist[n]);
		n++;
	}

	return res;
}

const GetNextTextNode = (i) => {
	let n = i + 1;

	for (let n = i; n < nodelist.length; n++) {
		if (isTextNode(nodelist[n])) return nodelist[n];
	}
}


for (let i = 0; i < nodelist.length; i++) {
	let node = nodelist[i];
	let opentags = '';
	let closetags = '';

	if (isTextNode(node) && !nodeHasParents(node)) {
		obj.annotation.push({"text": node.textContent});
	}
	else if (isElementNode(node)) {
		opentags += `<${node.nodeName.toLowerCase()}>`;

		const currentNode = node;
		const nextNodeElements = GetNextNodeElements(i);

		if (nextNodeElements) {
			nextNodeElements.forEach(node => opentags += node.outerHTML.replace(node.textContent, '').replace(`</${node.nodeName.toLowerCase()}>`, ''));
			nextNodeElements.reverse();
			nextNodeElements.forEach(node => closetags += `</${node.nodeName.toLowerCase()}>`);

			i = i + nextNodeElements.length;
			node = nodelist[i];
		}

		if (!!closetags.length) {
			closetags = `</${currentNode.nodeName.toLowerCase()}>` + closetags;
		}
		else closetags += `</${currentNode.nodeName.toLowerCase()}>`

		obj.annotation.push({"markup": opentags});
		obj.annotation.push({"text": GetNextTextNode(i)?.textContent});
		obj.annotation.push({"markup": closetags});
	}
}

console.log(obj.annotation);

Answer 1

Score: 1

Recursion makes it easier:

const tree = [];

walk(document.body);

console.log(tree);

// Depth-first walk: emit the opening tag, then the children, then the closing tag.
function walk(parent) {
    for (const elem of parent.childNodes) {
        if (elem.nodeType === Node.TEXT_NODE) {
            tree.push({ text: elem.textContent });
        } else if (elem.nodeType === Node.ELEMENT_NODE) {
            tree.push({ markup: `<${elem.tagName.toLowerCase()}>` });
            // Recurse so that the children land between the two tags.
            elem.hasChildNodes() && walk(elem);
            tree.push({ markup: `</${elem.tagName.toLowerCase()}>` });
        }
    }
}

The snippet's HTML:

<h2 id="mcetoc_1h1m1ll27l">Lorem ipsum dolor sit amet, consectetur adipiscing elit.</h2>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris at tincidunt lectus.<a href="https://www.sadasdas.es" aria-invalid="true">tr</a><a title="titulo" href="https://www.sadasdas.es" aria-invalid="true">adsf afjdasi k</a><a title="titlee" href="https://www.sadasdas.es" aria-invalid="true">asdsssssssssssss</a><a href="https://www.sadasdas.es" aria-invalid="true">s</a></p>
<p><a href="https://www.sadasdas.es" aria-invalid="true">Lorem Ipsum</a></p>
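
The recursive walk in this answer emits bare tag names, so attributes such as href are dropped and a void element like <br> would get a closing tag. If the markup entries need to keep attributes, the following is a minimal sketch of one possible variant, not the answerer's code. The names annotate, openTag, VOID_TAGS, text, el and fakeBody are all hypothetical, and the plain-object tree at the bottom is only a stand-in so the sketch can run without a browser; in a page you would call annotate(document.body) instead.

```javascript
// Standard DOM nodeType values, written out so the sketch also runs outside a browser.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Void elements have no closing tag in HTML.
const VOID_TAGS = new Set(['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
    'input', 'link', 'meta', 'source', 'track', 'wbr']);

// Hypothetical helper: rebuild the opening tag, attributes included.
function openTag(elem) {
    const attrs = Array.from(elem.attributes || [])
        .map(a => ` ${a.name}="${a.value.replace(/&/g, '&amp;').replace(/"/g, '&quot;')}"`)
        .join('');
    return `<${elem.tagName.toLowerCase()}${attrs}>`;
}

// Same depth-first recursion, but returning the list instead of filling a global.
function annotate(parent, out = []) {
    for (const node of Array.from(parent.childNodes)) {
        if (node.nodeType === TEXT_NODE) {
            out.push({ text: node.textContent });
        } else if (node.nodeType === ELEMENT_NODE) {
            const name = node.tagName.toLowerCase();
            out.push({ markup: openTag(node) });
            if (!VOID_TAGS.has(name)) {
                annotate(node, out);
                out.push({ markup: `</${name}>` });
            }
        }
    }
    return out;
}

// Plain-object stand-ins for DOM nodes (assumption: only the fields used above).
const text = (s) => ({ nodeType: TEXT_NODE, textContent: s, childNodes: [] });
const el = (tagName, attributes, ...childNodes) =>
    ({ nodeType: ELEMENT_NODE, tagName, attributes, childNodes });

// Represents <body><p>A <b class="x">test</b><br></p></body>
const fakeBody = el('BODY', [],
    el('P', [],
        text('A '),
        el('B', [{ name: 'class', value: 'x' }], text('test')),
        el('BR', [])));

console.log(annotate(fakeBody));
```

Returning the array rather than pushing into a shared tree variable makes the function reusable on any subtree, at the cost of one extra parameter.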

Answer 2

Score: 0

You could find an npm package that converts XML to a JS object or to JSON. Try searching for:

  • XML to JS
  • XML to JS Object
  • XML to JSON
  • Parse XML to JS Object

Fortunately, I found an interesting library: xml-js. If you are in the browser, you could load the library from Cloudflare cdnjs, jsDelivr, or unpkg, whichever you think is best.

The same question has also been asked on Stack Overflow, which you could read for more detail.

But if you insist on doing it yourself, you will end up in compiler techniques, learning about finite automata, regular expressions, lexical analysis, and so on.
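
To give a rough idea of what the lexical-analysis route looks like, here is a minimal sketch of a two-state scanner that works on an HTML string rather than on the DOM. The function name tokenize is hypothetical, and the sketch assumes simple input: no '>' inside attribute values, and no comments, scripts, or CDATA sections. A real HTML tokenizer has many more states.

```javascript
// Minimal two-state scanner: "text" until '<', "markup" until the next '>'.
// Assumption: no '>' inside attribute values, no comments, scripts or CDATA.
function tokenize(html) {
    const tokens = [];
    let pos = 0;
    while (pos < html.length) {
        if (html[pos] === '<') {
            const end = html.indexOf('>', pos);
            if (end === -1) {
                // Unterminated tag: treat the rest of the input as text.
                tokens.push({ text: html.slice(pos) });
                break;
            }
            tokens.push({ markup: html.slice(pos, end + 1) });
            pos = end + 1;
        } else {
            let end = html.indexOf('<', pos);
            if (end === -1) end = html.length;
            tokens.push({ text: html.slice(pos, end) });
            pos = end;
        }
    }
    return tokens;
}

console.log(tokenize('A <b>test</b>'));
// → [ { text: 'A ' }, { markup: '<b>' }, { text: 'test' }, { markup: '</b>' } ]
```

Because it copies each tag verbatim, this keeps attributes for free, but it does no validation and will split incorrectly on input that breaks the stated assumptions.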

The last thing you could try is parsing the XML into a DOM, though I think this would be heavy. You may find this interesting: https://www.w3schools.com/xml/dom_intro.asp

Also, please do not reinvent the wheel. There should already be a library or project similar to what you are trying to achieve.

huangapple
  • Posted on June 8, 2023 at 14:46:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76429248.html