Finding the number of opening/closing HTML tag pairs in a string

Question

I am trying to figure out the best way to find the number of valid HTML tags in a string.

The assumption is that a tag is valid only if it has both an opening and a closing tag.

This is an example of a test case:

INPUT

"html": "<html><head></head><body><div><div></div></div>"

OUTPUT

"validTags": 3

Answer 1

Score: 2

If you need to parse HTML

Do not do it yourself. There is no need to reinvent the wheel. There is a plethora of libraries for parsing HTML. Use the proper tool for the proper job.

Concentrate your efforts on the rest of your project. Sure, you could implement your own function that parses a string, looks for < and >, and acts appropriately. But HTML might be slightly more complex than you imagine, or you might end up needing more HTML parsing than just counting tags.

Maybe in the future you'll want to count <br/> and <br /> as well. Or you'll want to find the depth of the HTML tree.

Maybe your homemade code doesn't account for all possible combinations of escaped characters, nested tags, etc. How many correct tags are there in the following string?
<a><b><c><d e><f g="<h></h>"><i j="<k>" l="</k>"></i></f></e d></b></c></ a >
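
To make the point concrete, here is a small sketch of what a naive scanner does with the string above. It is only an illustration: the class name NaiveTagScanner, the TAG pattern and the scan method are made up for this example and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveTagScanner {
    // Naive assumption: anything shaped like <name ...> or </name ...> is a tag.
    private static final Pattern TAG =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^<>]*>");

    // The tricky string from the answer, with the inner quotes escaped for Java.
    static final String TRICKY =
            "<a><b><c><d e><f g=\"<h></h>\"><i j=\"<k>\" l=\"</k>\"></i></f></e d></b></c></ a >";

    // Returns the names of everything the naive pattern takes for a tag.
    static List<String> scan(String s) {
        List<String> names = new ArrayList<>();
        Matcher m = TAG.matcher(s);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }
}
```

Calling NaiveTagScanner.scan(NaiveTagScanner.TRICKY) reports h and k as tags even though they only occur inside quoted attribute values, and it skips the opening <f ...> and <i ...> tags because their attribute values contain a < character. A real HTML parser handles quoting rules like these for you.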

In a comment, user dbl linked to a similar question with links to libraries: How to validate HTML from Java?

If you want to count open-closed tag pairs as a learning project

Here is a proposed algorithm in pseudocode, as a recursive function:

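function count_tags(s):
  tag, remainder = find_next_tag(s)
  if tag is none                 // base case: no opening tag left in s
    return 0
  found, inside, after = find_closing_tag(tag, remainder)
  if (found)
    return 1 + count_tags(inside) + count_tags(after)
  else
    // no closing tag: inside is the whole remainder
    return count_tags(inside)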

Examples

  • on the string hello <a>world<c></c></a><b></b>, we will get:
    tag = "<a>"
    remainder = "world<c></c></a><b></b>"
    found = true
    inside = "world<c></c>"
    after = "<b></b>"
    return 1 + count_tags("world<c></c>") + count_tags("<b></b>")
  • on the string <html><head></head>:
    tag = "<html>"
    remainder = "<head></head>"
    found = false
    inside = "<head></head>"
    after = ""
    return count_tags("<head></head>")
  • on the string <a><b></a></b>:
    tag = "<a>"
    remainder = "<b></a></b>"
    found = true
    inside = "<b>"
    after = "</b>"
    return 1 + count_tags("<b>") + count_tags("</b>")
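
As a rough illustration, the recursion could be sketched in Java as follows. This is one possible reading of the pseudocode, not a definitive implementation: the class name TagCounter and the two regular expressions are assumptions made for this sketch. It also assumes that the closing tag is matched with nesting of same-named tags taken into account, so that <div><div></div></div> counts as two pairs. Quoted attribute values containing < or > and self-closing tags such as <br/> are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagCounter {
    // An opening tag such as <div> or <a href="x">; group 1 is the tag name.
    private static final Pattern OPEN_TAG =
            Pattern.compile("<([A-Za-z][A-Za-z0-9]*)\\b[^<>]*>");

    static int countTags(String s) {
        // find_next_tag: locate the first opening tag in s.
        Matcher open = OPEN_TAG.matcher(s);
        if (!open.find()) {
            return 0; // base case: no opening tag left
        }
        String name = open.group(1);
        String remainder = s.substring(open.end());

        // find_closing_tag: walk over tags with the same name and track the
        // nesting depth, so that the matching closing tag is found.
        Pattern sameName = Pattern.compile(
                "<(/?)" + Pattern.quote(name) + "\\b[^<>]*>",
                Pattern.CASE_INSENSITIVE);
        Matcher m = sameName.matcher(remainder);
        int depth = 1;
        while (m.find()) {
            depth += m.group(1).isEmpty() ? 1 : -1;
            if (depth == 0) {
                String inside = remainder.substring(0, m.start());
                String after = remainder.substring(m.end());
                return 1 + countTags(inside) + countTags(after);
            }
        }
        // No closing tag found: ignore this tag and keep counting in the rest.
        return countTags(remainder);
    }
}
```

With this sketch, TagCounter.countTags("<html><head></head><body><div><div></div></div>") returns 3, which matches the test case in the question, and the mis-nested string <a><b></a></b> yields 1 as in the last example.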

Answer 2

Score: 0

I wrote a function that would do exactly this.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    static int checkValidTags(String html, String[] openTags, String[] closeTags) {
        // openTags and closeTags must have the same length.
        // This function keeps track of all opening tags and counts a tag
        // as valid once its closing tag is found.
        // It can even detect opening tags that carry extra attributes.
        Map<Character, Integer> open = new HashMap<>();
        Map<Character, Integer> close = new HashMap<>();
        // Each tag is replaced by a single placeholder character, starting at code 1.
        int startChar = 1;
        for (int i = 0; i < openTags.length; i++) {
            open.put((char) startChar, i);
            close.put((char) (startChar + 1), i);
            // replace() matches literally; replaceAll() would treat the tag as a regex.
            html = html.replace(openTags[i], "" + (char) startChar);
            html = html.replace(closeTags[i], "" + (char) (startChar + 1));
            startChar += 2;
        }
        // For each tag type, the positions of its still-open tags, most recent first.
        List<List<Integer>> startIndexes = new ArrayList<>();
        int validLabels = 0;
        for (int i = 0; i < openTags.length; i++) {
            startIndexes.add(new ArrayList<>());
        }
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (open.get(c) != null) {
                startIndexes.get(open.get(c)).add(0, i);
            }
            if (close.get(c) != null && !startIndexes.get(close.get(c)).isEmpty()) {
                int start = startIndexes.get(close.get(c)).get(0);
                // Drop the most recent unclosed tag of every type that was
                // opened inside the tag being closed.
                for (int k = 0; k < startIndexes.size(); k++) {
                    if (!startIndexes.get(k).isEmpty() && startIndexes.get(k).get(0) > start) {
                        startIndexes.get(k).remove(0);
                    }
                }
                startIndexes.get(close.get(c)).remove(0);
                // The original called html.replace(closed, "") here and discarded
                // the result, which had no effect, so the call is omitted.
                validLabels++;
            }
        }
        return validLabels;
    }

To use it on your example:

    String html = "<html><head></head><body><div><div></div></div>";
    int validTags = checkValidTags(html, new String[] {
        // List all the tags you are looking for here.
        // Omit the trailing '>' so that opening tags with attributes are matched too.
        "<html", "<head", "<body", "<div"
    }, new String[] {
        "</html>", "</head>", "</body>", "</div>"
    });
    System.out.println(validTags);

Output:

3
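
For comparison, similar bookkeeping can be sketched with a single stack of open tag names instead of one list per tag. This is an alternative illustration rather than the code from this answer: the class name StackTagCounter and the TAG pattern are assumptions, the tags are discovered with a regular expression instead of being passed in, and quoted attribute values containing < or > are not handled.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StackTagCounter {
    // Group 1 is "/" for a closing tag, group 2 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)\\b[^<>]*>");

    static int countValidTags(String html) {
        Deque<String> openNames = new ArrayDeque<>();
        int valid = 0;
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            if (m.group(1).isEmpty()) {
                // Opening tag: remember it until a closing tag shows up.
                openNames.push(name);
            } else if (openNames.contains(name)) {
                // Closing tag with a pending opener: discard any unclosed tags
                // opened after that opener, then pop the opener itself.
                while (!openNames.pop().equals(name)) {
                    // keep popping until the matching opener is removed
                }
                valid++;
            }
            // A closing tag without a pending opener is ignored.
        }
        return valid;
    }
}
```

On the string from the question, StackTagCounter.countValidTags returns 3. Unlike checkValidTags, this sketch discards every unclosed tag opened inside the tag being closed, not only the most recent one of each type, so the two can give different counts on badly nested input.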

huangapple
  • Posted on August 28, 2020 at 19:20:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63632799.html