Why does len on x/net/html Token().Attr return a non-zero value for an empty slice here?

Question

I am using the golang.org/x/net/html package in Go.
Here's the code to reproduce the issue:

package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/html"
)

const url = "https://google.com"

func main() {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != 200 {
		log.Fatalf("Status code error: %d %s", resp.StatusCode, resp.Status)
	}

	h := html.NewTokenizer(resp.Body)

	for {
		if h.Next() == html.ErrorToken {
			break
		}

		l := len(h.Token().Attr)

		if l != 0 {
			fmt.Println("=======")
			fmt.Println("Length", l)            // greater than 0
			fmt.Println("Attr", h.Token().Attr) // always empty
		}
	}
}

Here's what the output looks like:

=======
Length 2
Attr []
typeof Attr []html.Attribute
=======
Length 8
Attr []
typeof Attr []html.Attribute
=======
Length 1
Attr []
typeof Attr []html.Attribute
=======
Length 1
Attr []
typeof Attr []html.Attribute

go version

go version go1.17.7 linux/amd64

Why does Go report a non-zero length for h.Token().Attr here when h.Token().Attr prints as empty?

P.S. Saving the result of h.Token().Attr in a variable and using that variable for both len and printing makes everything work fine.

Code:

package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/html"
)

const url = "https://google.com"

func main() {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != 200 {
		log.Fatalf("Status code error: %d %s", resp.StatusCode, resp.Status)
	}

	h := html.NewTokenizer(resp.Body)

	for {
		if h.Next() == html.ErrorToken {
			break
		}

		attrs := h.Token().Attr // save the output here and use it everywhere else
		l := len(attrs)

		if l != 0 {
			fmt.Println("=======")
			fmt.Println("Length", l)
			fmt.Println("Attr", attrs)
		}
	}
}

Output:

Length 3
Attr [{ value AJiK0e8AAAAAYtZT7PXDBRBC2BJawIxezEfmIL6Aw5Uy} { name iflsig} { type hidden}]
=======
Length 4
Attr [{ class fl sblc} { align left} { nowrap } { width 25%}]
=======
Length 1
Attr [{ href /advanced_search?hl=en-IN&authuser=0}]
=======
Length 4
Attr [{ id gbv} { name gbv} { type hidden} { value 1}]

Answer 1

Score: 6

Tokenizer has a kind of funny interface, and you aren't allowed to call Token() more than once between calls to Next(). As the doc says:

> In EBNF notation, the valid call sequence per token is:
> Next {Raw} [ Token | Text | TagName {TagAttr} ]

Which is to say: after calling Next() you may call Raw() zero or more times; then you can either:

  • Call Token() once,
  • Call Text() once,
  • Call TagName() once followed by TagAttr() zero or more times (presumably, either not at all because you don't care about the attributes, or enough times to retrieve all of the attributes).
  • Or do nothing (maybe you're skipping tokens).

The results of calling things out of sequence are undefined, because the methods modify internal state — they're not pure accessors. In your first snippet you call Token() multiple times between calls to Next(), so the result is invalid. All of the attributes are consumed by the first call, and aren't returned by the later ones.

Answer 2

Score: 0

(*Tokenizer).Token() builds a new Token, with a new Attr slice, on every call. On a second call for the same token, the tokenizer's internal start and end offsets are already equal (line 1145 of the tokenizer source), so the loop that collects attributes is never entered and Attr comes back empty.
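The effect can be reproduced without the real package. The sketch below is a toy model only: toyTokenizer and its fields are invented here and are not how html.Tokenizer is actually implemented. It shows how a method that advances an internal cursor returns data on the first call and nothing on the second.

```go
package main

import "fmt"

// toyTokenizer is a hypothetical stand-in, not the real html.Tokenizer.
type toyTokenizer struct {
	attrs []string // attributes of the current tag
	next  int      // index of the next attribute not yet handed out
}

// Token returns every attribute that has not been returned yet and
// advances the cursor, so it is not a pure accessor.
func (z *toyTokenizer) Token() []string {
	var out []string
	for z.next < len(z.attrs) {
		out = append(out, z.attrs[z.next])
		z.next++
	}
	return out
}

func main() {
	z := &toyTokenizer{attrs: []string{"id", "class"}}
	fmt.Println("Length", len(z.Token())) // first call sees both attributes
	fmt.Println("Attr", z.Token())        // second call finds nothing left
}
```

This mirrors the pattern in the question: len is evaluated on the first call's result, while the printed slice comes from a second call made after the cursor has already reached the end.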

Answer 3

Score: -1

It's not empty, you just need to loop over it and view the values.

package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	body := `
<html lang="en">
<body onload="fool()">
</body>
</html>
`
	h := html.NewTokenizer(strings.NewReader(body))

	for {
		if h.Next() == html.ErrorToken {
			break
		}

		attr := h.Token().Attr
		l := len(attr)

		if l != 0 {
			fmt.Println("=======")
			fmt.Println("Length", l) // greater than 0
			for i, a := range attr {
				fmt.Printf("Attr %d %v\n", i, a)
			}
		}
	}
}

Playground: https://go.dev/play/p/lzEdppsURl0

huangapple
  • Posted on July 19, 2022 at 13:40:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/73031647.html