Why does len on x/net/html Token().Attr return a non-zero value for an empty slice here?
Question
I am using the golang.org/x/net/html package in Go.
Here's the code to reproduce the issue:
package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/html"
)

const url = "https://google.com"

func main() {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != 200 {
		log.Fatalf("Status code error: %d %s", resp.StatusCode, resp.Status)
	}
	h := html.NewTokenizer(resp.Body)
	for {
		if h.Next() == html.ErrorToken {
			break
		}
		l := len(h.Token().Attr)
		if l != 0 {
			fmt.Println("=======")
			fmt.Println("Length", l)            // greater than 0
			fmt.Println("Attr", h.Token().Attr) // empty every time
		}
	}
}
Here's what the output looks like:
=======
Length 2
Attr []
typeof Attr []html.Attribute
=======
Length 8
Attr []
typeof Attr []html.Attribute
=======
Length 1
Attr []
typeof Attr []html.Attribute
=======
Length 1
Attr []
typeof Attr []html.Attribute
Output of go version:
go version go1.17.7 linux/amd64
Why does Go report a non-zero length for h.Token().Attr here when printing h.Token().Attr shows that it is empty?
P.S. Saving the result of h.Token().Attr in a variable and using that variable for both len and printing makes everything work fine.
Code:
package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/html"
)

const url = "https://google.com"

func main() {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != 200 {
		log.Fatalf("Status code error: %d %s", resp.StatusCode, resp.Status)
	}
	h := html.NewTokenizer(resp.Body)
	for {
		if h.Next() == html.ErrorToken {
			break
		}
		attrs := h.Token().Attr // save the output here and use it everywhere else
		l := len(attrs)
		if l != 0 {
			fmt.Println("=======")
			fmt.Println("Length", l)
			fmt.Println("Attr", attrs)
		}
	}
}
Output
Length 3
Attr [{ value AJiK0e8AAAAAYtZT7PXDBRBC2BJawIxezEfmIL6Aw5Uy} { name iflsig} { type hidden}]
=======
Length 4
Attr [{ class fl sblc} { align left} { nowrap } { width 25%}]
=======
Length 1
Attr [{ href /advanced_search?hl=en-IN&authuser=0}]
=======
Length 4
Attr [{ id gbv} { name gbv} { type hidden} { value 1}]
Answer 1
Score: 6
Tokenizer has a somewhat unusual interface: you aren't allowed to call Token() more than once between calls to Next(). As the doc says:
> In EBNF notation, the valid call sequence per token is:
> Next {Raw} [ Token | Text | TagName {TagAttr} ]
Which is to say: after calling Next() you may call Raw() zero or more times; then you can either:
- Call Token() once,
- Call Text() once,
- Call TagName() once followed by TagAttr() zero or more times (presumably, either not at all because you don't care about the attributes, or enough times to retrieve all of the attributes),
- Or do nothing (maybe you're skipping tokens).
The results of calling things out of sequence are undefined, because the methods modify internal state; they're not pure accessors. In your first snippet you call Token() multiple times between calls to Next(), so the result is invalid. All of the attributes are consumed by the first call, and aren't returned by the later ones.
Answer 2
Score: 0
(*Tokenizer).Token() returns a new Token every time, each with a freshly built []Attribute slice. On the next call to Token() for the same tag, the tokenizer's start and end positions are already the same number (line 1145 of the tokenizer source), so the loop that collects the attributes is not entered and Attr comes back empty.
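The consuming behaviour described in these answers can be mimicked with a toy type. This is only a rough model built on the standard library: toyTokenizer and its pending field are invented for illustration and are not how html.Tokenizer is actually implemented. It shows why the first call sees every attribute while a second call before the next Next sees none.

```go
package main

import "fmt"

// toyTokenizer is a hypothetical stand-in, NOT the real html.Tokenizer.
// pending holds attributes that Next has queued but Token has not yet
// handed out.
type toyTokenizer struct {
	pending []string
}

// Next pretends to parse a tag and queues its attributes.
func (t *toyTokenizer) Next(attrs ...string) {
	t.pending = attrs
}

// Token drains the queue into a fresh slice. It is not a pure accessor:
// after it returns, nothing is left for a second call.
func (t *toyTokenizer) Token() []string {
	out := make([]string, 0, len(t.pending))
	for len(t.pending) > 0 {
		out = append(out, t.pending[0])
		t.pending = t.pending[1:]
	}
	return out
}

func main() {
	var t toyTokenizer
	t.Next("id=gbv", "type=hidden")
	fmt.Println("Length", len(t.Token())) // first call: Length 2
	fmt.Println("Attr", t.Token())        // second call: Attr []
}
```

This mirrors the pattern in the question: len was applied to the result of the first call, and the print statement received the result of a second, already drained call.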
Answer 3
Score: -1
It's not empty, you just need to loop over it and view the values.
package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	body := `
<html lang="en">
<body onload="fool()">
</body>
</html>
`
	h := html.NewTokenizer(strings.NewReader(body))
	for {
		if h.Next() == html.ErrorToken {
			break
		}
		attr := h.Token().Attr
		l := len(attr)
		if l != 0 {
			fmt.Println("=======")
			fmt.Println("Length", l) // greater than 0
			for i, a := range attr {
				fmt.Printf("Attr %d %v\n", i, a)
			}
		}
	}
}
Playground: https://go.dev/play/p/lzEdppsURl0