Unexpected HTML token from html.NewTokenizer.Token()

Question

I am trying to list all the tokens found in a web page. The core is in this function:

func find_links(httpBody io.Reader) []string {
	links := make([]string, 0)
	page := html.NewTokenizer(httpBody)
	for {
		tokenType := page.Next()
		if tokenType == html.ErrorToken {
			return links
		}
		token := page.Token()
		fmt.Println("Now token is ", token)
	}
}

When I print the output, I get something like this:

Now token is  <body>
Now token is

Now token is  <header>

I don't understand what the second token is or why an extra blank line is printed.

The full code of a working example is here, although it can't run on the playground because the http package is not available there.

Answer 1

Score: 1

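The second token is a TextToken containing a newline: the whitespace between the two tags in the page source. Println prints that newline and then adds its own, which produces the blank line.

Change the print to

   fmt.Printf("Now token is %v %q\n", token.Type, token.Data)

to see the type and exact contents of each token. (token.Type is used here instead of the %T verb because %T reports the Go type of the variable, which is html.Token for every token.)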

Posted by huangapple on October 1, 2014 at 07:45:56
Please keep this link when reposting: https://go.coder-hub.com/26132041.html