Scrape HTML drop-down lists with Go?

Question

I'm using the package golang.org/x/net/html to scrape data out of HTML pages and this has been working fine so far. However, I don't know how to extract data from a drop-down list like this:

<!DOCTYPE html>
<html>
<body>

<select name="car" size="1" id="car">
  <option value="volvo">Volvo</option>
  <option value="saab">Saab</option>
  <option value="vw">VW</option>
  <option value="audi" selected>Audi</option>
</select>

<select name="animal" size="1" id="animal">
  <option value="dog">Dog</option>
  <option value="cat" selected>Cat</option>
  <option value="badger">Badger</option>
  <option value="mouse">Mouse</option>
</select>

</body>
</html>

I want to extract the pre-selected options, so the result becomes this:

car = audi
animal = cat

How can I accomplish this? In case golang.org/x/net/html is not capable of doing what I want, what else can I do to extract the data?

Answer 1

Score: 0

You absolutely can do it with golang.org/x/net/html, using its tokenizer:

package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	// Placeholder: s should hold the HTML source of the page to scrape.
	s := "html"

	// result maps each <select> id to the value of its selected <option>.
	result := make(map[string]string)
	d := html.NewTokenizer(strings.NewReader(s))
	currID := "" // id of the most recently opened <select>
	for {
		tokenType := d.Next()
		if tokenType == html.ErrorToken {
			// Reached at the end of the input as well as on a real error.
			break
		}

		token := d.Token()
		switch tokenType {
		case html.StartTagToken:
			if token.Data == "select" {
				// Remember which drop-down the following options belong to.
				for _, a := range token.Attr {
					if a.Key == "id" {
						currID = a.Val
					}
				}
			}
			if token.Data == "option" {
				// A bare "selected" attribute marks the pre-selected option.
				isSelected := false
				for _, a := range token.Attr {
					if a.Key == "selected" {
						isSelected = true
					}
				}
				if isSelected {
					for _, a := range token.Attr {
						if a.Key == "value" {
							result[currID] = a.Val
						}
					}
				}
			}
		}
	}

	fmt.Printf("%v\n", result)
}

P.S. this code can be improved.

Answer 2

Score: 0

Maybe use gokogiri for xpath selectors:

car, _ := doc.Search("//select[@id='car']/option[@selected]/text()")
animal, _ := doc.Search("//select[@id='animal']/option[@selected]/text()")
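
If pulling in a libxml2 binding is not an option, one more route might be the standard library's encoding/xml decoder in lenient mode. This is only a sketch under assumptions: it swaps in a different parser than the ones named in this thread, the names selectedOptions, lookup and page are made up for illustration, the result is keyed by the name attribute, and the markup is expected to have balanced tags. Real-world pages are often messier, and golang.org/x/net/html is likely the more forgiving choice for those.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// page is a trimmed copy of the sample document from the question.
const page = `<!DOCTYPE html>
<html>
<body>
<select name="car" size="1" id="car">
  <option value="volvo">Volvo</option>
  <option value="audi" selected>Audi</option>
</select>
<select name="animal" size="1" id="animal">
  <option value="dog">Dog</option>
  <option value="cat" selected>Cat</option>
</select>
</body>
</html>`

// lookup returns the value of the attribute called key and whether it exists.
func lookup(el xml.StartElement, key string) (string, bool) {
	for _, a := range el.Attr {
		if strings.EqualFold(a.Name.Local, key) {
			return a.Value, true
		}
	}
	return "", false
}

// selectedOptions (hypothetical helper) maps the name of every <select> in
// doc to the value of the <option> that carries a selected attribute.
func selectedOptions(doc string) (map[string]string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	// Lenient settings: accept bare attributes such as "selected",
	// auto-close HTML void elements, and know the HTML named entities.
	dec.Strict = false
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	result := make(map[string]string)
	current := "" // name of the <select> being read, "" outside of one
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return result, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			switch strings.ToLower(t.Name.Local) {
			case "select":
				current, _ = lookup(t, "name")
			case "option":
				if _, ok := lookup(t, "selected"); ok && current != "" {
					result[current], _ = lookup(t, "value")
				}
			}
		case xml.EndElement:
			if strings.ToLower(t.Name.Local) == "select" {
				current = ""
			}
		}
	}
}

func main() {
	selected, err := selectedOptions(page)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, name := range []string{"car", "animal"} {
		fmt.Printf("%s = %s\n", name, selected[name])
	}
}
```

Setting Strict to false is what lets the bare selected attribute through. For the trimmed sample page this should print car = audi and animal = cat. A document with unclosed tags would come back as an error rather than a partial result.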

huangapple
  • Posted on March 26, 2017 at 23:07:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/43030443.html