Parsing HTML with go-colly and function returns an empty slice

# Question

I'm parsing a website with the colly framework and something is going wrong. I have a very basic function, `getWeeks()`, that should grab and return some values, yet I'm getting an empty slice instead.

```go
func getWeeks(c *colly.Collector) []string {
	var wks []string
	c.OnHTML("div.ltbluediv", func(div *colly.HTMLElement) {
		weekName := div.DOM.Find("span").Text() // a string: "Week 1", "Week 2", etc.
		wks = append(wks, weekName)             // weekName has an actual, non-empty value
		// Printing wks here shows the slice being populated correctly on each iteration.
	})
	return wks // returns []
}

func main() {
	c := colly.NewCollector()

	w := getWeeks(c)
	fmt.Println(w) // []

	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64)")
	})

	c.Visit("target url")
}
```



# Answer 1
**Score**: 2


**tl;dr**: The slice header is updated inside the `OnHTML` callback, but the value you print in `main` is the old slice header. Work with a `*[]string` instead.

First of all, the callback you pass to `c.OnHTML` only runs once you call `c.Visit`, so printing `w` right after `getWeeks` shows an empty slice in any case.
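
To make that ordering concrete, here is a minimal sketch that uses a hypothetical `Foo` type in place of `colly.Collector`. Its `OnHTML` and `Visit` methods, and the `countCalls` helper, are assumptions made for illustration and are not colly's implementation. `OnHTML` only stores the callback; nothing runs until `Visit` is called:

```go
package main

import "fmt"

// Foo is a hypothetical stand-in for colly.Collector.
type Foo struct {
	callbacks []func(text string)
}

// OnHTML registers a callback without running it.
// The selector is ignored in this stand-in.
func (f *Foo) OnHTML(selector string, cb func(text string)) {
	f.callbacks = append(f.callbacks, cb)
}

// Visit simulates a page load that yields three matching elements
// and invokes every registered callback for each of them.
func (f *Foo) Visit(url string) {
	for _, text := range []string{"Week 1", "Week 2", "Week 3"} {
		for _, cb := range f.callbacks {
			cb(text)
		}
	}
}

// countCalls reports how many times the callback ran before and after Visit.
func countCalls() (before, after int) {
	c := &Foo{}
	calls := 0
	c.OnHTML("div.ltbluediv", func(text string) { calls++ })
	before = calls
	c.Visit("target url")
	after = calls
	return before, after
}

func main() {
	before, after := countCalls()
	fmt.Println("calls before Visit:", before) // calls before Visit: 0
	fmt.Println("calls after Visit:", after)   // calls after Visit: 3
}
```

In the question's `main`, `fmt.Println(w)` sits in the "before" position, so it could not print anything other than `[]`.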

However, it would still be an empty slice even if you printed it after `c.Visit`. Why?

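A slice in Go is implemented as a small data structure called a slice header (more info: [1](https://stackoverflow.com/questions/52380391/print-address-of-slice-in-golang), [2](https://stackoverflow.com/questions/54195834/how-to-inspect-slice-header)).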

When you assign the return value of `getWeeks`, you're essentially copying the slice header, including its fields `Data`, `Len` and `Cap`. You can see this in [this playground](https://go.dev/play/p/0mi2_qLthn8) by printing the addresses of the two slice variables with the `%p` verb (using some other struct instead of go-colly to make the example self-contained):

```go
// Foo is the stand-in for colly.Collector defined in the linked playground.
func getWeeks(c *Foo) []string {
	var wks []string
	c.OnHTML("div.ltbluediv", func(text string) {
		weekName := text
		wks = append(wks, weekName)
	})
	fmt.Printf("%p\n", &wks) // address of the slice header inside getWeeks
	return wks
}

func main() {
	c := &Foo{}

	w := getWeeks(c)

	c.Visit("target url")
	fmt.Printf("%p\n", &w) // address of the copied slice header in main
}
```

This prints two different memory addresses:

```
0xc0000ac030
0xc0000ac018
```
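
The same copy semantics show up without any callback at all. As a minimal sketch (the helper name `appendTo` is hypothetical and introduced only for this illustration), appending through one copy of a slice header leaves the other copy's length untouched:

```go
package main

import "fmt"

// appendTo receives its own copy of the slice header. append updates and
// returns that copy; the caller's header is not modified.
func appendTo(s []string, v string) []string {
	s = append(s, v)
	return s
}

func main() {
	var a []string
	b := appendTo(a, "Week 1")
	fmt.Println(len(a), a) // 0 []
	fmt.Println(len(b), b) // 1 [Week 1]
}
```

In the question, `w` plays the role of `a`: it is a header copied before any `append` happened, and later appends inside the callback only update the callback's own `wks`.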

Now, if you keep digging around on Stack Overflow about slice and `append` behavior, you may find out that if the slice has sufficient capacity (1, 2, 3), `append` does not reallocate the backing array.

However, even if you make sure the backing array is the same by initializing `wks` with sufficient capacity, the value of `w` is still a copy of the original slice header, and therefore still has length 0. This is demonstrated in this playground, which prints:

```
in getWeeks reflect.SliceHeader{Data:0xc0000121b0, Len:0, Cap:3}
in callback reflect.SliceHeader{Data:0xc0000121b0, Len:1, Cap:3}
in callback reflect.SliceHeader{Data:0xc0000121b0, Len:2, Cap:3}
in callback reflect.SliceHeader{Data:0xc0000121b0, Len:3, Cap:3}
[]
in main reflect.SliceHeader{Data:0xc0000121b0, Len:0, Cap:3}
```

You could adjust the length of `w` by reslicing it (playground):

```go
c.Visit("target url")
w = w[0:3]
fmt.Println(w) // [foo bar baz]
```

But this means you need to know beforehand both a reasonable capacity that doesn't cause reallocation and the final length to reslice to.
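
The shared-backing-array behavior can be reproduced in a few lines. The following is a sketch under stated assumptions: the `demo` function and its `add` closure are hypothetical names standing in for `getWeeks` and the `OnHTML` callback, and the values `foo`, `bar`, `baz` mirror the output shown above.

```go
package main

import "fmt"

// demo returns the caller's early copy of the slice header together with the
// closure's own slice after three appends.
func demo() (callerCopy, inner []string) {
	wks := make([]string, 0, 3) // enough capacity, so append never reallocates
	add := func(s string) { wks = append(wks, s) }

	callerCopy = wks // header copied here: Len 0, Cap 3
	for _, s := range []string{"foo", "bar", "baz"} {
		add(s)
	}
	return callerCopy, wks
}

func main() {
	w, wks := demo()
	fmt.Println(len(w), cap(w))     // 0 3
	fmt.Println(len(wks), cap(wks)) // 3 3

	// Both headers point at the same backing array.
	fmt.Println(&w[:1][0] == &wks[0]) // true

	// Reslicing up to the capacity exposes the elements written by the closure.
	fmt.Println(w[:cap(w)]) // [foo bar baz]
}
```

This only works because the capacity was chosen up front. With a fourth append, `wks` would move to a new backing array and `w` would no longer see the new elements at all.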

Instead, return a pointer to a slice:

```go
func getWeeks(c *colly.Collector) *[]string {
	wks := &[]string{}
	c.OnHTML("div.ltbluediv", func(div *colly.HTMLElement) {
		weekName := div.DOM.Find("span").Text()
		*wks = append(*wks, weekName) // updates the header that the caller also points to
	})
	return wks
}
```

Or pass a pointer into `getWeeks`:

```go
func getWeeks(c *colly.Collector, wks *[]string) {
	c.OnHTML("div.ltbluediv", func(div *colly.HTMLElement) {
		weekName := div.DOM.Find("span").Text()
		*wks = append(*wks, weekName)
	})
}
```

Fixed playground: https://go.dev/play/p/yhq8YYnkFsv
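
For a version that runs without network access, here is a minimal sketch of the pointer-returning variant. It assumes a hypothetical `Foo` type in place of `colly.Collector`; its `OnHTML` and `Visit` methods are illustrative stand-ins and not colly's implementation.

```go
package main

import "fmt"

// Foo is a hypothetical stand-in for colly.Collector.
type Foo struct {
	callbacks []func(text string)
}

// OnHTML registers a callback; the selector is ignored in this stand-in.
func (f *Foo) OnHTML(selector string, cb func(text string)) {
	f.callbacks = append(f.callbacks, cb)
}

// Visit simulates a page load that yields three matching elements.
func (f *Foo) Visit(url string) {
	for _, text := range []string{"Week 1", "Week 2", "Week 3"} {
		for _, cb := range f.callbacks {
			cb(text)
		}
	}
}

// getWeeks registers the callback and returns a pointer to the slice, so the
// caller and the callback share one slice header.
func getWeeks(c *Foo) *[]string {
	wks := &[]string{}
	c.OnHTML("div.ltbluediv", func(text string) {
		*wks = append(*wks, text)
	})
	return wks
}

func main() {
	c := &Foo{}
	w := getWeeks(c)
	fmt.Println(*w) // []

	c.Visit("target url")
	fmt.Println(*w) // [Week 1 Week 2 Week 3]
}
```

Note that the slice must still be read after `Visit` returns; the pointer fixes the stale header, not the ordering.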

huangapple
  • Published on May 3, 2022, 17:54:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/72097676.html