Parsing HTML with go-colly and function returns an empty slice
Question
I'm parsing a website with the colly framework and something is going wrong. I have a very basic function, `getWeeks()`, that should grab and return some strings, yet I'm getting an empty slice instead.
```go
func getWeeks(c *colly.Collector) []string {
	var wks []string
	c.OnHTML("div.ltbluediv", func(div *colly.HTMLElement) {
		weekName := div.DOM.Find("span").Text() // a string: "Week 1", "Week 2", etc.
		wks = append(wks, weekName)             // weekName has an actual value; it is not empty
		// If `wks` is printed here, it correctly shows the slice being populated on each iteration
	})
	return wks // returns []
}

func main() {
	c := colly.NewCollector()
	w := getWeeks(c)
	fmt.Println(w) // []
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64)")
	})
	c.Visit("target url")
}
```
# Answer 1
**Score**: 2
**tl;dr**: The slice header is updated inside the `OnHTML` callback, but the value you print in `main` is the old slice header. Work with a `*[]string` instead.
First of all, the callback you pass to `c.OnHTML` only runs after you call `c.Visit`, so printing `w` right after `getWeeks` shows an empty slice in any case.
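A minimal sketch of that timing, using a hypothetical `fakeCollector` type in place of `colly.Collector` (its `OnHTML` and `Visit` methods are assumptions that only mimic the register-now, run-later behaviour; they are not colly's actual implementation):

```go
package main

import "fmt"

// fakeCollector is a hypothetical stand-in for colly.Collector.
// It only stores callbacks and runs them later, which is the
// behaviour relevant here.
type fakeCollector struct {
	callbacks []func(text string)
}

// OnHTML only registers the callback; nothing runs yet.
func (c *fakeCollector) OnHTML(selector string, f func(text string)) {
	c.callbacks = append(c.callbacks, f)
}

// Visit pretends the page contained three matching elements
// and invokes every registered callback for each of them.
func (c *fakeCollector) Visit(url string) {
	for _, text := range []string{"Week 1", "Week 2", "Week 3"} {
		for _, f := range c.callbacks {
			f(text)
		}
	}
}

// countCalls reports how many times the callback had run
// before and after Visit.
func countCalls() (before, after int) {
	c := &fakeCollector{}
	calls := 0
	c.OnHTML("div.ltbluediv", func(text string) { calls++ })
	before = calls
	c.Visit("target url")
	after = calls
	return before, after
}

func main() {
	before, after := countCalls()
	fmt.Println(before, after) // 0 3
}
```

Registering the callback does no work by itself, so anything read between `OnHTML` and `Visit` still sees the initial state.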
However, the slice would still be empty even if you printed it after `c.Visit`. Why?
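A slice in Go is implemented as a data structure called a slice header (more info: [1](https://stackoverflow.com/questions/52380391/print-address-of-slice-in-golang), [2](https://stackoverflow.com/questions/54195834/how-to-inspect-slice-header)).

When you assign the return value of `getWeeks`, you're essentially copying the slice header, including its fields `Data`, `Len` and `Cap`. You can see it in [this playground](https://go.dev/play/p/0mi2_qLthn8) by printing the addresses of the slices with the `%p` verb (using some other struct instead of go-colly to make the example self-contained):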
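```go
// Foo stands in for colly.Collector; see the playground for its definition.
func getWeeks(c *Foo) []string {
	var wks []string
	c.OnHTML("div.ltbluediv", func(text string) {
		weekName := text
		wks = append(wks, weekName)
	})
	fmt.Printf("%p\n", &wks) // address of the variable inside getWeeks
	return wks
}

func main() {
	c := &Foo{}
	w := getWeeks(c)
	c.Visit("target url")
	fmt.Printf("%p\n", &w) // address of the copy in main
}
```

This prints two different memory addresses:

```
0xc0000ac030
0xc0000ac018
```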
Now, if you keep fishing around on Stack Overflow about slice and `append` behavior, you may find out that if the slice has sufficient capacity (1, 2, 3), the backing array is not reallocated.

However, even if you make sure the backing array stays the same by initializing `wks` with sufficient capacity, the value of `w` is still a copy of the original slice header, and therefore has length 0. The playground for this case prints:

```
in getWeeks reflect.SliceHeader{Data:0xc0000121b0, Len:0, Cap:3}
in callback reflect.SliceHeader{Data:0xc0000121b0, Len:1, Cap:3}
in callback reflect.SliceHeader{Data:0xc0000121b0, Len:2, Cap:3}
in callback reflect.SliceHeader{Data:0xc0000121b0, Len:3, Cap:3}
[]
in main reflect.SliceHeader{Data:0xc0000121b0, Len:0, Cap:3}
```
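The same effect can be reproduced without any collector at all. The sketch below (the name `headerCopy` is made up for illustration) copies a slice header, appends through the original only, and then compares the two lengths and the address of the first element:

```go
package main

import "fmt"

// headerCopy appends through the original slice only and reports
// the length of the original, the length of the copy, and whether
// both still point at the same backing array.
func headerCopy() (origLen, copyLen int, sameArray bool) {
	orig := make([]string, 0, 3) // capacity 3, so append will not reallocate
	cp := orig                   // copies the header: pointer, length, capacity

	orig = append(orig, "foo", "bar", "baz")

	// Reslice cp within its capacity to peek at the shared backing array.
	sameArray = &orig[0] == &cp[:1][0]
	return len(orig), len(cp), sameArray
}

func main() {
	fmt.Println(headerCopy()) // 3 0 true
}
```

The data is shared, but the length lives in each header separately, which is why the copy still reports zero elements.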
You could adjust the length of `w` by reslicing it (playground):

```go
c.Visit("target url")
w = w[0:3]
fmt.Println(w) // [foo bar baz]
```

But this means you need to know beforehand both a capacity large enough to avoid reallocation and the final length to reslice to.
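To see why relying on capacity is fragile, here is a small sketch (the function name `resliceAfterGrowth` is illustrative) in which one extra `append` exceeds the initial capacity. From that point on, the original slice lives in a new backing array and the stale copy can no longer see later elements, no matter how it is resliced:

```go
package main

import "fmt"

// resliceAfterGrowth appends more items than the initial capacity allows
// and returns the up-to-date slice plus what a reslice of the stale copy sees.
func resliceAfterGrowth() (fresh, stale []string) {
	orig := make([]string, 0, 2) // room for two items only
	cp := orig                   // stale copy of the header

	orig = append(orig, "foo", "bar") // fits: written into the shared array
	orig = append(orig, "baz")        // exceeds cap: orig moves to a new array

	return orig, cp[:cap(cp)]
}

func main() {
	fresh, stale := resliceAfterGrowth()
	fmt.Println(fresh) // [foo bar baz]
	fmt.Println(stale) // [foo bar]
}
```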
Instead, return a pointer to a slice:

```go
func getWeeks(c *colly.Collector) *[]string {
	wks := &[]string{}
	c.OnHTML("div.ltbluediv", func(div *colly.HTMLElement) {
		weekName := div.DOM.Find("span").Text()
		*wks = append(*wks, weekName) // updates the header the caller also points to
	})
	return wks
}
```
Or pass a pointer into `getWeeks`:

```go
func getWeeks(c *colly.Collector, wks *[]string) {
	c.OnHTML("div.ltbluediv", func(div *colly.HTMLElement) {
		weekName := div.DOM.Find("span").Text()
		*wks = append(*wks, weekName)
	})
}
```
Fixed playground: https://go.dev/play/p/yhq8YYnkFsv
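For completeness, here is a self-contained sketch of the pointer-returning variant. It again assumes a hypothetical `fakeCollector` stand-in rather than the real `colly.Collector`, and its callback takes a plain string instead of a `*colly.HTMLElement`:

```go
package main

import "fmt"

// fakeCollector is a hypothetical stand-in for colly.Collector;
// it stores callbacks and runs them when Visit is called.
type fakeCollector struct {
	callbacks []func(text string)
}

func (c *fakeCollector) OnHTML(selector string, f func(text string)) {
	c.callbacks = append(c.callbacks, f)
}

// Visit pretends the page contained three matching elements.
func (c *fakeCollector) Visit(url string) {
	for _, text := range []string{"Week 1", "Week 2", "Week 3"} {
		for _, f := range c.callbacks {
			f(text)
		}
	}
}

// getWeeks registers the callback and returns a pointer to the slice,
// so the caller observes every later append.
func getWeeks(c *fakeCollector) *[]string {
	wks := &[]string{}
	c.OnHTML("div.ltbluediv", func(text string) {
		*wks = append(*wks, text)
	})
	return wks
}

// run wires everything together in the required order:
// register, visit, and only then read the result.
func run() []string {
	c := &fakeCollector{}
	w := getWeeks(c)
	c.Visit("target url")
	return *w
}

func main() {
	fmt.Println(run()) // [Week 1 Week 2 Week 3]
}
```

With the real collector the ordering is the same: register the callback via `getWeeks`, call `c.Visit`, and only then dereference the returned pointer.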