Web scraping using Golang Colly: how to handle an XML path that is not found?
Question
I am using Colly to scrape an e-commerce website, and I will loop over many products.
Here is the snippet of my code that gets a sub-title:
c.OnXML("/html/body/div[4]/div/div[3]/div[2]/div/div[1]/div[3]/div/div/h1/1234", func(e *colly.XMLElement) {
	fmt.Println(e.Text)
})
However, not all products have a sub-title, so the above XML path does not work in all cases.
When I reach a product that does not have a sub-title, my code crashes with this error:
panic: expression must evaluate to a node-set
Here is my code so far:
c := colly.NewCollector()
c.OnError(func(_ *colly.Response, err error) {
	log.Println("Something went wrong:", err)
})
// Sub-title
c.OnXML("/html/body/div[4]/div/div[3]/div[2]/div/div[1]/div[3]/div/div/h1/1234", func(e *colly.XMLElement) {
	fmt.Println(e.Text)
})
c.OnRequest(func(r *colly.Request) {
	fmt.Println("Visiting", r.URL)
})
c.Visit("https://www.lazada.vn/-i1701980654-s7563711492.html")
Here is what I want (pseudocode):
c.OnXML("/html/b.....v/h1/1234", func(e *colly.XMLElement) {
	if no error {
		fmt.Println("NO ERROR")
	} else {
		fmt.Println("GOT ERROR")
	}
})
Answer 1
Score: 1
I think I have found what went wrong in your code. Let me start from the end. The error originates from the panic statement at line 473 of the parse.go file. The xpath package has a method called parseNodeTest that performs the following check:
func (p *parser) parseNodeTest(n node, axeTyp string) (opnd node) {
	switch p.r.typ {
	case itemName:
		if p.r.canBeFunc && isNodeType(p.r) {
			var prop string
			switch p.r.name {
			case "comment", "text", "processing-instruction", "node":
				prop = p.r.name
			}
			var name string
			p.next()
			p.skipItem(itemLParens)
			if prop == "processing-instruction" && p.r.typ != itemRParens {
				checkItem(p.r, itemString)
				name = p.r.strval
				p.next()
			}
			p.skipItem(itemRParens)
			opnd = newAxisNode(axeTyp, name, "", prop, n)
		} else {
			prefix := p.r.prefix
			name := p.r.name
			p.next()
			if p.r.name == "*" {
				name = ""
			}
			opnd = newAxisNode(axeTyp, name, prefix, "", n)
		}
	case itemStar:
		opnd = newAxisNode(axeTyp, "", "", "", n)
		p.next()
	default:
		panic("expression must evaluate to a node-set")
	}
	return opnd
}
The value of p.r.typ is itemNumber (28). Neither the itemName nor the itemStar case matches it, so the switch falls through to the default branch, which panics. The methods invoked before parseNodeTest (you can see them in the call stack in your IDE) set typ to this value for the literal 1234, because a bare number is not a valid node test in an XPath step. This is what makes the XPath query invalid. To make it work, you have to remove the trailing 1234 and put a valid node test in its place.
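If a panic from a library call cannot be ruled out up front, Go's standard defer/recover mechanism can turn it into an ordinary error, which gives the "NO ERROR" / "GOT ERROR" branching asked for in the question. The following is only a minimal sketch using the standard library: safely and parseNodeTestStub are hypothetical names introduced for illustration, and the stub merely imitates the panic quoted above instead of calling Colly or the xpath package.

```go
package main

import "fmt"

// safely runs fn and converts any panic it raises into an ordinary error.
// (Hypothetical helper, not part of Colly or the xpath package.)
func safely(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	fn()
	return nil
}

// parseNodeTestStub is a hypothetical stand-in for the library call.
// It only mimics the panic from the default branch shown above.
func parseNodeTestStub(isNumber bool) {
	if isNumber {
		panic("expression must evaluate to a node-set")
	}
}

func main() {
	// Simulate the invalid step "1234", which is tokenized as a number.
	if err := safely(func() { parseNodeTestStub(true) }); err != nil {
		fmt.Println("GOT ERROR:", err)
	} else {
		fmt.Println("NO ERROR")
	}
}
```

In the program from the question, the call to wrap would presumably be the one that triggers evaluation of the query (most likely c.Visit), but that is an assumption about where the panic surfaces. Recovering only hides the symptom; fixing the XPath expression remains the actual solution.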
Let me know if this solves your issue, thanks!