Web scraping using Golang Colly: how to handle an XML path that is not found?

Question

I am using Colly to scrape an e-commerce website, and I will loop over many products.

Here is the snippet of my code that gets a sub-title:

c.OnXML("/html/body/div[4]/div/div[3]/div[2]/div/div[1]/div[3]/div/div/h1/1234", func(e *colly.XMLElement) {
	fmt.Println(e.Text)
})

However, not all products have a sub-title so the above XML path does not work for all cases.

When I reach a product that does not have a sub-title, my code crashes with the following error:

panic: expression must evaluate to a node-set

Here is my code so far:

c := colly.NewCollector()
c.OnError(func(_ *colly.Response, err error) {
	log.Println("Something went wrong:", err)
})

//Sub Title
c.OnXML("/html/body/div[4]/div/div[3]/div[2]/div/div[1]/div[3]/div/div/h1/1234", func(e *colly.XMLElement) {
	fmt.Println(e.Text)
})

c.OnRequest(func(r *colly.Request) {
	fmt.Println("Visiting", r.URL)
})

c.Visit("https://www.lazada.vn/-i1701980654-s7563711492.html")

Here is what I want (pseudocode):

c.OnXML("/html/b.....v/h1/1234", func(e *colly.XMLElement) {
	if no error { // pseudocode: the check I would like to be able to write
		fmt.Println("NO ERROR")
	} else {
		fmt.Println("GOT ERROR")
	}
})

Answer 1

Score: 1

Maybe I figured out what went wrong in your code. Let me start from the end. As you can see, the error originates from the panic statement at line 473 of the parse.go file. The xpath package has a method called parseNodeTest that performs the following check:

func (p *parser) parseNodeTest(n node, axeTyp string) (opnd node) {
	switch p.r.typ {
	case itemName:
		if p.r.canBeFunc && isNodeType(p.r) {
			var prop string
			switch p.r.name {
			case "comment", "text", "processing-instruction", "node":
				prop = p.r.name
			}
			var name string
			p.next()
			p.skipItem(itemLParens)
			if prop == "processing-instruction" && p.r.typ != itemRParens {
				checkItem(p.r, itemString)
				name = p.r.strval
				p.next()
			}
			p.skipItem(itemRParens)
			opnd = newAxisNode(axeTyp, name, "", prop, n)
		} else {
			prefix := p.r.prefix
			name := p.r.name
			p.next()
			if p.r.name == "*" {
				name = ""
			}
			opnd = newAxisNode(axeTyp, name, prefix, "", n)
		}
	case itemStar:
		opnd = newAxisNode(axeTyp, "", "", "", n)
		p.next()
	default:
		panic("expression must evaluate to a node-set")
	}
	return opnd
}

The value of p.r.typ is itemNumber (28), so the switch falls into the default branch and panics. The methods invoked before this one (you can see them in the call stack in your IDE) set typ to this value for the literal 1234, which is what makes the XPath query invalid. To make it work, you have to remove the 1234 and put a valid value in its place.
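
To make the point about the literal 1234 concrete, here is a minimal sketch of a pre-flight check that flags path steps beginning with a digit, since such a step is read as a number and not as an element name. The helper suspiciousSteps is hypothetical (it is not part of Colly or the xpath package), and it assumes a plain slash-separated path like the one in the question; it is not a full XPath parser and would be confused by predicates that themselves contain a slash.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// suspiciousSteps returns the steps of a simple slash-separated location
// path that start with a digit. Hypothetical helper for illustration only;
// it does not understand full XPath syntax.
func suspiciousSteps(path string) []string {
	var bad []string
	for _, step := range strings.Split(path, "/") {
		if step == "" {
			continue // a leading "/" or a "//" produces empty segments
		}
		if unicode.IsDigit(rune(step[0])) {
			bad = append(bad, step)
		}
	}
	return bad
}

func main() {
	path := "/html/body/div[4]/div/div[3]/div[2]/div/div[1]/div[3]/div/div/h1/1234"
	fmt.Println(suspiciousSteps(path))            // [1234]
	fmt.Println(suspiciousSteps("/html/body/h1")) // []
}
```

Predicates such as div[4] are not flagged because the step itself still starts with a name; only the bare 1234 step is reported.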
Let me know if this solves your issue, thanks!
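
As for the "NO ERROR / GOT ERROR" branching you asked for: fixing the expression is the real solution, but if you also want to guard against a panic, a rough sketch is to wrap the call in a function that uses recover and returns an error. The names safely and fakeVisit below are made up for illustration; fakeVisit is only a stand-in that panics with the same message as in your question, so the example runs without Colly. I am assuming the panic is raised in the same goroutine as the call you wrap, which would not hold for a collector running asynchronously.

```go
package main

import "fmt"

// safely runs fn and converts a panic into an ordinary error value.
// Hypothetical helper, not part of Colly.
func safely(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	fn()
	return nil
}

// fakeVisit is a stand-in for the scraping call. It panics with the
// message from the question when the query is marked as invalid.
func fakeVisit(queryIsValid bool) {
	if !queryIsValid {
		panic("expression must evaluate to a node-set")
	}
}

func main() {
	for _, valid := range []bool{true, false} {
		if err := safely(func() { fakeVisit(valid) }); err != nil {
			fmt.Println("GOT ERROR:", err)
		} else {
			fmt.Println("NO ERROR")
		}
	}
}
```

In real code you would put the call that triggers the panic (for example the Visit call) inside the closure passed to safely. Note that, as far as I know, a valid XPath that simply matches nothing does not panic at all: the callback is just never invoked for that page.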

huangapple
  • Posted on 2022-12-29 18:01:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/74949682.html