2022-12-29 18:01:28 · go

Web scraping using Golang Colly: how to handle XML path not found?

Question

I am using Colly to scrape an e-commerce website, and I will loop over many products.

Here is the snippet of my code that gets the sub-title:

c.OnXML("/html/body/div[4]/div/div[3]/div[2]/div/div[1]/div[3]/div/div/h1/1234", func(e *colly.XMLElement) {
	fmt.Println(e.Text)
})

However, not all products have a sub-title, so the above XML path does not work in all cases.

When I reach a product that does not have a sub-title, my code crashes with the following error:

panic: expression must evaluate to a node-set

Here is my code so far:

c := colly.NewCollector()
c.OnError(func(_ *colly.Response, err error) {
	log.Println("Something went wrong:", err)
})

// Sub-title
c.OnXML("/html/body/div[4]/div/div[3]/div[2]/div/div[1]/div[3]/div/div/h1/1234", func(e *colly.XMLElement) {
	fmt.Println(e.Text)
})

c.OnRequest(func(r *colly.Request) {
	fmt.Println("Visiting", r.URL)
})

c.Visit("https://www.lazada.vn/-i1701980654-s7563711492.html")

Here is what I want (pseudocode):

c.OnXML("/html/b.....v/h1/1234", func(e *colly.XMLElement) {
	if no error {
		fmt.Println("NO ERROR")
	} else {
		fmt.Println("GOT ERROR")
	}
})

Answer 1

Score: 1

I think I have figured out what went wrong in your code. Let me start from the end. The error originates from the panic statement at line 473 of the parse.go file. The xpath package has a method called parseNodeTest that performs the following check:

func (p *parser) parseNodeTest(n node, axeTyp string) (opnd node) {
	switch p.r.typ {
	case itemName:
		if p.r.canBeFunc && isNodeType(p.r) {
			var prop string
			switch p.r.name {
			case "comment", "text", "processing-instruction", "node":
				prop = p.r.name
			}
			var name string
			p.next()
			p.skipItem(itemLParens)
			if prop == "processing-instruction" && p.r.typ != itemRParens {
				checkItem(p.r, itemString)
				name = p.r.strval
				p.next()
			}
			p.skipItem(itemRParens)
			opnd = newAxisNode(axeTyp, name, "", prop, n)
		} else {
			prefix := p.r.prefix
			name := p.r.name
			p.next()
			if p.r.name == "*" {
				name = ""
			}
			opnd = newAxisNode(axeTyp, name, prefix, "", n)
		}
	case itemStar:
		opnd = newAxisNode(axeTyp, "", "", "", n)
		p.next()
	default:
		panic("expression must evaluate to a node-set")
	}
	return opnd
}

The value of p.r.typ is itemNumber (28), so the switch falls into the default branch and panics. The methods invoked before this one (you can see them in the call stack in your IDE) set the typ of the literal 1234 to that value, which makes the XPath query invalid. To make it work, you have to get rid of the 1234 and put a valid value in its place.
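
To illustrate why the trailing 1234 is the problem, here is a small stand-alone sketch. It is not the xpath package's code: the helper name firstNumericStep is hypothetical, and the check is deliberately simplified (it splits on "/" and only looks at the first character of each step, so it ignores predicates that contain slashes). It only shows that a step beginning with a digit cannot be read as an element name.

```go
package main

import (
	"fmt"
	"strings"
)

// firstNumericStep returns the first step of a slash-separated path that
// begins with a digit, or "" if there is none. Such a step is read as a
// number rather than an element name. (Hypothetical helper, simplified.)
func firstNumericStep(path string) string {
	for _, step := range strings.Split(path, "/") {
		if step != "" && step[0] >= '0' && step[0] <= '9' {
			return step
		}
	}
	return ""
}

func main() {
	// The path from the question ends in a numeric step.
	fmt.Printf("%q\n", firstNumericStep("/html/body/div[4]/div/div[3]/div[2]/div/div[1]/div[3]/div/div/h1/1234"))
	// Without the trailing 1234, no step is flagged.
	fmt.Printf("%q\n", firstNumericStep("/html/body/div[4]/div/div[3]/div[2]/div/div[1]/div[3]/div/div/h1"))
}
```

A check like this could be run on each path before it is handed to the collector, but it is only a rough guard and not a substitute for a real XPath parser.
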
Let me know if this solves your issue. Thanks!
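
As for the "no error / got error" branching asked for in the question, one general Go technique is to turn a panic into an ordinary error with recover. The following is a minimal sketch under stated assumptions: safely is a hypothetical helper, and the closure passed to it only simulates the panic message from the question. In a real program the closure would wrap whichever call panics (for example the c.Visit call, assuming the panic is raised on the calling goroutine). Fixing the XPath expression is still the actual cure; this only keeps a loop over many products from crashing.

```go
package main

import "fmt"

// safely runs fn and converts a panic raised inside fn into an error.
// (Hypothetical helper; not part of Colly or the xpath package.)
func safely(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	fn()
	return nil
}

func main() {
	// Simulated failure: stands in for the call that panics.
	err := safely(func() {
		panic("expression must evaluate to a node-set")
	})
	if err != nil {
		fmt.Println("GOT ERROR:", err)
	} else {
		fmt.Println("NO ERROR")
	}
}
```
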
