Inconsistent results from rvest::html_nodes()

Question

When I run this:

# load required libraries
library(rvest)
library(tidyverse)
 
# provide sample url
url <- "https://www.thewholesaler.co.uk/cgi-bin/go.cgi?id=3074"
 
# read html
doc <- rvest::read_html(url)

# parse html 1st time
doc %>% rvest::html_nodes("meta")

# parse html 2nd time
doc %>% rvest::html_nodes("meta")

I get this:

> # parse html 1st time
> doc %>% rvest::html_nodes("meta")
{xml_nodeset (2)}
[1] <meta name="robots" content="noindex">\n
[2] <meta http-equiv="REFRESH" content="0;URL=http://www.puckator-dropship.co.uk/gifts/\t">
> 
> # parse html 2nd time
> doc %>% rvest::html_nodes("meta")
{xml_nodeset (3)}
[1] <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n
[2] <meta name="robots" content="noindex">\n
[3] <meta http-equiv="REFRESH" content="0;URL=http://www.puckator-dropship.co.uk/gifts/\t">

Why do the first and second calls return different results, even though the code is identical and the HTML document being queried has not changed?

I'm running rvest v1.0.3.

Answer 1

Score: 1

Interesting question. I have not confirmed this, but my guess is that it comes from how results are cached. When you run html_nodes(doc, "meta"), I believe something is stored in local memory, and that storage is intertwined with XML encodings to some degree. When you call html_nodes(doc, "meta") again, R reads back what was stored last time. Because html_nodes is closely tied to XML encodings, one extra node comes back as well: the <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> element, which describes the encoding of the stored version. This gets into how R handles pointers (objects) and memory in general. I wish I were more of an expert on this, but after a little digging this is the most I could work out.
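One way to narrow this down is to check whether the extra node appears when nothing is printed between the two calls. The sketch below is only a diagnostic, not a confirmed explanation. It assumes that printing a nodeset serialises its nodes with as.character(), and that serialisation may add the charset <meta> element to the parsed document in some versions of the underlying libxml2 library. The inline HTML string and the count_meta() helper are hypothetical stand-ins for the page in the question, used so the check does not need network access.

```r
library(rvest)

# Hypothetical inline document standing in for the page in the question.
html <- '<html><head><meta name="robots" content="noindex"></head><body><p>x</p></body></html>'

# Hypothetical helper: count <meta> nodes without printing anything.
count_meta <- function(doc) length(html_nodes(doc, "meta"))

# Case A: query twice and never print or serialise the result.
doc_a <- read_html(html)
a_first <- count_meta(doc_a)
a_second <- count_meta(doc_a)

# Case B: serialise the first result in between, as printing is assumed to do.
doc_b <- read_html(html)
b_first <- count_meta(doc_b)
invisible(as.character(html_nodes(doc_b, "meta")))
b_second <- count_meta(doc_b)

# Compare the counts side by side.
data.frame(
  case = c("no printing", "serialised in between"),
  first = c(a_first, b_first),
  second = c(a_second, b_second)
)
```

If the second count only grows in case B, the change is tied to printing the result rather than to calling html_nodes() twice. The outcome may differ between libxml2 versions, so it is worth running on the same machine that showed the problem.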

If you wanted, you could submit an issue to the tidyverse/rvest repository on GitHub.
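If you do open an issue, version details usually help whoever picks it up. A minimal sketch for collecting them follows. It assumes the xml2 package is installed alongside rvest (rvest depends on it) and uses xml2::libxml2_version() to report the underlying C library version. The versions list is just an illustrative name.

```r
# Collect version details that are commonly included in a bug report.
versions <- list(
  R = as.character(getRversion()),
  rvest = as.character(utils::packageVersion("rvest")),
  xml2 = as.character(utils::packageVersion("xml2")),
  libxml2 = as.character(xml2::libxml2_version())
)

# Show the collected values in a compact form.
str(versions)
```

Pasting this output next to a small self-contained example makes the report easier to reproduce.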
