Inconsistent results from rvest::html_nodes()
Question
When I run this:
# load required libraries
library(rvest)
library(tidyverse)
# provide sample url
url <- "https://www.thewholesaler.co.uk/cgi-bin/go.cgi?id=3074"
# read html
doc <- rvest::read_html(url)
# parse html 1st time
doc %>% rvest::html_nodes("meta")
# parse html 2nd time
doc %>% rvest::html_nodes("meta")
I get this:
> # parse html 1st time
> doc %>% rvest::html_nodes("meta")
{xml_nodeset (2)}
[1] <meta name="robots" content="noindex">\n
[2] <meta http-equiv="REFRESH" content="0;URL=http://www.puckator-dropship.co.uk/gifts/\t">
>
> # parse html 2nd time
> doc %>% rvest::html_nodes("meta")
{xml_nodeset (3)}
[1] <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n
[2] <meta name="robots" content="noindex">\n
[3] <meta http-equiv="REFRESH" content="0;URL=http://www.puckator-dropship.co.uk/gifts/\t">
Why am I getting different results between the 1st and 2nd parsing even though the code is identical and the html doc being queried remains unchanged?
I'm running rvest v1.0.3.
Answer 1
Score: 1
Interestingly, I think this comes from how R caches calls. When you run html_nodes(doc, "meta"), I believe R stores the result in local memory (which is intertwined with XML encodings, to some degree). When you call html_nodes(doc, "meta") again, R decides to just read what it stored last time. However, since html_nodes is closely tied to XML encodings, one of the nodes, the one describing how R stored its cached version, is returned as well. This gets into how R handles pointers (objects) and how it handles memory in general. I wish I were more of an expert on this, but after a little digging, this is the most I could distill. This is very interesting!
If you wanted, you could submit an issue to the tidyverse/rvest repository on GitHub.
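One way to probe this explanation is to check whether the number of matching nodes changes when the result is only counted, not printed. The sketch below is not from the original post: it assumes a small inline HTML string as a stand-in for the page in the question (so no network access is needed), and it compares length() of the nodeset before and after a single print(). Whether the two counts differ may depend on the installed xml2 and libxml2 versions.

```r
library(rvest)

# Hypothetical inline stand-in for the page in the question (assumption).
html <- '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
doc <- read_html(html)

# Count the matching nodes without printing the nodeset.
n_before <- length(html_nodes(doc, "meta"))

# Print the nodeset once, as happens implicitly at the console in the question.
print(html_nodes(doc, "meta"))

# Count again after the print.
n_after <- length(html_nodes(doc, "meta"))

# Compare the two counts side by side.
c(before = n_before, after = n_after)
```

If the two counts differ, that would suggest the document object itself changed between calls rather than a stored result being reused, which would be a useful detail to include in a GitHub issue.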