Trouble with web scraping in R?

Question

In R, I am trying to scrape the text of the following two webpages using the CSS selector #article:

https://www.federalreserve.gov/monetarypolicy/fomcminutes20230201.htm
https://www.federalreserve.gov/monetarypolicy/fomcminutes20111102.htm

Here is an example of the code I run to web scrape the second link:

library(rvest)
library(stringr)

link2 <- "https://www.federalreserve.gov/monetarypolicy/fomcminutes20111102.htm"

page <- read_html(link2)
name <- page %>% html_nodes("#article") %>% html_text() # earlier this was #press
name <- str_c(name, collapse = " ")

On the first webpage, this code works fine and extracts all the desired text. However, when I run it on the second link (the one shown in the example code above), it returns only empty space and no actual text. Can anyone help me understand why this code works on the first link but not the second?

Answer 1

Score: 0

As noted in the comments, these two pages do not share the same layout and structure, so it is expected that the "#article" selector returns different results (for the second page, an empty set).
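One way to confirm which selector applies to which layout is to count how many elements each candidate matches. The sketch below is illustrative only: the two HTML fragments are tiny hypothetical stand-ins for the newer and older page layouts (the real pages are far larger), and `count_matches` is a helper name made up for this example.

```r
library(rvest)

# Hypothetical stand-ins for the two layouts, not the real page contents.
new_layout <- minimal_html('<div id="article"><p>Minutes, 2023</p></div>')
old_layout <- minimal_html('<div id="printThis"><p>Minutes, 2011</p></div>')

# Count how many elements each candidate selector matches on a page.
count_matches <- function(page, selectors = c("#article", "#printThis")) {
  vapply(selectors, function(sel) length(html_elements(page, sel)), integer(1))
}

count_matches(new_layout)
count_matches(old_layout)
```

Running the same helper on `read_html(link)` for each real URL should show which id each page actually uses; a count of zero means the selector matches nothing on that page.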

If you are sure there is no id clash, you can combine the selectors for both layouts and use "#article, #printThis" for both types of pages:

library(rvest)
library(stringr)

links <- c(link1 = "https://www.federalreserve.gov/monetarypolicy/fomcminutes20230201.htm",
           link2 = "https://www.federalreserve.gov/monetarypolicy/fomcminutes20111102.htm")

lapply(links, \(lnk)
         read_html(lnk) |>
         html_elements("#article, #printThis") |>
         html_text() |>
         str_squish() |>
         str_trunc(80)
)
#> $link1
#> [1] "Minutes of the Federal Open Market Committee January 31–February 1, 2023 A jo..."
#> 
#> $link2
#> [1] "Print Minutes of the Federal Open Market Committee November 1-2, 2011 FOMC Mi..."

Created on 2023-07-13 with reprex v2.0.2
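If you are not sure about an id clash, a possible alternative is to try the selectors in priority order and keep only the first one that matches. This is a minimal sketch, not part of the original answer: `extract_first_match` is a hypothetical helper, and the inline HTML is an invented page in which both ids are present.

```r
library(rvest)
library(stringr)

# Return the squished text of the first selector that matches anything.
extract_first_match <- function(page, selectors = c("#article", "#printThis")) {
  for (sel in selectors) {
    nodes <- html_elements(page, sel)
    if (length(nodes) > 0) {
      return(str_squish(html_text(nodes)))
    }
  }
  NA_character_  # no selector matched
}

# Hypothetical page where both ids exist; only "#article" is used.
both <- minimal_html(
  '<div id="article"> New  text </div><div id="printThis">Old text</div>'
)
result <- extract_first_match(both)
result
```

With the `links` vector above, this could be applied as `lapply(links, \(lnk) extract_first_match(read_html(lnk)))`. Compared with the combined selector, it avoids returning text from both containers when a page happens to have both ids.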

Posted by huangapple on 2023-07-14 03:55:10
Link to this post: https://go.coder-hub.com/76682825.html