July 14, 2023, 03:55:10

Trouble with web scraping in R?

Question

In R, I am trying to scrape the text of the following two web pages using the CSS selector #article:

https://www.federalreserve.gov/monetarypolicy/fomcminutes20230201.htm
https://www.federalreserve.gov/monetarypolicy/fomcminutes20111102.htm

Here is an example of the code I run to web scrape the second link:

library(rvest)    # read_html(), html_nodes(), html_text(), %>%
library(stringr)  # str_c()

link2 <- "https://www.federalreserve.gov/monetarypolicy/fomcminutes20111102.htm"
page <- read_html(link2)
name <- page %>% html_nodes("#article") %>% html_text() # earlier this was #press
name <- str_c(name, collapse = " ")

On the first page, this code works and extracts all the desired text. However, when I run it on the second link (the one shown in the code above), it returns only whitespace and no actual text. Can anyone help me understand why this code works on the first link but not the second?

Answer 1

Score: 0

As noted in the comments, those two pages do not share the same layout and structure, so it is expected that the "#article" selector returns different results (i.e. an empty set on one of them).
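One way to check which layout a page uses is to count how many nodes each candidate selector matches. The sketch below is only an illustration: `count_matches()` is a hypothetical helper, not part of rvest, and `demo_page` is an offline stand-in built with `minimal_html()` rather than the real Federal Reserve markup.

```r
library(rvest)

# Hypothetical helper (not from the original answer): for each CSS selector,
# count how many nodes it matches in the parsed page.
count_matches <- function(page, selectors) {
  vapply(selectors, \(sel) length(html_elements(page, sel)), integer(1))
}

# Assumed stand-in for a page that has no "#article" element.
demo_page <- minimal_html('<div id="printThis"><p>Minutes text</p></div>')

count_matches(demo_page, c("#article", "#printThis"))
#>   #article #printThis 
#>          0          1
```

Running the same helper on `read_html(link)` for each real URL should show which of the two ids is present on that page.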

If you are sure there is no id clash, you can combine the selectors for both layouts and use "#article, #printThis" on both types of pages:

library(rvest)
library(stringr)
links <- c(link1 = "https://www.federalreserve.gov/monetarypolicy/fomcminutes20230201.htm",
           link2 = "https://www.federalreserve.gov/monetarypolicy/fomcminutes20111102.htm")
lapply(links, \(lnk)
         read_html(lnk) |>
         html_elements("#article, #printThis") |>
         html_text() |>
         str_squish() |>
         str_trunc(80)
)
#> $link1
#> [1] "Minutes of the Federal Open Market Committee January 31–February 1, 2023 A jo..."
#> 
#> $link2
#> [1] "Print Minutes of the Federal Open Market Committee November 1-2, 2011 FOMC Mi..."

Created on 2023-07-13 with reprex v2.0.2
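If both ids do occur on the same page, a combined selector returns a node for every match. A possible alternative, sketched here as an assumption and not part of the original answer, is to try the selectors in order and keep the first one that yields non-blank text. `extract_first_match()` is a hypothetical helper, and the two `minimal_html()` documents are offline stand-ins, not the real page markup.

```r
library(rvest)
library(stringr)

# Hypothetical helper: return the squished text of the first selector that
# matches at least one node with non-blank text, or NA if none does.
extract_first_match <- function(page, selectors = c("#article", "#printThis")) {
  for (sel in selectors) {
    txt <- page |>
      html_elements(sel) |>
      html_text() |>
      str_squish()
    txt <- txt[txt != ""]  # drop matches that contain only whitespace
    if (length(txt) > 0) {
      return(str_c(txt, collapse = " "))
    }
  }
  NA_character_
}

# Assumed stand-ins for the two layouts.
new_layout <- minimal_html('<div id="article"><p>Minutes 2023</p></div>')
old_layout <- minimal_html(
  '<div id="article">   </div><div id="printThis"><p>Minutes 2011</p></div>'
)

extract_first_match(new_layout)
#> [1] "Minutes 2023"
extract_first_match(old_layout)
#> [1] "Minutes 2011"
```

Because blank matches are skipped, this also covers the case described in the question, where a selector matches but yields only whitespace.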
