April 17, 2023 15:29:38

Webscraping using Rvest

Question

I would like to webscrape the following website using R.

https://www.domain.com.au/suburb-profile/drummoyne-nsw-2047

I would like to pull the data in the table underneath the "Market Trends" heading.

I've tried the following, which does not work:

library(rvest)

url <- "https://www.domain.com.au/suburb-profile/drummoyne-nsw-2047"
page <- read_html(url)
contentnodes <- page %>%
  html_nodes("div.css-15dn4s") %>%
  html_attr("table") %>%
  jsonlite::fromJSON()

Thanks in advance.

Answer 1

Score: 1

You need to look at the intermediate results. If you step through your pipe one stage at a time, you can see where things go wrong: the very first selector already matches nothing.

contentnodes <- page %>%
  html_nodes("div.css-15dn4s")
contentnodes
# {xml_nodeset (0)}
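To make the "step through the pipe" idea concrete without hitting the live site, here is a minimal sketch on an inline document. The HTML is a hypothetical stand-in for the real page (only the class name is taken from the output in this answer), built with rvest's `minimal_html()`.

```r
library(rvest)

# Hypothetical stand-in for the real page: one table with a generated class name.
page <- minimal_html('
  <table class="css-15dn4s8">
    <tr><th>Type</th><th>Median price</th></tr>
    <tr><td>House</td><td>$1.685m</td></tr>
  </table>
')

# Step 1: check the original selector on its own before piping any further.
step1 <- html_elements(page, "div.css-15dn4s")
length(step1)

# Step 2: try a class-only selector with the complete class name.
step2 <- html_elements(page, ".css-15dn4s8")
length(step2)
```

An empty node set at step 1 means everything downstream (`html_attr()`, `fromJSON()`) is working on nothing, so the selector is the thing to fix first.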

Let's find it a different way. To select by HTML class, use a CSS selector that starts with `.`. Note that the full class name ends in `8`, and that the element is a `<table>`, not a `<div>`:

page %>%
  html_nodes(".css-15dn4s8")
# {xml_nodeset (1)}
# [1] <table aria-describedby="market-data-context" class="css-15dn4s8">\n<thea ...
page %>%
  html_nodes(".css-15dn4s8") %>%
  html_table()
# [[1]]
# # A tibble: 7 × 7
#   Bed@media (max-width:720px){.css…¹ Type  Media…² Avg d…³ Clear…⁴ Sold …⁵ ``   
#   <chr>                              <chr> <chr>   <chr>   <chr>     <int> <chr>
# 1 .css-gsqvet{stroke-linejoin:round… House $1.685m 32 days 67%          17 .css…
# 2 3                                  House $2.242m 41 days 54%          28 Open 
# 3 4                                  House $3.4m   23 days 61%          23 Open 
# 4 5                                  House $3.745m 66 days 77%          15 Open 
# 5 1                                  Unit  $740k   -       -            11 Open 
# 6 2                                  Unit  $1.1m   46 days 74%          68 Open 
# 7 3                                  Unit  $1.825m 60 days 74%          27 Open 
# # … with abbreviated variable names
# #   ¹`Bed@media (max-width:720px){.css-1lps80v{position:absolute !important;height:1px;width:1px;overflow:hidden;-webkit-clip:rect(1px 1px 1px 1px);clip:rect(1px 1px 1px 1px);-webkit-clip:rect(1px,1px,1px,1px);clip:rect(1px,1px,1px,1px);}}rooms`,
# #   ²`Median price`, ³`Avg days on market`, ⁴`Clearance rate`,
# #   ⁵`Sold this year`
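Class names like `css-15dn4s8` look machine-generated and may change when the site is rebuilt. The node printed above also carries an `aria-describedby="market-data-context"` attribute, so one possible alternative is to select on that instead. This is only a sketch on a hypothetical inline document; whether the attribute is more stable on the real site is an assumption.

```r
library(rvest)

# Hypothetical stand-in for the real page, reusing the attributes shown above.
page <- minimal_html('
  <table aria-describedby="market-data-context" class="css-15dn4s8">
    <thead><tr><th>Type</th><th>Median price</th></tr></thead>
    <tbody>
      <tr><td>House</td><td>$1.685m</td></tr>
      <tr><td>Unit</td><td>$740k</td></tr>
    </tbody>
  </table>
')

# Select the table by attribute value rather than by generated class name.
# html_element() returns a single node, so html_table() returns one tibble.
market <- page %>%
  html_element("table[aria-describedby='market-data-context']") %>%
  html_table()

market
```

Because `html_element()` (singular) is used here, there is no list to unwrap afterwards.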

This is a list containing one table; we can extract it using `[[`:

contentnodes <- page %>%
  html_nodes(".css-15dn4s8") %>%
  html_table()
contentnodes <- contentnodes[[1]]
as.data.frame(contentnodes)
# *** output flushed ***
contentnodes
# # A tibble: 7 × 7
#   Bed@media (max-width:720px){.css…¹ Type  Media…² Avg d…³ Clear…⁴ Sold …⁵ ``   
#   <chr>                              <chr> <chr>   <chr>   <chr>     <int> <chr>
# 1 .css-gsqvet{stroke-linejoin:round… House $1.685m 32 days 67%          17 .css…
# 2 3                                  House $2.242m 41 days 54%          28 Open 
# 3 4                                  House $3.4m   23 days 61%          23 Open 
# 4 5                                  House $3.745m 66 days 77%          15 Open 
# 5 1                                  Unit  $740k   -       -            11 Open 
# 6 2                                  Unit  $1.1m   46 days 74%          68 Open 
# 7 3                                  Unit  $1.825m 60 days 74%          27 Open 
# # … with abbreviated variable names
# #   ¹`Bed@media (max-width:720px){.css-1lps80v{position:absolute !important;height:1px;width:1px;overflow:hidden;-webkit-clip:rect(1px 1px 1px 1px);clip:rect(1px 1px 1px 1px);-webkit-clip:rect(1px,1px,1px,1px);clip:rect(1px,1px,1px,1px);}}rooms`,
# #   ²`Median price`, ³`Avg days on market`, ⁴`Clearance rate`,
# #   ⁵`Sold this year`
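The first column name and the first cell contain raw CSS. That suggests the cells embed `<style>` elements whose text gets picked up by `html_table()`, although the source does not confirm the exact markup. Assuming that is the cause, a possible cleanup is to remove those nodes with `xml2::xml_remove()` before extracting the table. The inline HTML below is a hypothetical reconstruction, not the real page.

```r
library(rvest)
library(xml2)

# Hypothetical reconstruction: <style> elements sitting inside table cells.
page <- minimal_html('
  <table class="css-15dn4s8">
    <tr><th>Bed<style>.css-1lps80v{position:absolute}</style>rooms</th><th>Type</th></tr>
    <tr><td><style>.css-gsqvet{fill:none}</style>2</td><td>House</td></tr>
    <tr><td>3</td><td>House</td></tr>
  </table>
')

# Remove every <style> node inside a table; this modifies `page` in place.
xml_remove(html_elements(page, "table style"))

# With the style nodes gone, only the visible cell text remains.
cleaned <- page %>%
  html_element("table") %>%
  html_table()

cleaned
```

Note that `xml_remove()` changes the parsed document itself, so re-read the page if you still need the original nodes.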

Answer 2

Score: 0

This seems to work.

suppressPackageStartupMessages({
  library(rvest)
  library(dplyr)
})
url <- "https://www.domain.com.au/suburb-profile/drummoyne-nsw-2047"
page <- read_html(url)
contentnodes <- page %>%
  html_element("div.css-0") %>%  # first element matching div.css-0
  html_table() %>%
  `[`(-1)                        # drop the first column
contentnodes
#> # A tibble: 7 × 6
#>   Type  `Median price` `Avg days on market` `Clearance rate` Sold this y…¹ ``   
#>   <chr> <chr>          <chr>                <chr>                    <int> <chr>
#&gt; 1 House $1.685m        32 days              67%                         17 .css…
#&gt; 2 House $2.242m        41 days              54%                         28 Open 
#&gt; 3 House $3.4m          23 days              61%                         23 Open 
#&gt; 4 House $3.745m        66 days              77%                         15 Open 
#&gt; 5 Unit  $740k          -                    -                           11 Open 
#&gt; 6 Unit  $1.1m          46 days              74%                         68 Open 
#&gt; 7 Unit  $1.825m        60 days              74%                         27 Open 
#> # … with abbreviated variable name ¹`Sold this year`

Created on 2023-04-17 with reprex v2.0.2
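The `` `[`(-1) `` step at the end of the pipe is just the bracket operator called as a function; with a negative index it drops the first column. A small base R sketch on a hypothetical data frame (values borrowed from the output above, the `Bedrooms` name is an assumption):

```r
# Hypothetical data frame standing in for the scraped table.
tbl <- data.frame(
  Bedrooms = c(3L, 4L),
  Type = c("House", "House"),
  `Median price` = c("$2.242m", "$3.4m"),
  check.names = FALSE
)

# Calling `[` as a function, as done at the end of the pipe ...
by_bracket <- `[`(tbl, -1)

# ... is the same as ordinary negative column indexing.
by_index <- tbl[-1]

names(by_bracket)
```

Dropping by position is brittle if the site ever reorders its columns, so selecting columns by name may be safer once the headers are clean.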
