Webscraping using Rvest

Question

I would like to webscrape the following website using R.

https://www.domain.com.au/suburb-profile/drummoyne-nsw-2047

I would like to pull the data from the table underneath the "Market Trends" heading.

I've tried the following, which does not work:

    library(rvest)

    url <- "https://www.domain.com.au/suburb-profile/drummoyne-nsw-2047"

    page <- read_html(url)

    contentnodes <- page %>%
      html_nodes("div.css-15dn4s") %>%
      html_attr("table") %>%
      jsonlite::fromJSON()

Thanks in advance.

Answer 1

Score: 1

You need to look at the intermediate results. If you step through your pipe one stage at a time, you can see where things go wrong. Here the very first selector already matches nothing:

    contentnodes <- page %>%
      html_nodes("div.css-15dn4s")
    contentnodes
    # {xml_nodeset (0)}
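An empty node set like the one above does not raise an error by itself, so the later steps of a pipe just receive nothing to work with. One way to catch this early is a small guard around the selector. The sketch below is only an illustration: `select_or_stop()` is a hypothetical helper (not part of rvest), and `demo_page` is a tiny stand-in document built with `minimal_html()`, not the real page.

```r
library(rvest)

# Hypothetical helper: fail loudly when a selector matches nothing,
# instead of passing an empty node set down the pipe.
select_or_stop <- function(page, css) {
  nodes <- html_elements(page, css)
  if (length(nodes) == 0) {
    stop("selector matched no nodes: ", css, call. = FALSE)
  }
  nodes
}

# Small stand-in document (an assumption, not the real page).
demo_page <- minimal_html(
  "<table><tr><th>Type</th></tr><tr><td>House</td></tr></table>"
)

# A selector that matches returns the nodes as usual.
found <- select_or_stop(demo_page, "table")
length(found)

# The selector from the question matches nothing, so the guard stops.
err_msg <- tryCatch(
  select_or_stop(demo_page, "div.css-15dn4s"),
  error = function(e) conditionMessage(e)
)
err_msg
```

With a guard like this the failure points at the selector itself rather than at a later step such as `jsonlite::fromJSON()`.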

Let's find it a different way. To select by HTML class, use a selector string with a leading `.`. The class here is `css-15dn4s8` (the selector in the question was missing the trailing `8`), and it sits on a `<table>` element rather than a `<div>`:

    page %>%
      html_nodes(".css-15dn4s8")
    # {xml_nodeset (1)}
    # [1] <table aria-describedby="market-data-context" class="css-15dn4s8">\n<thea ...

    page %>%
      html_nodes(".css-15dn4s8") %>%
      html_table()
    # [[1]]
    # # A tibble: 7 × 7
    #   Bed@media (max-width:720px){.css…¹ Type  Media…² Avg d…³ Clear…⁴ Sold …⁵ ``
    #   <chr>                              <chr> <chr>   <chr>   <chr>     <int> <chr>
    # 1 .css-gsqvet{stroke-linejoin:round… House $1.685m 32 days 67%          17 .css…
    # 2 3                                  House $2.242m 41 days 54%          28 Open
    # 3 4                                  House $3.4m   23 days 61%          23 Open
    # 4 5                                  House $3.745m 66 days 77%          15 Open
    # 5 1                                  Unit  $740k   -       -            11 Open
    # 6 2                                  Unit  $1.1m   46 days 74%          68 Open
    # 7 3                                  Unit  $1.825m 60 days 74%          27 Open
    # # … with abbreviated variable names
    # #   ¹`Bed@media (max-width:720px){.css-1lps80v{position:absolute !important;height:1px;width:1px;overflow:hidden;-webkit-clip:rect(1px 1px 1px 1px);clip:rect(1px 1px 1px 1px);-webkit-clip:rect(1px,1px,1px,1px);clip:rect(1px,1px,1px,1px);}}rooms`,
    # #   ²`Median price`, ³`Avg days on market`, ⁴`Clearance rate`,
    # #   ⁵`Sold this year`
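Class names such as `css-15dn4s8` look auto-generated, so they may change when the site is rebuilt. The table tag printed above also carries an `aria-describedby="market-data-context"` attribute, which might be a steadier hook. Here is a minimal sketch of selecting on that attribute. It assumes the attribute stays in place, and it runs against a hand-written fragment (`demo_page`) rather than the live page.

```r
library(rvest)

# Hypothetical fragment mimicking the table tag shown in the output above.
demo_page <- minimal_html('
  <table aria-describedby="market-data-context" class="css-15dn4s8">
    <thead><tr><th>Type</th><th>Median price</th></tr></thead>
    <tbody><tr><td>House</td><td>$1.685m</td></tr></tbody>
  </table>
')

# Select the table by attribute instead of by class.
# html_element() returns a single node, so html_table() gives one tibble
# directly and no list extraction is needed.
market <- demo_page %>%
  html_element("table[aria-describedby='market-data-context']") %>%
  html_table()

market
```

On the real page the same selector would replace `".css-15dn4s8"`, provided the attribute is still present there.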

The result is a list containing one table, which we can extract using `[[`:

    contentnodes <- page %>%
      html_nodes(".css-15dn4s8") %>%
      html_table()
    contentnodes <- contentnodes[[1]]
    as.data.frame(contentnodes)
    # *** output flushed ***
    contentnodes
    # # A tibble: 7 × 7
    #   Bed@media (max-width:720px){.css…¹ Type  Media…² Avg d…³ Clear…⁴ Sold …⁵ ``
    #   <chr>                              <chr> <chr>   <chr>   <chr>     <int> <chr>
    # 1 .css-gsqvet{stroke-linejoin:round… House $1.685m 32 days 67%          17 .css…
    # 2 3                                  House $2.242m 41 days 54%          28 Open
    # 3 4                                  House $3.4m   23 days 61%          23 Open
    # 4 5                                  House $3.745m 66 days 77%          15 Open
    # 5 1                                  Unit  $740k   -       -            11 Open
    # 6 2                                  Unit  $1.1m   46 days 74%          68 Open
    # 7 3                                  Unit  $1.825m 60 days 74%          27 Open
    # # … with abbreviated variable names
    # #   ¹`Bed@media (max-width:720px){.css-1lps80v{position:absolute !important;height:1px;width:1px;overflow:hidden;-webkit-clip:rect(1px 1px 1px 1px);clip:rect(1px 1px 1px 1px);-webkit-clip:rect(1px,1px,1px,1px);clip:rect(1px,1px,1px,1px);}}rooms`,
    # #   ²`Median price`, ³`Avg days on market`, ⁴`Clearance rate`,
    # #   ⁵`Sold this year`
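The extracted table still has some inline CSS mixed into the first column, and the prices are strings such as `$1.685m`. If you want numbers, a cleanup step along these lines may help. This is a rough sketch in base R on a hand-typed stand-in for a few rows (not scraped live). The column names `beds`, `type` and `median` and the helper `parse_price()` are made up for illustration, and the junk cell is shortened.

```r
# Hand-typed stand-in for a few rows of the extracted table.
market <- data.frame(
  beds   = c(".css-gsqvet{stroke-linejoin:round", "3", "1"),
  type   = c("House", "House", "Unit"),
  median = c("$1.685m", "$2.242m", "$740k"),
  stringsAsFactors = FALSE
)

# Cells that picked up inline CSS instead of a value become NA;
# whatever value followed the CSS in the real cell is not recovered here.
market$beds[grepl("^\\.css-", market$beds)] <- NA
market$beds <- as.integer(market$beds)

# Hypothetical helper: turn "$1.685m" / "$740k" into plain numbers.
parse_price <- function(x) {
  value <- as.numeric(gsub("[^0-9.]", "", x))
  scale <- ifelse(grepl("m$", x), 1e6, ifelse(grepl("k$", x), 1e3, 1))
  value * scale
}

market$median_price <- parse_price(market$median)
market
```

Setting the CSS cell to `NA` is a deliberate choice: it marks the value as unknown rather than guessing what the page intended to show.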

Answer 2

Score: 0

This seems to work.

    suppressPackageStartupMessages({
      library(rvest)
      library(dplyr)
    })

    url <- "https://www.domain.com.au/suburb-profile/drummoyne-nsw-2047"
    page <- read_html(url)

    contentnodes <- page %>%
      html_element("div.css-0") %>%
      html_table() %>%
      `[`(-1)  # drop the first column

    contentnodes
    #> # A tibble: 7 × 6
    #>   Type  `Median price` `Avg days on market` `Clearance rate` Sold this y…¹ ``
    #>   <chr> <chr>          <chr>                <chr>                    <int> <chr>
    #> 1 House $1.685m        32 days              67%                         17 .css…
    #> 2 House $2.242m        41 days              54%                         28 Open
    #> 3 House $3.4m          23 days              61%                         23 Open
    #> 4 House $3.745m        66 days              77%                         15 Open
    #> 5 Unit  $740k          -                    -                           11 Open
    #> 6 Unit  $1.1m          46 days              74%                         68 Open
    #> 7 Unit  $1.825m        60 days              74%                         27 Open
    #> # … with abbreviated variable name ¹`Sold this year`

Created on 2023-04-17 with reprex v2.0.2
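For anyone puzzled by the backticked `[` at the end of that pipe: it is ordinary column subsetting written as a function call, and a single negative index drops that column. A toy illustration in base R follows; `tbl` is a made-up data frame, not the scraped table.

```r
# Toy data frame standing in for the scraped table.
tbl <- data.frame(
  junk = c("a", "b"),
  Type = c("House", "Unit"),
  Sold = c(17L, 11L)
)

# Calling `[` as a function with a single negative index drops that column...
via_bracket <- `[`(tbl, -1)

# ...which is the same as the usual subsetting syntax.
via_subset <- tbl[-1]

names(via_bracket)
identical(via_bracket, via_subset)
```

Inside a pipe, `x %>% `[`(-1)` is evaluated as `` `[`(x, -1) ``, which is why the first column of the table is missing from the printed result.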

huangapple
  • Posted on 2023-04-17 15:29:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76032640.html