Web scraping using rvest
Question
I would like to webscrape the following website using R.
https://www.domain.com.au/suburb-profile/drummoyne-nsw-2047
I would like to pull the data in the table underneath the "Market Trends" heading.
I've tried the following, which does not work:
url <- "https://www.domain.com.au/suburb-profile/drummoyne-nsw-2047"
page <- read_html(url)
contentnodes <- page %>%
html_nodes("div.css-15dn4s") %>%
html_attr("table") %>%
jsonlite::fromJSON()
Thanks in advance.
Answer 1
Score: 1
You need to inspect the intermediate results. If you step through the pipe one call at a time, you can see where things go wrong: the very first selector already matches nothing.
contentnodes <- page %>%
html_nodes("div.css-15dn4s")
contentnodes
# {xml_nodeset (0)}
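To see why the original pipe could not work even with a matching selector, here is a small sketch on an inline HTML snippet. The markup and the `wrapper` class are made up for illustration and are not taken from the real page. `html_attr()` reads an attribute of a node, so asking for a `table` attribute returns `NA`; the table has to be selected as an element and parsed with `html_table()`, with no JSON step involved.

```r
library(rvest)

# Hypothetical stand-in for the real page: a div wrapping a small table.
page <- minimal_html('
  <div class="wrapper">
    <table class="css-15dn4s8">
      <tr><th>Bedrooms</th><th>Type</th></tr>
      <tr><td>3</td><td>House</td></tr>
    </table>
  </div>')

# A selector that matches nothing yields an empty node set.
n_matches <- length(html_elements(page, "div.css-15dn4s"))

# html_attr() looks up an attribute; the div has no "table" attribute.
attr_value <- page %>%
  html_elements("div.wrapper") %>%
  html_attr("table")

# html_table() parses the <table> element itself into a tibble.
tbl <- page %>%
  html_element("table.css-15dn4s8") %>%
  html_table()
```

Printing `n_matches`, `attr_value`, and `tbl` in turn mirrors the step-by-step check described above.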
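Let's find it a different way. The class on the page is `css-15dn4s8` (note the trailing `8`), and it sits on a `table` element rather than a `div`. To select by HTML class, use a string with a leading `.`: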
page %>%
html_nodes(".css-15dn4s8")
# {xml_nodeset (1)}
# [1] <table aria-describedby="market-data-context" class="css-15dn4s8">\n<thea ...
page %>%
html_nodes(".css-15dn4s8") %>%
html_table()
# [[1]]
# # A tibble: 7 × 7
# Bed@media (max-width:720px){.css…¹ Type Media…² Avg d…³ Clear…⁴ Sold …⁵ ``
# <chr> <chr> <chr> <chr> <chr> <int> <chr>
# 1 .css-gsqvet{stroke-linejoin:round… House $1.685m 32 days 67% 17 .css…
# 2 3 House $2.242m 41 days 54% 28 Open
# 3 4 House $3.4m 23 days 61% 23 Open
# 4 5 House $3.745m 66 days 77% 15 Open
# 5 1 Unit $740k - - 11 Open
# 6 2 Unit $1.1m 46 days 74% 68 Open
# 7 3 Unit $1.825m 60 days 74% 27 Open
# # … with abbreviated variable names
# # ¹`Bed@media (max-width:720px){.css-1lps80v{position:absolute !important;height:1px;width:1px;overflow:hidden;-webkit-clip:rect(1px 1px 1px 1px);clip:rect(1px 1px 1px 1px);-webkit-clip:rect(1px,1px,1px,1px);clip:rect(1px,1px,1px,1px);}}rooms`,
# # ²`Median price`, ³`Avg days on market`, ⁴`Clearance rate`,
# # ⁵`Sold this year`
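Class names such as `css-15dn4s8` look machine-generated and may change when the site is rebuilt (an assumption, not something verified here). As a possibly more stable alternative, the table could be selected by the `aria-describedby` attribute visible in the node printout above. A minimal sketch on made-up inline HTML:

```r
library(rvest)

# Made-up stand-in for the page; only the attribute value and class name
# are copied from the node printout in this answer.
page <- minimal_html('
  <table aria-describedby="market-data-context" class="css-15dn4s8">
    <tr><th>Type</th><th>Sold this year</th></tr>
    <tr><td>House</td><td>28</td></tr>
  </table>')

# Attribute selector: does not depend on the generated class name.
by_attr <- page %>%
  html_element("table[aria-describedby='market-data-context']") %>%
  html_table()

# Class selector, for comparison.
by_class <- page %>%
  html_element(".css-15dn4s8") %>%
  html_table()
```

Because `html_element()` returns a single node, `html_table()` gives back the tibble directly here rather than a list of tibbles.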
Calling html_table() on a node set returns a list, here containing a single table. We can extract it using `[[`:
contentnodes <- page %>%
html_nodes(".css-15dn4s8") %>%
html_table()
contentnodes <- contentnodes[[1]]
as.data.frame(contentnodes)
# *** output flushed ***
contentnodes
# # A tibble: 7 × 7
# Bed@media (max-width:720px){.css…¹ Type Media…² Avg d…³ Clear…⁴ Sold …⁵ ``
# <chr> <chr> <chr> <chr> <chr> <int> <chr>
# 1 .css-gsqvet{stroke-linejoin:round… House $1.685m 32 days 67% 17 .css…
# 2 3 House $2.242m 41 days 54% 28 Open
# 3 4 House $3.4m 23 days 61% 23 Open
# 4 5 House $3.745m 66 days 77% 15 Open
# 5 1 Unit $740k - - 11 Open
# 6 2 Unit $1.1m 46 days 74% 68 Open
# 7 3 Unit $1.825m 60 days 74% 27 Open
# # … with abbreviated variable names
# # ¹`Bed@media (max-width:720px){.css-1lps80v{position:absolute !important;height:1px;width:1px;overflow:hidden;-webkit-clip:rect(1px 1px 1px 1px);clip:rect(1px 1px 1px 1px);-webkit-clip:rect(1px,1px,1px,1px);clip:rect(1px,1px,1px,1px);}}rooms`,
# # ²`Median price`, ³`Avg days on market`, ⁴`Clearance rate`,
# # ⁵`Sold this year`
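The odd first column name and the CSS text in the first row appear to come from `<style>` elements embedded in the table cells, whose text `html_table()` picks up along with the cell content. One possible cleanup, sketched on made-up inline HTML, is to remove those nodes with `xml2::xml_remove()` before parsing. Whether this fully cleans the real page is untested here.

```r
library(rvest)

# Made-up table with a <style> element inside a header cell, imitating
# the pattern suggested by the output above.
page <- minimal_html('
  <table class="css-15dn4s8">
    <tr><th>Bed<style>.x{color:red}</style>rooms</th><th>Type</th></tr>
    <tr><td>3</td><td>House</td></tr>
  </table>')

# Parsed as-is, the style text ends up inside the header.
dirty <- page %>% html_element("table") %>% html_table()

# Remove every <style> node from the parsed document (modifies it in place).
xml2::xml_remove(html_elements(page, "style"))

# Parse again now that the style nodes are gone.
clean <- page %>% html_element("table") %>% html_table()
names(clean)
```

Note that `xml_remove()` changes `page` itself, so the cleanup has to happen before any call to `html_table()` whose result you intend to keep.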
Answer 2
Score: 0
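This seems to work. It selects the first `div.css-0` element, parses the table inside it, and drops the first column (the one whose header is polluted with inline CSS) using `[`(-1).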
suppressPackageStartupMessages({
library(rvest)
library(dplyr)
})
url <- "https://www.domain.com.au/suburb-profile/drummoyne-nsw-2047"
page <- read_html(url)
contentnodes <- page %>%
html_element("div.css-0") %>%
html_table() %>%
`[`(-1)
contentnodes
#> # A tibble: 7 × 6
#> Type `Median price` `Avg days on market` `Clearance rate` Sold this y…¹ ``
#> <chr> <chr> <chr> <chr> <int> <chr>
#> 1 House $1.685m 32 days 67% 17 .css…
#> 2 House $2.242m 41 days 54% 28 Open
#> 3 House $3.4m 23 days 61% 23 Open
#> 4 House $3.745m 66 days 77% 15 Open
#> 5 Unit $740k - - 11 Open
#> 6 Unit $1.1m 46 days 74% 68 Open
#> 7 Unit $1.825m 60 days 74% 27 Open
#> # … with abbreviated variable name ¹`Sold this year`
Created on 2023-04-17 with reprex v2.0.2
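For readers unfamiliar with the `` `[`(-1) `` idiom: inside a pipe it is just ordinary single-bracket indexing, so it drops the first column of a data frame or tibble. A small sketch with a made-up data frame (the values are illustrative only):

```r
library(rvest)  # attached only for the %>% pipe it re-exports

# Made-up data frame standing in for the scraped table.
df <- data.frame(
  Bedrooms = c(3, 4),
  Type     = c("House", "House"),
  Price    = c("$2.242m", "$3.4m")
)

# Piped form, as used in the answer above.
dropped <- df %>% `[`(-1)

# Equivalent plain indexing.
same <- df[-1]

names(dropped)
```

If dplyr is already loaded, `select(-1)` would be a more readable way to express the same step.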