Webscraping using Rvest

Question

I would like to scrape the following website using R:

https://www.domain.com.au/suburb-profile/drummoyne-nsw-2047

I would like to pull the data in the table underneath the "Market Trends" heading.

I've tried the following, which does not work:

    library(rvest)

    url <- "https://www.domain.com.au/suburb-profile/drummoyne-nsw-2047"
    page <- read_html(url)
    contentnodes <- page %>%
      html_nodes("div.css-15dn4s") %>%
      html_attr("table") %>%
      jsonlite::fromJSON()

Thanks in advance.

Answer 1

Score: 1

You need to inspect the intermediate results. If you step through your pipe one stage at a time, you can see where things go wrong.

    contentnodes <- page %>%
      html_nodes("div.css-15dn4s")
    contentnodes
    # {xml_nodeset (0)}
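
As a minimal, self-contained sketch of stepping through the pipe, the snippet below uses a small inline HTML document instead of the live page. The markup is an assumption modelled on the output shown in this answer, not a copy of the site, and it relies on `minimal_html()` and `html_elements()` from rvest 1.0 or later.

```r
library(rvest)

# Hypothetical stand-in for the page: one table carrying the class seen in the answer.
page <- minimal_html('
  <div class="wrapper">
    <table class="css-15dn4s8">
      <tr><th>Type</th><th>Median price</th></tr>
      <tr><td>House</td><td>$1.685m</td></tr>
    </table>
  </div>
')

# Stage 1: run only the first step of the original pipe and inspect it.
wrong <- page %>% html_elements("div.css-15dn4s")
length(wrong)   # a length of zero means the selector matched nothing

# Stage 2: try a selector for a <table> with the full class name.
right <- page %>% html_elements("table.css-15dn4s8")
length(right)

# html_attr() reads the value of an attribute; it does not descend into
# child elements, so asking for a "table" attribute yields NA here.
html_attr(right, "table")
```

Checking `length()` after each stage is a cheap way to find the first step that returns nothing, before any later function fails with a less obvious error.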

The first stage already returns an empty node set, so nothing downstream can succeed. Let's find it a different way. In a CSS selector, a leading `.` matches a class name, and the output below shows that the element is a `<table>` whose class is `css-15dn4s8`, not a `<div>` with class `css-15dn4s`:

    page %>%
      html_nodes(".css-15dn4s8")
    # {xml_nodeset (1)}
    # [1] <table aria-describedby="market-data-context" class="css-15dn4s8">\n<thea ...

    page %>%
      html_nodes(".css-15dn4s8") %>%
      html_table()
    # [[1]]
    # # A tibble: 7 × 7
    # Bed@media (max-width:720px){.css…¹ Type Media…² Avg d…³ Clear…⁴ Sold …⁵ ``
    # <chr> <chr> <chr> <chr> <chr> <int> <chr>
    # 1 .css-gsqvet{stroke-linejoin:round… House $1.685m 32 days 67% 17 .css…
    # 2 3 House $2.242m 41 days 54% 28 Open
    # 3 4 House $3.4m 23 days 61% 23 Open
    # 4 5 House $3.745m 66 days 77% 15 Open
    # 5 1 Unit $740k - - 11 Open
    # 6 2 Unit $1.1m 46 days 74% 68 Open
    # 7 3 Unit $1.825m 60 days 74% 27 Open
    # # … with abbreviated variable names
    # # ¹`Bed@media (max-width:720px){.css-1lps80v{position:absolute !important;height:1px;width:1px;overflow:hidden;-webkit-clip:rect(1px 1px 1px 1px);clip:rect(1px 1px 1px 1px);-webkit-clip:rect(1px,1px,1px,1px);clip:rect(1px,1px,1px,1px);}}rooms`,
    # # ²`Median price`, ³`Avg days on market`, ⁴`Clearance rate`,
    # # ⁵`Sold this year`
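
Class names such as `css-15dn4s8` look machine-generated and may change when the site is rebuilt; that is an assumption, not something the thread confirms. As a hedged alternative, the sketch below selects the table by the `aria-describedby="market-data-context"` attribute visible in the output above. The inline HTML is a hypothetical stand-in for the real page.

```r
library(rvest)

# Hypothetical markup reusing the attribute and class shown in the answer's output.
page <- minimal_html('
  <table aria-describedby="market-data-context" class="css-15dn4s8">
    <thead><tr><th>Type</th><th>Median price</th><th>Sold this year</th></tr></thead>
    <tbody>
      <tr><td>House</td><td>$1.685m</td><td>17</td></tr>
      <tr><td>Unit</td><td>$740k</td><td>11</td></tr>
    </tbody>
  </table>
')

# A CSS attribute selector matches on the attribute value instead of the class.
market <- page %>%
  html_element("table[aria-describedby='market-data-context']") %>%
  html_table()

market
```

Because `html_element()` returns a single node rather than a node set, `html_table()` gives back one tibble directly and no `[[1]]` is needed.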

Calling `html_table()` on a node set returns a list of tables. Here the list holds a single table, which we can extract using `[[`:

    contentnodes <- page %>%
      html_nodes(".css-15dn4s8") %>%
      html_table()
    contentnodes <- contentnodes[[1]]
    as.data.frame(contentnodes)
    # *** output flushed ***
    contentnodes
    # # A tibble: 7 × 7
    # Bed@media (max-width:720px){.css…¹ Type Media…² Avg d…³ Clear…⁴ Sold …⁵ ``
    # <chr> <chr> <chr> <chr> <chr> <int> <chr>
    # 1 .css-gsqvet{stroke-linejoin:round… House $1.685m 32 days 67% 17 .css…
    # 2 3 House $2.242m 41 days 54% 28 Open
    # 3 4 House $3.4m 23 days 61% 23 Open
    # 4 5 House $3.745m 66 days 77% 15 Open
    # 5 1 Unit $740k - - 11 Open
    # 6 2 Unit $1.1m 46 days 74% 68 Open
    # 7 3 Unit $1.825m 60 days 74% 27 Open
    # # … with abbreviated variable names
    # # ¹`Bed@media (max-width:720px){.css-1lps80v{position:absolute !important;height:1px;width:1px;overflow:hidden;-webkit-clip:rect(1px 1px 1px 1px);clip:rect(1px 1px 1px 1px);-webkit-clip:rect(1px,1px,1px,1px);clip:rect(1px,1px,1px,1px);}}rooms`,
    # # ²`Median price`, ³`Avg days on market`, ⁴`Clearance rate`,
    # # ⁵`Sold this year`
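
The scraped table still has a CSS-polluted first header and keeps prices and percentages as text. The following base R sketch shows one possible clean-up. The data frame is a hand-typed stand-in for three rows of the output above, the placeholder header is shortened, and `parse_price()` and `parse_percent()` are hypothetical helpers written for this example, not rvest functions.

```r
# Hypothetical stand-in for three rows of the scraped table.
market <- data.frame(
  bed   = c("3", "4", "1"),
  type  = c("House", "House", "Unit"),
  price = c("$2.242m", "$3.4m", "$740k"),
  clear = c("54%", "61%", "-"),
  stringsAsFactors = FALSE
)
# Mimic the polluted first header with a shortened placeholder.
names(market) <- c("Bed{css}rooms", "Type", "Median price", "Clearance rate")

# Replace the headers by position, which sidesteps the unusable first name.
names(market) <- c("bedrooms", "type", "median_price", "clearance_rate")

# Convert strings such as "$2.242m" or "$740k" to numbers (hypothetical helper).
parse_price <- function(x) {
  value <- as.numeric(gsub("[^0-9.]", "", x))
  scale <- ifelse(grepl("m$", x), 1e6, ifelse(grepl("k$", x), 1e3, 1))
  value * scale
}

# Convert "54%" to 0.54; a bare "-" has no number and becomes NA (hypothetical helper).
parse_percent <- function(x) {
  suppressWarnings(as.numeric(sub("%$", "", x))) / 100
}

market$bedrooms       <- as.integer(market$bedrooms)
market$median_price   <- parse_price(market$median_price)
market$clearance_rate <- parse_percent(market$clearance_rate)

str(market)
```

Renaming by position assumes the column order stays as shown in the output; if the site reorders its columns, the names would need to be checked again.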

Answer 2

Score: 0

This seems to work. It selects the `div.css-0` container, parses the table inside it, and drops the first column, whose header is polluted by inline CSS.

    suppressPackageStartupMessages({
      library(rvest)
      library(dplyr)
    })

    url <- "https://www.domain.com.au/suburb-profile/drummoyne-nsw-2047"
    page <- read_html(url)
    contentnodes <- page %>%
      html_element("div.css-0") %>%
      html_table() %>%
      `[`(-1)
    contentnodes
    #> # A tibble: 7 × 6
    #> Type `Median price` `Avg days on market` `Clearance rate` Sold this y…¹ ``
    #> <chr> <chr> <chr> <chr> <int> <chr>
    #> 1 House $1.685m 32 days 67% 17 .css…
    #> 2 House $2.242m 41 days 54% 28 Open
    #> 3 House $3.4m 23 days 61% 23 Open
    #> 4 House $3.745m 66 days 77% 15 Open
    #> 5 Unit $740k - - 11 Open
    #> 6 Unit $1.1m 46 days 74% 68 Open
    #> 7 Unit $1.825m 60 days 74% 27 Open
    #> # … with abbreviated variable name ¹`Sold this year`

Created on 2023-04-17 with reprex v2.0.2
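
The `[` call at the end of that pipe is ordinary column indexing written as a function call. The sketch below illustrates this on a hypothetical inline table rather than the live page; the `div.css-0 table` selector and the column names are assumptions made for the example.

```r
library(rvest)

# Hypothetical markup: a table nested inside a div with class "css-0".
page <- minimal_html('
  <div class="css-0">
    <table>
      <tr><th>Bedrooms</th><th>Type</th><th>Median price</th></tr>
      <tr><td>3</td><td>House</td><td>$2.242m</td></tr>
      <tr><td>1</td><td>Unit</td><td>$740k</td></tr>
    </table>
  </div>
')

tbl <- page %>%
  html_element("div.css-0 table") %>%
  html_table()

piped   <- tbl %>% `[`(-1)                        # the form used in the answer
bracket <- tbl[-1]                                # the same call in ordinary syntax
by_name <- tbl[setdiff(names(tbl), "Bedrooms")]   # drop the column by name instead

identical(piped, bracket)
names(by_name)
```

Dropping by name is a little more robust than dropping by position when the column order might change, at the cost of needing a usable column name.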

Posted by huangapple on 2023-04-17 15:29:38
Source: https://go.coder-hub.com/76032640.html