How to download a table from a website using rvest

Question

I want to download a station departure board from a website, but I am not getting anywhere with scraping the table. Can someone help me?

    library(rvest)
    library(tidyverse)

    link <- "https://reiseauskunft.bahn.de/bin/bhftafel.exe/dn?ld=4391&protocol=https:&rt=1&"
    html <- read_html(link)

Answer 1

Score: 2

If you open that link in a new browser tab or window, you also end up with just a search form and no timetable. The same holds for rvest: it must first fill in and submit the form, and the page containing the timetable is the response to that submission:

    library(rvest)

    link <- "https://reiseauskunft.bahn.de/bin/bhftafel.exe/dn?ld=4391&protocol=https:&rt=1&"
    html <- read_html(link)

    html %>%
      # find the search form
      html_element("#sqQueryForm") %>%
      html_form() %>%
      # fill in "Bahnhof / Haltestelle" (station / stop) and submit
      html_form_set(input = "Erfurt Hbf") %>%
      html_form_submit() %>%
      # parse the response
      read_html() %>%
      html_element("table.result") %>%
      html_table()
    #> # A tibble: 84 × 6
    #>    Zeit   ``                 Zug          Richtung / Unterwegs…¹ Gleis Aktuelles
    #>    <chr>  <chr>              <chr>        <chr>                  <chr> <chr>
    #>  1 früher ""                 ""           ""                     ""    ""
    #>  2 17:26  ""                 "ICE 504"    "München Hbf\n\n\n\nM… "9"   "Änderun…
    #>  3 17:28  ""                 "ICE 1171"   "Berlin Hbf (tief)\n\… "2"   "18:29,G…
    #>  4 18:09  ""                 "ICE 1002"   "München Hbf\n\n\n\nM… "1"   "18:44,G…
    #>  5 18:22  "aktuelle Uhrzeit" "aktuelle U… ""                     ""    ""
    #>  6 18:22  ""                 "STR 2"      "P+R-Platz Messe, Erf… "-\n… "18:22"
    #>  7 18:23  ""                 "STR 6"      "Steigerstraße, Erfur… "-\n… "18:23"
    #>  8 18:23  ""                 "Bus 220"    "Busbahnhof, Sömmerda… "-\n… "18:23"
    #>  9 18:24  ""                 "ICE 704"    "München Hbf\n\n\n\nM… "9"   "18:25"
    #> 10 18:24  ""                 "STR 2"      "Wiesenhügel, Erfurt\… "-\n… "18:24"
    #> # ℹ 74 more rows
    #> # ℹ abbreviated name: ¹`Richtung / Unterwegshaltestellen`

<sup>Created on 2023-07-31 with reprex v2.0.2</sup>

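Before filling in a form it can help to look at which fields it exposes. The sketch below does that offline on a small made-up page, so it runs without network access. Only the id `sqQueryForm` and the field name `input` are taken from the answer; the other fields, the `action`, and the `example.org` base URL are invented for illustration. Reading `form$fields`, `form$method` and `form$action` relies on how rvest 1.0.x structures the object returned by `html_form()`, which may differ in other versions.

```r
library(rvest)

# Hypothetical stand-in for the search page: only the id "sqQueryForm" and the
# field name "input" come from the answer; everything else is made up.
page <- minimal_html('
  <form id="sqQueryForm" method="post" action="/search">
    <input type="text" name="input">
    <input type="hidden" name="boardType" value="dep">
    <input type="submit" name="start" value="Suchen">
  </form>
')

form <- page %>%
  html_element("#sqQueryForm") %>%
  html_form(base_url = "https://example.org")

# Which fields can be set, and how would the form be sent?
names(form$fields)
form$method
form$action

# Fill in the station field; nothing is submitted here.
filled <- html_form_set(form, input = "Erfurt Hbf")
filled$fields$input$value
```

On the real page, printing the result of `html_form()` in the same way should show the field names that `html_form_set()` accepts.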

Answer 2

Score: 1

The form values are transmitted in a POST request, which means they are not part of the URL but are sent in the request body. Interestingly, though, clicking on "spaeter" triggers a GET request, and the resulting URL carries the parameters. We can use that URL to access the timetable directly:

    library(rvest)
    library(tidyverse)

    link <- "https://reiseauskunft.bahn.de/bin/bhftafel.exe/dn?ld=4391&country=DEU&protocol=https:&rt=1&input=Erfurt%20Hbf%238010101&boardType=dep&time=18:23%2B60&productsFilter=11111&&&date=31.07.23&&selectDate=&maxJourneys=&start=yes"
    response_html <- read_html(link)

    response_html |>
      # html_table() on a whole document returns a list with one tibble per table
      html_table() |>
      # the timetable is the second table on the page
      pluck(2)
    #> # A tibble: 24 × 6
    #>    Zeit   ``                 Zug          Richtung / Unterwegs…¹ Gleis Aktuelles
    #>    <chr>  <chr>              <chr>        <chr>                  <chr> <chr>
    #>  1 früher ""                 ""           ""                     ""    ""
    #>  2 18:44  "aktuelle Uhrzeit" "aktuelle U… ""                     ""    ""
    #>  3 19:23  ""                 "FLX 1246"   "Berlin Hbf (tief)\n\… "10"  "19:34"
    #>  4 19:28  ""                 "ICE 502"    "Hamburg-Altona\n\n\n… "9"   "19:39,G…
    #>  5 19:30  ""                 "ICE 273"    "Karlsruhe Hbf\n\n\n\… "2"   "Änderun…
    #>  6 19:32  ""                 "ICE 801"    "München Hbf\n\n\n\nE… "1"   "19:32"
    #>  7 19:35  ""                 "RE 7(3…     "Würzburg Hbf\n\n\n\n… "3a"  ""
    #>  8 19:36  ""                 "RE 17(7…    "Naumburg(Saale)Hbf\n… "4"   ""
    #>  9 19:38  ""                 "RB 23(8…    "Saalfeld(Saale)\n\n\… "6"   "19:38"
    #> 10 19:38  ""                 "RB 46(8…    "Ilmenau\n\n\n\nErfur… "6"   "19:38"
    #> # ℹ 14 more rows
    #> # ℹ abbreviated name: ¹`Richtung / Unterwegshaltestellen`

Another option is the Fahrplan API. You can request an API key, and there is an R package for interacting with the Fahrplan API: openbahn.

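The GET URL in Answer 2 can also be assembled from its parts rather than pasted as one long string, which makes it easier to change the station or time. This is a minimal sketch in base R: the parameter names and values are copied from that URL, but what each parameter means is only inferred from its name, and whether the server needs all of them is not known. The empty parameters from the original URL are left out.

```r
# Parameter names and values are copied from the URL in Answer 2;
# the comments on their meaning are guesses based on the names.
base_url <- "https://reiseauskunft.bahn.de/bin/bhftafel.exe/dn"
params <- c(
  ld             = "4391",
  country        = "DEU",
  protocol       = "https:",
  rt             = "1",
  input          = "Erfurt Hbf#8010101",  # station name plus what looks like a station id
  boardType      = "dep",                 # presumably "departures"
  time           = "18:23+60",
  productsFilter = "11111",
  date           = "31.07.23",
  start          = "yes"
)

# Percent-encode every value, including reserved characters such as "#" and "+".
encoded <- vapply(params, URLencode, character(1), reserved = TRUE)
query <- paste0(names(encoded), "=", encoded, collapse = "&")
link <- paste0(base_url, "?", query)
link

# With network access, the result could then be read as in the answer:
# rvest::read_html(link) |> rvest::html_table()
```

With `reserved = TRUE` the colon in the time is encoded as well (`%3A`), so the string differs slightly from the URL in the answer while decoding to the same values.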

huangapple
  • Posted on 2023-07-31 23:53:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76805247.html