How to download a table from a website using rvest
Question
I want to download a station departure table from a website. However, I'm not getting anywhere with scraping the table. Can someone help me?
library(rvest)
library(tidyverse)
link <- "https://reiseauskunft.bahn.de/bin/bhftafel.exe/dn?ld=4391&protocol=https:&rt=1&"
html <- read_html(link)
Answer 1
Score: 2
If you open that link in a new browser tab or window, you will also end up with just a search form and no timetable. It's the same for rvest: it must first fill in and submit the form, because the page with the timetable is the response to the search form:
library(rvest)
link <- "https://reiseauskunft.bahn.de/bin/bhftafel.exe/dn?ld=4391&protocol=https:&rt=1&"
html <- read_html(link)
html %>%
  # find form
  html_element("#sqQueryForm") %>%
  html_form() %>%
  # fill Bahnhof / Haltestelle & submit
  html_form_set(input = "Erfurt Hbf") %>%
  html_form_submit() %>%
  # parse response
  read_html() %>%
  html_element("table.result") %>%
  html_table()
#> # A tibble: 84 × 6
#> Zeit `` Zug Richtung / Unterwegs…¹ Gleis Aktuelles
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 früher "" "" "" "" ""
#> 2 17:26 "" "ICE 504" "München Hbf\n\n\n\nM… "9" "Änderun…
#> 3 17:28 "" "ICE 1171" "Berlin Hbf (tief)\n\… "2" "18:29,G…
#> 4 18:09 "" "ICE 1002" "München Hbf\n\n\n\nM… "1" "18:44,G…
#> 5 18:22 "aktuelle Uhrzeit" "aktuelle U… "" "" ""
#> 6 18:22 "" "STR 2" "P+R-Platz Messe, Erf… "-\n… "18:22"
#> 7 18:23 "" "STR 6" "Steigerstraße, Erfur… "-\n… "18:23"
#> 8 18:23 "" "Bus 220" "Busbahnhof, Sömmerda… "-\n… "18:23"
#> 9 18:24 "" "ICE 704" "München Hbf\n\n\n\nM… "9" "18:25"
#> 10 18:24 "" "STR 2" "Wiesenhügel, Erfurt\… "-\n… "18:24"
#> # ℹ 74 more rows
#> # ℹ abbreviated name: ¹`Richtung / Unterwegshaltestellen`
Created on 2023-07-31 with reprex v2.0.2
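To see which field names `html_form_set()` will accept before submitting anything, it can help to inspect the parsed form object. The following is only an offline sketch: the page is made up with `minimal_html()`, and only the form id `sqQueryForm` and the field name `input` are taken from the code above. The action URL, the submit button, and the POST method of this toy form are assumptions for illustration.

```r
library(rvest)

# A made-up page with a search form. Only the form id and the field name
# "input" mirror the answer above; everything else is invented.
page <- minimal_html('
  <form id="sqQueryForm" method="post" action="https://example.com/search">
    <input type="text" name="input" value="">
    <input type="submit" name="start" value="Suchen">
  </form>
')

form <- page %>%
  html_element("#sqQueryForm") %>%
  html_form()

# Field names that html_form_set() can fill in
field_names <- names(form$fields)

# Filling a field only changes the local form object; nothing is sent
# until html_form_submit() is called
filled <- html_form_set(form, input = "Erfurt Hbf")

print(field_names)
print(filled$method)
print(filled$fields$input$value)
```

On the real page the same inspection would show the actual field names, so you could check that `input` is the right one for the station name.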
Answer 2
Score: 1
The form values are transmitted in a POST request, which means they are sent in the request body rather than as part of the URL. Interestingly, though, clicking "spaeter" (later) triggers a GET request, and the resulting URL contains the parameters. We can use that URL to access the timetable directly:
library(rvest)
library(tidyverse)
link <- "https://reiseauskunft.bahn.de/bin/bhftafel.exe/dn?ld=4391&country=DEU&protocol=https:&rt=1&input=Erfurt%20Hbf%238010101&boardType=dep&time=18:23%2B60&productsFilter=11111&&&date=31.07.23&&selectDate=&maxJourneys=&start=yes"
response_html <- read_html(link)
response_html |>
  html_table() |>
  pluck(2)
#> # A tibble: 24 × 6
#> Zeit `` Zug Richtung / Unterwegs…¹ Gleis Aktuelles
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 früher "" "" "" "" ""
#> 2 18:44 "aktuelle Uhrzeit" "aktuelle U… "" "" ""
#> 3 19:23 "" "FLX 1246" "Berlin Hbf (tief)\n\… "10" "19:34"
#> 4 19:28 "" "ICE 502" "Hamburg-Altona\n\n\n… "9" "19:39,G…
#> 5 19:30 "" "ICE 273" "Karlsruhe Hbf\n\n\n\… "2" "Änderun…
#> 6 19:32 "" "ICE 801" "München Hbf\n\n\n\nE… "1" "19:32"
#> 7 19:35 "" "RE 7(3… "Würzburg Hbf\n\n\n\n… "3a" ""
#> 8 19:36 "" "RE 17(7… "Naumburg(Saale)Hbf\n… "4" ""
#> 9 19:38 "" "RB 23(8… "Saalfeld(Saale)\n\n\… "6" "19:38"
#> 10 19:38 "" "RB 46(8… "Ilmenau\n\n\n\nErfur… "6" "19:38"
#> # ℹ 14 more rows
#> # ℹ abbreviated name: ¹`Richtung / Unterwegshaltestellen`
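If you want to vary the station or time, the query string can be assembled from a named list instead of being pasted by hand. This is a rough sketch in base R: `build_board_url()` is a hypothetical helper, the parameter names and values are copied from the URL above, and what each parameter means to the server is an assumption. Note that `URLencode(reserved = TRUE)` also encodes the colons, so the string is not byte-for-byte identical to the URL above.

```r
# Hypothetical helper: assemble a GET URL for the station board from a named
# list of query parameters. Parameter names are copied from the URL above;
# their exact meaning is an assumption.
build_board_url <- function(params,
                            base = "https://reiseauskunft.bahn.de/bin/bhftafel.exe/dn") {
  # Percent-encode every value, including reserved characters such as "#" and "+"
  encoded <- vapply(
    params,
    function(value) utils::URLencode(as.character(value), reserved = TRUE),
    character(1)
  )
  paste0(base, "?", paste0(names(params), "=", encoded, collapse = "&"))
}

params <- list(
  ld = "4391",
  country = "DEU",
  rt = "1",
  input = "Erfurt Hbf#8010101",
  boardType = "dep",
  time = "18:23+60",
  productsFilter = "11111",
  date = "31.07.23",
  start = "yes"
)

url <- build_board_url(params)
print(url)
```

The resulting `url` could then be passed to `read_html()` in the same way as `link` above.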
Another option would be to use the Fahrplan API. You can request an API key, and there is an R package for interacting with the Fahrplan API: openbahn.