How to download a table from a website using rvest

Question

I want to download the station departure board from a website, but I am not getting anywhere with scraping the table. Can someone help me?

library(rvest)
library(tidyverse)

link <- "https://reiseauskunft.bahn.de/bin/bhftafel.exe/dn?ld=4391&protocol=https:&rt=1&"

html <- read_html(link)

Answer 1

Score: 2

If you open that link in a new browser tab or window, you will see only a search form and no timetable. The same is true for rvest: it must first fill in and submit the form, and the page containing the timetable is the response to that submission:

library(rvest)

link <- "https://reiseauskunft.bahn.de/bin/bhftafel.exe/dn?ld=4391&protocol=https:&rt=1&"
html <- read_html(link)

html %>%
  # find form
  html_element("#sqQueryForm") %>%
  html_form() %>%
  # fill Bahnhof / Haltestelle & submit
  html_form_set(input = "Erfurt Hbf") %>%
  html_form_submit() %>%
  # parse response
  read_html() %>%
  html_element("table.result") %>%
  html_table()
#> # A tibble: 84 × 6
#>    Zeit   ``                 Zug          Richtung / Unterwegs…¹ Gleis Aktuelles
#>    <chr>  <chr>              <chr>        <chr>                  <chr> <chr>
#>  1 früher ""                 ""           ""                     ""    ""
#>  2 17:26  ""                 "ICE  504"   "München Hbf\n\n\n\nM… "9"   "Änderun…
#>  3 17:28  ""                 "ICE 1171"   "Berlin Hbf (tief)\n\… "2"   "18:29,G…
#>  4 18:09  ""                 "ICE 1002"   "München Hbf\n\n\n\nM… "1"   "18:44,G…
#>  5 18:22  "aktuelle Uhrzeit" "aktuelle U… ""                     ""    ""
#>  6 18:22  ""                 "STR    2"   "P+R-Platz Messe, Erf… "-\n… "18:22"
#>  7 18:23  ""                 "STR    6"   "Steigerstraße, Erfur… "-\n… "18:23"
#>  8 18:23  ""                 "Bus  220"   "Busbahnhof, Sömmerda… "-\n… "18:23"
#>  9 18:24  ""                 "ICE  704"   "München Hbf\n\n\n\nM… "9"   "18:25"
#> 10 18:24  ""                 "STR    2"   "Wiesenhügel, Erfurt\… "-\n… "18:24"
#> # ℹ 74 more rows
#> # ℹ abbreviated name: ¹`Richtung / Unterwegshaltestellen`

<sup>Created on 2023-07-31 with reprex v2.0.2</sup>
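
Before submitting, it can help to look at which fields a form exposes and to confirm that `html_form_set()` changed the one you meant. The sketch below does this offline on a tiny hand-written form, so it sends no request to bahn.de. Only the form id `#sqQueryForm` and the field name `input` come from the answer above; the rest of the markup, the `start` button, and the `base_url` are made-up stand-ins.

```r
library(rvest)

# Hypothetical stand-in for the real search page (assumption: the real
# form has many more fields; only the id and the `input` name are taken
# from the answer above).
page <- minimal_html('
  <form id="sqQueryForm" action="/search" method="post">
    <input type="text" name="input" value="">
    <input type="submit" name="start" value="Suchen">
  </form>
')

form <- page %>%
  html_element("#sqQueryForm") %>%
  # base_url is a placeholder so the relative action can be resolved
  html_form(base_url = "https://example.org")

# Field names the form exposes
names(form$fields)

# Fill the station field; nothing is sent until html_form_submit() is called
filled <- html_form_set(form, input = "Erfurt Hbf")
filled$fields$input$value
```

Printing `form` itself also lists the fields, which is usually the quickest way to find the name to pass to `html_form_set()`.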

Answer 2

Score: 1

The form values are transmitted as a POST request, which means that they are not sent as part of the URL but in the request body. Interestingly, though, clicking "spaeter" triggers a GET request, and the resulting URL shows the parameters. We can use that URL to access the timetable directly:

library(rvest)
library(tidyverse)

link <- "https://reiseauskunft.bahn.de/bin/bhftafel.exe/dn?ld=4391&country=DEU&protocol=https:&rt=1&input=Erfurt%20Hbf%238010101&boardType=dep&time=18:23%2B60&productsFilter=11111&&&date=31.07.23&&selectDate=&maxJourneys=&start=yes"

response_html <- read_html(link)

response_html |>
  html_table() |>
  pluck(2)
#> # A tibble: 24 × 6
#>    Zeit   ``                 Zug          Richtung / Unterwegs…¹ Gleis Aktuelles
#>    <chr>  <chr>              <chr>        <chr>                  <chr> <chr>
#>  1 früher ""                 ""           ""                     ""    ""
#>  2 18:44  "aktuelle Uhrzeit" "aktuelle U… ""                     ""    ""
#>  3 19:23  ""                 "FLX 1246"   "Berlin Hbf (tief)\n\… "10"  "19:34"
#>  4 19:28  ""                 "ICE  502"   "Hamburg-Altona\n\n\n… "9"   "19:39,G…
#>  5 19:30  ""                 "ICE  273"   "Karlsruhe Hbf\n\n\n\… "2"   "Änderun…
#>  6 19:32  ""                 "ICE  801"   "München Hbf\n\n\n\nE… "1"   "19:32"
#>  7 19:35  ""                 "RE     7(3… "Würzburg Hbf\n\n\n\n… "3a"  ""
#>  8 19:36  ""                 "RE    17(7… "Naumburg(Saale)Hbf\n… "4"   ""
#>  9 19:38  ""                 "RB    23(8… "Saalfeld(Saale)\n\n\… "6"   "19:38"
#> 10 19:38  ""                 "RB    46(8… "Ilmenau\n\n\n\nErfur… "6"   "19:38"
#> # ℹ 14 more rows
#> # ℹ abbreviated name: ¹`Richtung / Unterwegshaltestellen`
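
Rather than pasting the long URL by hand, the query string can be assembled from named parameters so that the station, date, and time are easy to change. This is only a sketch: the parameter names are copied from the URL above, the helper `build_board_url()` is hypothetical, and it has not been checked whether the site accepts this reduced set of parameters or a percent-encoded `:` in `time`.

```r
# Hypothetical helper: builds a departure-board URL from named parameters.
# Parameter names are taken from the URL in this answer; omitting the
# remaining parameters is an untested assumption.
build_board_url <- function(station, time, date, board_type = "dep") {
  base <- "https://reiseauskunft.bahn.de/bin/bhftafel.exe/dn"
  params <- c(
    input     = station,
    boardType = board_type,
    time      = time,
    date      = date,
    start     = "yes"
  )
  # Percent-encode every value, including reserved characters such as # and +
  encoded <- vapply(params, utils::URLencode, character(1), reserved = TRUE)
  paste0(base, "?", paste(names(encoded), encoded, sep = "=", collapse = "&"))
}

url <- build_board_url("Erfurt Hbf#8010101", time = "18:23+60", date = "31.07.23")
url
```

The resulting string can be passed to `read_html()` in place of the hard-coded `link`.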

Another option would be to use the Fahrplan API. You can request an API key, and there is an R package for interacting with the Fahrplan API: openbahn.
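
Whichever route produces the table, the printed output above shows that it still contains filler rows (`früher`, `aktuelle Uhrzeit`) and cells padded with newlines. A minimal cleanup sketch with dplyr and stringr follows. The three rows are toy data modelled on that output rather than a real scrape, and the shortened column name `Richtung` is an assumption made for brevity.

```r
library(dplyr)
library(stringr)

# Toy rows imitating the scraped table (not real timetable data)
departures <- tibble(
  Zeit     = c("früher", "18:22", "18:24"),
  Zug      = c("", "aktuelle Uhrzeit", "ICE  704"),
  Richtung = c("", "", "München Hbf\n\n\n\nErfurt Hbf"),
  Gleis    = c("", "", "9")
)

clean <- departures |>
  # keep only rows that describe an actual departure
  filter(
    str_detect(Zeit, "^\\d{2}:\\d{2}$"),
    !Zug %in% c("", "aktuelle Uhrzeit")
  ) |>
  # collapse runs of spaces and newlines inside every cell
  mutate(across(everything(), str_squish))

clean
```

On a real scrape the filter conditions would need to be adapted to the actual column names and to whichever column carries the `aktuelle Uhrzeit` marker.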

huangapple
  • Posted on 2023-07-31 23:53:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76805247.html