rvest - navigate site and download Canada hydrometric data

Question

I am creating an R function that takes a station number, navigates the Government of Canada historical hydrometric data search (wateroffice.ec.gc.ca), and downloads all data for that station. I'm running into a few problems, which may be caused by the radio buttons and/or by the search button being unnamed. This is what I have:

library(rvest)

station_number <- "08NM083"
url <- "https://wateroffice.ec.gc.ca/search/historical_e.html"
user_a <- httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 12_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36")

my_session <- session(url, user_a)

form <- html_form(my_session)[[2]]

which gives:

<form> 'search-form' (GET https://wateroffice.ec.gc.ca/search/historical_results_e.html)
  <field> (submit) : Search
  <field> (radio) search_type: station_name
  <field> (text) station_name: 
  <field> (radio) search_type: station_number
  <field> (text) station_number: 
  <field> (radio) search_type: province
  <field> (select) province: AB
  <field> (radio) search_type: basin
  <field> (select) basin: 
  <field> (radio) search_type: region
  <field> (select) region: ATL
  <field> (radio) search_type: coordinate
  <field> (number) north_degrees: 
  <field> (number) north_minutes: 
  <field> (number) north_seconds: 
  <field> (number) south_degrees: 
  <field> (number) south_minutes: 
  <field> (number) south_seconds: 
  <field> (number) east_degrees: 
  <field> (number) east_minutes: 
  <field> (number) east_seconds: 
  <field> (number) west_degrees: 
  <field> (number) west_minutes: 
  <field> (number) west_seconds: 
  <field> (select) parameter_type: all
  <field> (number) start_year: 1850
  <field> (number) end_year: 2023
  <field> (number) minimum_years: 
  <field> (checkbox) latest_year: Y
  <field> (select) regulation: all
  <field> (select) station_status: all
  <field> (select) operation_schedule: 
  <field> (select) contributing_agency: all
  <field> (select) gross_drainage_operator: >
  <field> (number) gross_drainage_area: 
  <field> (select) effective_drainage_operator: >
  <field> (number) effective_drainage_area: 
  <field> (select) sediment: ---
  <field> (select) real_time: ---
  <field> (select) rhbn: ---
  <field> (select) contributed: ---
  <field> (submit) : Search

When I fill out the form and submit, however, nothing seems to have changed.

filled <- form %>%
  html_form_set(station_number = station_number, 
                search_type = "station_number")

resp <- session_submit(x = my_session, form = filled)

my_session and resp:

> my_session
<session> https://wateroffice.ec.gc.ca/search/historical_e.html
  Status: 200
  Type:   text/html; charset=UTF-8
  Size:   45034
> resp
<session> https://wateroffice.ec.gc.ca/search/historical_e.html
  Status: 200
  Type:   text/html; charset=UTF-8
  Size:   45284

Any suggestions are welcomed!

Edit

kaliiiiiiiii's suggestion of pasting the station number into the URL has worked wonderfully for this part of my problem! I still cannot figure out how to download the CSV file.

Current code:

station_number <- "08NM083"
url <- paste0("https://wateroffice.ec.gc.ca/search/historical_results_e.html?search_type=station_number&station_number=", 
              station_number, 
              "&start_year=1850&end_year=2023&minimum_years=&gross_drainage_operator=%3E&gross_drainage_area=&effective_drainage_operator=%3E&effective_drainage_area=")
user_a <- httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 12_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36")

my_session <- session(url, user_a)

form <- html_form(my_session)[[2]]

filled <- form %>%
  html_form_set(check_all = "all")

resp <- session_submit(x = my_session, form = filled, submit = "download")
resp

link <- resp %>%
  read_html() %>%
  html_element("p+ section .col-lg-4:nth-child(1) a") %>%
  html_attr("href")

full_link <- url_absolute(link, url)

And my attempts at downloading the file:

download.file(full_link, destfile = "Downloads/test_hydat.csv")
test <- read_csv(full_link)

Both attempts return only HTML, not the CSV data.
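
One way to confirm what was actually downloaded is to inspect the first bytes of the response before trying to parse it. The sketch below is only a diagnostic aid under stated assumptions: `looks_like_html()` is a hypothetical helper (it is not part of rvest or httr), and it simply checks whether the content starts with an HTML doctype or `<html` tag.

```r
# Hypothetical diagnostic helper (not part of rvest or httr): returns TRUE when
# the content appears to be an HTML page rather than CSV data.
looks_like_html <- function(x) {
  # Accept either raw response bytes or a character string; only the first
  # 512 bytes/characters are inspected.
  if (is.raw(x)) {
    head_text <- rawToChar(utils::head(x, 512L))
  } else {
    head_text <- substr(x[1], 1L, 512L)
  }
  # Leading whitespace is allowed before the doctype or the <html tag.
  grepl("^\\s*<(!doctype\\s+html|html)", head_text,
        ignore.case = TRUE, perl = TRUE, useBytes = TRUE)
}
```

Applied to the raw body of a response (for example `resp$response$content` from the code above, assuming the session object exposes the underlying httr response that way), a `TRUE` result would suggest the request landed on a web page rather than on the file itself.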


Answer 1

Score: 0

Why not just use the API directly:

curl 'https://wateroffice.ec.gc.ca/services/map_data?data_type=historical' \
  -H 'Accept: */*' \
  -H 'Accept-Language: de,de-DE;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6,fr;q=0.5,de-CH;q=0.4,es;q=0.3' \
  -H 'Cache-Control: no-cache' \
  -H 'Connection: keep-alive' \
  -H 'DNT: 1' \
  -H 'Pragma: no-cache' \
  -H 'Referer: https://wateroffice.ec.gc.ca/map/index_e.html?type=historical' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36 Edg/111.0.1661.44' \
  -H 'X-Requested-With: XMLHttpRequest' \
  -H 'sec-ch-ua: "Microsoft Edge";v="111", "Not(A:Brand";v="8", "Chromium";v="111"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "Windows"' \
  --compressed

This should get all the stations.

For other programming languages, convert the command with curlconverter.

Or you can search directly using the URL:

station_name = "teststation"
url = "https://wateroffice.ec.gc.ca/search/historical_results_e.html?search_type=station_name&station_name=" + station_name + "&start_year=1850&end_year=2023&minimum_years=&gross_drainage_operator=%3E&gross_drainage_area=&effective_drainage_operator=%3E&effective_drainage_area="
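
Since the question is about R, the same URL could be assembled there as well. The following is a minimal sketch, not a documented interface of the site: `build_search_url()` is a hypothetical helper, the parameter names and their order are copied from the URLs in this thread, and it assumes the field carrying the search value is named after the search type (as it is for `station_name` and `station_number`). Values are percent-encoded with base R's `utils::URLencode()` so that station names containing spaces stay valid.

```r
# Hypothetical helper (not from the original posts): assembles the results-page
# URL for a given search type, percent-encoding each value.
build_search_url <- function(value,
                             search_type = "station_number",
                             start_year = 1850,
                             end_year = 2023) {
  base <- "https://wateroffice.ec.gc.ca/search/historical_results_e.html"

  # Parameter names and order copied from the URL used in this thread;
  # the field named after the search type carries the search value.
  params <- c(
    search_type = search_type,
    stats::setNames(as.character(value), search_type),
    start_year = as.character(start_year),
    end_year = as.character(end_year),
    minimum_years = "",
    gross_drainage_operator = ">",
    gross_drainage_area = "",
    effective_drainage_operator = ">",
    effective_drainage_area = ""
  )

  # Percent-encode values; empty strings are passed through unchanged.
  encode <- function(x) {
    if (nzchar(x)) utils::URLencode(x, reserved = TRUE) else ""
  }
  encoded <- vapply(params, encode, character(1), USE.NAMES = FALSE)

  paste0(base, "?", paste0(names(params), "=", encoded, collapse = "&"))
}
```

Calling `build_search_url("08NM083")` reproduces the station-number URL from the question's edit, while `build_search_url("OKANAGAN LAKE", search_type = "station_name")` (an illustrative station name) encodes the space as `%20`.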

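Answer 2

Score: 0

Figured it out! I needed to jump to the "download csv" link and specifically pull the new session's response content. Full code below for anyone who needs to do something similar: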

library(rvest)

station_number <- "08NM083"
url <- paste0("https://wateroffice.ec.gc.ca/search/historical_results_e.html?search_type=station_number&station_number=", 
              station_number, 
              "&start_year=1850&end_year=2023&minimum_years=&gross_drainage_operator=%3E&gross_drainage_area=&effective_drainage_operator=%3E&effective_drainage_area=")
user_a <- httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 12_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36")

my_session <- session(url, user_a)

form <- html_form(my_session)[[2]]

filled <- form %>%
  html_form_set(check_all = "all")

resp <- session_submit(x = my_session, form = filled, submit = "download")

link <- resp %>%
  read_html() %>%
  html_element("p+ section .col-lg-4:nth-child(1) a") %>%
  html_attr("href")

full_link <- url_absolute(link, url)

next_ses <- my_session %>%
  session_jump_to(full_link)

writeBin(next_ses$response$content, "Downloads/test_hydat.csv")
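
Because the original goal was a function that takes a station number, the steps above could be wrapped roughly as follows. This is a sketch under assumptions rather than a tested implementation: `download_station_csv()` and its arguments are hypothetical names, the form index, field name, submit button, and CSS selector are copied unchanged from the code above and depend on the site's current layout, and the rvest, httr, and xml2 packages are assumed to be installed.

```r
# Sketch of a wrapper around the steps above; the function name and arguments
# are hypothetical. Assumes the rvest, httr, and xml2 packages are installed.
download_station_csv <- function(station_number, destfile) {
  # Validate inputs before any network request is made.
  stopifnot(
    is.character(station_number), length(station_number) == 1L,
    nzchar(station_number),
    is.character(destfile), length(destfile) == 1L
  )

  url <- paste0(
    "https://wateroffice.ec.gc.ca/search/historical_results_e.html",
    "?search_type=station_number&station_number=", station_number,
    "&start_year=1850&end_year=2023&minimum_years=",
    "&gross_drainage_operator=%3E&gross_drainage_area=",
    "&effective_drainage_operator=%3E&effective_drainage_area="
  )
  user_a <- httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 12_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36")

  results <- rvest::session(url, user_a)

  # Same form index, field, and submit button as in the code above.
  form <- rvest::html_form(results)[[2]]
  filled <- rvest::html_form_set(form, check_all = "all")
  resp <- rvest::session_submit(results, filled, submit = "download")

  # Selector copied from the code above; it depends on the page layout.
  link <- rvest::html_attr(
    rvest::html_element(rvest::read_html(resp),
                        "p+ section .col-lg-4:nth-child(1) a"),
    "href"
  )
  if (is.na(link)) stop("Could not find the CSV download link on the page.")

  # Follow the link within the same session and save the raw response body.
  csv_ses <- rvest::session_jump_to(results, xml2::url_absolute(link, url))
  httr::stop_for_status(csv_ses$response)

  writeBin(csv_ses$response$content, destfile)
  invisible(destfile)
}
```

A call such as `download_station_csv("08NM083", "Downloads/test_hydat.csv")` would then correspond to the script above, with the added checks stopping early if the link is missing or the request fails.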

Posted by huangapple on March 20, 2023, 23:55:20
Source: https://go.coder-hub.com/75792539.html