rvest - navigate site and download Canada hydrometric data
Question
I am writing an R function that takes a station number, navigates the Canada hydrometric data site (wateroffice.ec.gc.ca), and downloads all data for that station. I'm running into a few problems, possibly because of the radio buttons and/or because the search button is unnamed. This is what I have:
library(rvest)
station_number <- "08NM083"
url <- "https://wateroffice.ec.gc.ca/search/historical_e.html"
user_a <- httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 12_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36")
my_session <- session(url, user_a)
form <- html_form(my_session)[[2]]
which gives:
<form> 'search-form' (GET https://wateroffice.ec.gc.ca/search/historical_results_e.html)
<field> (submit) : Search
<field> (radio) search_type: station_name
<field> (text) station_name:
<field> (radio) search_type: station_number
<field> (text) station_number:
<field> (radio) search_type: province
<field> (select) province: AB
<field> (radio) search_type: basin
<field> (select) basin:
<field> (radio) search_type: region
<field> (select) region: ATL
<field> (radio) search_type: coordinate
<field> (number) north_degrees:
<field> (number) north_minutes:
<field> (number) north_seconds:
<field> (number) south_degrees:
<field> (number) south_minutes:
<field> (number) south_seconds:
<field> (number) east_degrees:
<field> (number) east_minutes:
<field> (number) east_seconds:
<field> (number) west_degrees:
<field> (number) west_minutes:
<field> (number) west_seconds:
<field> (select) parameter_type: all
<field> (number) start_year: 1850
<field> (number) end_year: 2023
<field> (number) minimum_years:
<field> (checkbox) latest_year: Y
<field> (select) regulation: all
<field> (select) station_status: all
<field> (select) operation_schedule:
<field> (select) contributing_agency: all
<field> (select) gross_drainage_operator: >
<field> (number) gross_drainage_area:
<field> (select) effective_drainage_operator: >
<field> (number) effective_drainage_area:
<field> (select) sediment: ---
<field> (select) real_time: ---
<field> (select) rhbn: ---
<field> (select) contributed: ---
<field> (submit) : Search
When I fill out the form and submit, however, nothing seems to have changed.
filled <- form %>%
  html_form_set(station_number = station_number,
                search_type = "station_number")
resp <- session_submit(x = my_session, form = filled)
my_session and resp:
> my_session
<session> https://wateroffice.ec.gc.ca/search/historical_e.html
Status: 200
Type: text/html; charset=UTF-8
Size: 45034
> resp
<session> https://wateroffice.ec.gc.ca/search/historical_e.html
Status: 200
Type: text/html; charset=UTF-8
Size: 45284
Any suggestions are welcomed!
Edit
kaliiiiiiiii's suggestion of pasting the station number directly into the URL worked wonderfully for this part of my problem! I still cannot figure out how to download the CSV file.
Current code:
station_number <- "08NM083"
url <- paste0("https://wateroffice.ec.gc.ca/search/historical_results_e.html?search_type=station_number&station_number=",
              station_number,
              "&start_year=1850&end_year=2023&minimum_years=&gross_drainage_operator=%3E&gross_drainage_area=&effective_drainage_operator=%3E&effective_drainage_area=")
user_a <- httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 12_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36")
my_session <- session(url, user_a)
form <- html_form(my_session)[[2]]
filled <- form %>%
  html_form_set(check_all = "all")
resp <- session_submit(x = my_session, form = filled, submit = "download")
resp
link <- resp %>%
  read_html() %>%
  html_element("p+ section .col-lg-4:nth-child(1) a") %>%
  html_attr("href")
full_link <- url_absolute(link, url)
And my attempts at downloading the file:
download.file(full_link, destfile = "Downloads/test_hydat.csv")
test <- read_csv(full_link)
Both attempts return only HTML rather than the CSV data.
Answer 1
Score: 0
Why not use the API directly?
curl 'https://wateroffice.ec.gc.ca/services/map_data?data_type=historical' \
-H 'Accept: */*' \
-H 'Accept-Language: de,de-DE;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6,fr;q=0.5,de-CH;q=0.4,es;q=0.3' \
-H 'Cache-Control: no-cache' \
-H 'Connection: keep-alive' \
-H 'DNT: 1' \
-H 'Pragma: no-cache' \
-H 'Referer: https://wateroffice.ec.gc.ca/map/index_e.html?type=historical' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36 Edg/111.0.1661.44' \
-H 'X-Requested-With: XMLHttpRequest' \
-H 'sec-ch-ua: "Microsoft Edge";v="111", "Not(A:Brand";v="8", "Chromium";v="111"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "Windows"' \
--compressed
That request is meant to retrieve all the stations.
For other programming languages, convert the command with curlconverter.
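Since the question is about R, the curl call above could be approximated with httr. This is only a sketch under assumptions: `fetch_map_data` is a made-up helper name, it is assumed that the endpoint only needs the `X-Requested-With` and `Referer` headers (the browser-specific `sec-*` headers are left out), and the response format has not been verified, so the body is returned as plain text.

```r
# Hypothetical helper; requires the httr package to be installed.
fetch_map_data <- function(data_type = "historical") {
  resp <- httr::GET(
    "https://wateroffice.ec.gc.ca/services/map_data",
    query = list(data_type = data_type),
    # Assumption: these two headers are enough; the curl command sends more.
    httr::add_headers(
      `X-Requested-With` = "XMLHttpRequest",
      Referer = "https://wateroffice.ec.gc.ca/map/index_e.html?type=historical"
    ),
    httr::user_agent("Mozilla/5.0")
  )
  # Turn HTTP errors (4xx/5xx) into R errors instead of returning an error page.
  httr::stop_for_status(resp)
  # Return the raw body as text; parsing depends on the actual response format.
  httr::content(resp, as = "text", encoding = "UTF-8")
}
```

Calling `txt <- fetch_map_data()` and inspecting `substr(txt, 1, 200)` would show what the service actually returns before deciding how to parse it.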
Or you can search directly using the url:
station_name = "teststation"
url = "https://wateroffice.ec.gc.ca/search/historical_results_e.html?search_type=station_name&station_name="+station_name+"&start_year=1850&end_year=2023&minimum_years=&gross_drainage_operator=%3E&gross_drainage_area=&effective_drainage_operator=%3E&effective_drainage_area="
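In R, the same URL can be assembled from named parameters instead of one long string. The sketch below uses only base R; `build_search_url` is a hypothetical helper name, it searches by station number as in the question, and the parameter list simply mirrors the query string shown above.

```r
# Hypothetical helper: builds the historical-results search URL for one station.
build_search_url <- function(station_number, start_year = 1850, end_year = 2023) {
  base <- "https://wateroffice.ec.gc.ca/search/historical_results_e.html"
  # Named vector of query parameters; numeric years are coerced to character.
  params <- c(
    search_type = "station_number",
    station_number = station_number,
    start_year = start_year,
    end_year = end_year,
    minimum_years = "",
    gross_drainage_operator = ">",
    gross_drainage_area = "",
    effective_drainage_operator = ">",
    effective_drainage_area = ""
  )
  # Percent-encode each value (">" becomes "%3E"), then join as key=value pairs.
  encoded <- vapply(params, utils::URLencode, character(1), reserved = TRUE)
  paste0(base, "?", paste(names(params), encoded, sep = "=", collapse = "&"))
}
```

`build_search_url("08NM083")` can then be passed to `session()` in place of the hand-pasted string.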
Answer 2
Score: 0
Figured it out! I needed to jump to the "download csv" link with session_jump_to() and write out the new session's raw response content, rather than requesting the link outside the session. Full code below for anyone who needs to do something similar:
library(rvest)
library(xml2)  # provides url_absolute()

station_number <- "08NM083"
# Search results page for a single station, queried directly by URL
url <- paste0("https://wateroffice.ec.gc.ca/search/historical_results_e.html?search_type=station_number&station_number=",
              station_number,
              "&start_year=1850&end_year=2023&minimum_years=&gross_drainage_operator=%3E&gross_drainage_area=&effective_drainage_operator=%3E&effective_drainage_area=")
user_a <- httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 12_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36")
my_session <- session(url, user_a)

# Set check_all on the results form and submit it with the "download" button
form <- html_form(my_session)[[2]]
filled <- form %>%
  html_form_set(check_all = "all")
resp <- session_submit(x = my_session, form = filled, submit = "download")

# Extract the CSV link from the resulting page and make it absolute
link <- resp %>%
  read_html() %>%
  html_element("p+ section .col-lg-4:nth-child(1) a") %>%
  html_attr("href")
full_link <- url_absolute(link, url)

# Follow the link within the same session and write the raw response body to disk
next_ses <- my_session %>%
  session_jump_to(full_link)
writeBin(next_ses$response$content, "Downloads/test_hydat.csv")
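Because the earlier attempts silently saved HTML instead of data, it may be worth checking the downloaded bytes before trusting them. The following is a minimal base R sketch, not part of the original solution: `read_csv_bytes` is a hypothetical helper, and the HTML check is a simple heuristic that only looks at the start of the payload.

```r
# Hypothetical helper: turn downloaded raw bytes into a data frame,
# stopping early if the payload looks like an HTML page instead of CSV.
read_csv_bytes <- function(bytes, ...) {
  txt <- rawToChar(bytes)
  # Heuristic: an HTML page starts with "<!DOCTYPE" or "<html" after optional whitespace.
  if (grepl("^[[:space:]]*<(!DOCTYPE|html)", txt, ignore.case = TRUE)) {
    stop("Response looks like HTML, not CSV; check the download link.")
  }
  # Extra arguments (for example skip or stringsAsFactors) are forwarded to read.csv().
  utils::read.csv(text = txt, ...)
}
```

With the session from the code above, `hydat <- read_csv_bytes(next_ses$response$content)` would load the data directly into a data frame, or fail with a clear message if the site returned a page instead of the file.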