Website forbidden when scraping web data in R but works fine in browser
Question
I'm trying to import the data here:
https://download.bls.gov/pub/time.series/cu/cu.series
But when I run
fread('https://download.bls.gov/pub/time.series/cu/cu.series')
I get:
Error in curl::curl_download(input, tmpFile, mode = "wb", quiet = !showProgress) :
HTTP error 403.
Update: I still get the error even when using a custom user agent in RStudio Cloud.
Answer 1
Score: 1
This worked for me on the BLS time series site: prefix the URL with `"https://"`, and for the user-agent string, supply your email address instead of a browser string. For example:
`GET("https://download.bls.gov/...", user_agent("youremail@domain.name"))`
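
A minimal sketch of this suggestion, assuming the `httr` and `data.table` packages are installed. The helper names `ensure_https` and `read_bls` are hypothetical (not from the original answers), and whether the server accepts the request may still depend on your network environment:

```r
# Hypothetical helpers for illustration; not part of the original answers.

# Return the URL with an explicit "https://" scheme, so the request does not
# fall back to "http://" when the scheme is omitted.
ensure_https <- function(url) {
  if (grepl("^https://", url, ignore.case = TRUE)) return(url)
  if (grepl("^http://", url, ignore.case = TRUE)) {
    return(sub("^http://", "https://", url, ignore.case = TRUE))
  }
  paste0("https://", url)
}

# Download a delimited text file and parse it into a data.table.
# `contact` is the string sent as the User-Agent header (here, an email address).
read_bls <- function(url, contact) {
  resp <- httr::GET(ensure_https(url), httr::user_agent(contact))
  httr::stop_for_status(resp)  # turn a 403 (or any HTTP error) into an R error
  # Ask for the body as text with an explicit encoding, then let fread parse it.
  txt <- httr::content(resp, as = "text", encoding = "UTF-8")
  data.table::fread(text = txt)
}
```

With these definitions, a call such as `read_bls("download.bls.gov/pub/time.series/cu/cu.series", "youremail@domain.name")` would send the request over HTTPS with the email address as the user agent and stop with an error if the server still responds with 403.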
Answer 2
Score: 0
<details>
<summary>English:</summary>
Two issues addressed here:
1. For the initial failure, we'll use `httr` and its `user_agent` for the query.
2. For the subsequent `GET(..)` failure, we'll prepend `"https://"` to the URL, since it'll otherwise default to `"http://"` (and not all websites automatically redirect port 80 to port 443 with a scheme upgrade).
```r
library(httr)
library(data.table)  # provides fread(), used below
quux <- GET(url = "download.bls.gov/pub/time.series/cu/cu.series", user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"))
quux
# Response [http://download.bls.gov/pub/time.series/cu/cu.series]
# Date: 2023-05-17 17:23
# Status: 403
# Content-Type: text/html
# Size: 1.32 kB
quux <- GET(url = "https://download.bls.gov/pub/time.series/cu/cu.series", user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"))
quux
# Response [https://download.bls.gov/pub/time.series/cu/cu.series]
# Date: 2023-05-17 17:23
# Status: 200
# Content-Type: text/plain
# Size: 1.34 MB
# series_id area_code item_code seasonal periodicity_code base_code base_period series_title footnote_cod...
# CUSR0000SA0 0000 SA0 S R S 1982-84=100 All items in U.S. city average, all urban consumers, seasonally ad...
# CUSR0000SA0E 0000 SA0E S R S 1982-84=100 Energy in U.S. city average, all urban consumers, seasonally adju...
# CUSR0000SA0L1 0000 SA0L1 S R S 1982-84=100 All items less food in U.S. city average, all urban consumers, s...
# CUSR0000SA0L12 0000 SA0L12 S R S 1982-84=100 All items less food and shelter in U.S. city average, all urban...
# CUSR0000SA0L12E 0000 SA0L12E S R S 1982-84=100 All items less food, shelter, and energy in U.S. city average,...
# CUSR0000SA0L12E4 0000 SA0L12E4 S R S 1982-84=100 All items less food, shelter, energy, and used cars and truck...
# CUSR0000SA0L1E 0000 SA0L1E S R S 1982-84=100 All items less food and energy in U.S. city average, all urban ...
# CUSR0000SA0L2 0000 SA0L2 S R S 1982-84=100 All items less shelter in U.S. city average, all urban consumers...
# CUSR0000SA0L5 0000 SA0L5 S R S 1982-84=100 All items less medical care in U.S. city average, all urban con...
# ...
fread(content(quux))
# No encoding supplied: defaulting to UTF-8.
# series_id area_code item_code seasonal periodicity_code base_code base_period
# <char> <char> <char> <char> <char> <char> <char>
# 1: CUSR0000SA0 0000 SA0 S R S 1982-84=100
# 2: CUSR0000SA0E 0000 SA0E S R S 1982-84=100
# 3: CUSR0000SA0L1 0000 SA0L1 S R S 1982-84=100
# 4: CUSR0000SA0L12 0000 SA0L12 S R S 1982-84=100
# 5: CUSR0000SA0L12E 0000 SA0L12E S R S 1982-84=100
# 6: CUSR0000SA0L12E4 0000 SA0L12E4 S R S 1982-84=100
# 7: CUSR0000SA0L1E 0000 SA0L1E S R S 1982-84=100
# 8: CUSR0000SA0L2 0000 SA0L2 S R S 1982-84=100
# 9: CUSR0000SA0L5 0000 SA0L5 S R S 1982-84=100
# 10: CUSR0000SA0LE 0000 SA0LE S R S 1982-84=100
# ---
# 8090: CUUSS49GSEHF02 S49G SEHF02 U S S 1982-84=100
# 8091: CUUSS49GSETA S49G SETA U S S DECEMBER 1997=100
# 8092: CUUSS49GSETA01 S49G SETA01 U S S JANUARY 1978=100
# 8093: CUUSS49GSETA02 S49G SETA02 U S S JANUARY 1978=100
# 8094: CUUSS49GSETB S49G SETB U S S 1982-84=100
# 8095: CUUSS49GSETB01 S49G SETB01 U S S 1982-84=100
# 8096: CUUSS49GSETE S49G SETE U S S JANUARY 1978=100
# 8097: CUUSS49GSS47014 S49G SS47014 U S S 1982-84=100
# 8098: CUUSS49GSS47015 S49G SS47015 U S S DECEMBER 1993=100
# 8099: CUUSS49GSS47016 S49G SS47016 U S S 1982-84=100
# 6 variables not shown: [series_title <char>, footnote_codes <lgcl>, begin_year <int>, begin_period <char>, end_year <int>, end_period <char>]
```