Website forbidden when scraping web data in R but works fine in browser


Question

I'm trying to import the data here:

https://download.bls.gov/pub/time.series/cu/cu.series

But when I run

fread('https://download.bls.gov/pub/time.series/cu/cu.series')

I get:

Error in curl::curl_download(input, tmpFile, mode = "wb", quiet = !showProgress) :
HTTP error 403.

Update: I still get the error even when using a custom user agent in RStudio Cloud.


Answer 1

Score: 1

This worked for me on the BLS time series site: prefix the URL with "https://", and pass your email address as the user_agent string instead of a browser string. For example:

GET("https://download.bls.gov/...", user_agent("youremail@domain.name"))
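A minimal sketch of how this could be wrapped end to end, assuming the `httr` and `data.table` packages are installed. The helper name `fetch_bls` and its `contact` argument are hypothetical, introduced only for illustration; the email address is a placeholder.

```r
library(httr)
library(data.table)

# Hypothetical helper: download a delimited text file and parse it.
# `contact` is sent as the User-Agent header (an email address, as suggested above).
fetch_bls <- function(url, contact) {
  resp <- GET(url, user_agent(contact))
  stop_for_status(resp)  # raise an R error on 4xx/5xx instead of parsing an error page
  body <- content(resp, as = "text", encoding = "UTF-8")
  fread(text = body)     # fread detects the delimiter from the text
}

# Example call (not run here; replace the placeholder address with your own):
# cu_series <- fetch_bls(
#   "https://download.bls.gov/pub/time.series/cu/cu.series",
#   "youremail@domain.name"
# )
```

Checking the status before parsing means a 403 surfaces as an explicit error rather than as a garbled table built from the HTML error page.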

Answer 2

Score: 0


Two issues are addressed here:

1. For the initial failure, we'll use `httr` and its `user_agent` for the query.
2. For the subsequent `GET(..)` failure, we'll prepend `"https://"` to the URL, since it'll otherwise default to `"http://"` (and not all websites automatically redirect port 80 to port 443 with a scheme upgrade).
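The second point can be guarded against with a small helper that adds a scheme only when the URL lacks one. This is a sketch in base R; `ensure_scheme` is a hypothetical name, not a function from `httr`.

```r
# Hypothetical helper: prepend a scheme when the URL does not already have one.
# URLs that already start with something like "http://" or "https://" are returned unchanged.
ensure_scheme <- function(url, scheme = "https") {
  has_scheme <- grepl("^[A-Za-z][A-Za-z0-9+.-]*://", url)
  ifelse(has_scheme, url, paste0(scheme, "://", url))
}

bare_url <- "download.bls.gov/pub/time.series/cu/cu.series"
full_url <- ensure_scheme(bare_url)
print(full_url)
```

The helper deliberately leaves an explicit `http://` URL alone, so it never silently changes a scheme the caller chose.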

```r
library(httr)
library(data.table)  # provides fread()
# Without a scheme, the request goes to http:// and is rejected with a 403
quux <- GET(url = "download.bls.gov/pub/time.series/cu/cu.series", user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"))
quux
# Response [http://download.bls.gov/pub/time.series/cu/cu.series]
#   Date: 2023-05-17 17:23
#   Status: 403
#   Content-Type: text/html
#   Size: 1.32 kB
# With an explicit https:// scheme, the same request succeeds
quux <- GET(url = "https://download.bls.gov/pub/time.series/cu/cu.series", user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"))
quux
# Response [https://download.bls.gov/pub/time.series/cu/cu.series]
#   Date: 2023-05-17 17:23
#   Status: 200
#   Content-Type: text/plain
#   Size: 1.34 MB
# series_id        	area_code	item_code	seasonal	periodicity_code	base_code	base_period	series_title	footnote_cod...
# CUSR0000SA0      	0000	SA0	S	R	S	1982-84=100	All items in U.S. city average, all urban consumers, seasonally ad...
# CUSR0000SA0E     	0000	SA0E	S	R	S	1982-84=100	Energy in U.S. city average, all urban consumers, seasonally adju...
# CUSR0000SA0L1    	0000	SA0L1	S	R	S	1982-84=100	All items less food in U.S. city average, all urban consumers, s...
# CUSR0000SA0L12   	0000	SA0L12	S	R	S	1982-84=100	All items less food and shelter in U.S. city average, all urban...
# CUSR0000SA0L12E  	0000	SA0L12E	S	R	S	1982-84=100	All items less food, shelter, and energy in U.S. city average,...
# CUSR0000SA0L12E4 	0000	SA0L12E4	S	R	S	1982-84=100	All items less food, shelter, energy, and used cars and truck...
# CUSR0000SA0L1E   	0000	SA0L1E	S	R	S	1982-84=100	All items less food and energy in U.S. city average, all urban ...
# CUSR0000SA0L2    	0000	SA0L2	S	R	S	1982-84=100	All items less shelter in U.S. city average, all urban consumers...
# CUSR0000SA0L5    	0000	SA0L5	S	R	S	1982-84=100	All items  less medical care in U.S. city average, all urban con...
# ...
fread(content(quux))
# No encoding supplied: defaulting to UTF-8.
#              series_id area_code item_code seasonal periodicity_code base_code       base_period
#                 <char>    <char>    <char>   <char>           <char>    <char>            <char>
#    1:      CUSR0000SA0      0000       SA0        S                R         S       1982-84=100
#    2:     CUSR0000SA0E      0000      SA0E        S                R         S       1982-84=100
#    3:    CUSR0000SA0L1      0000     SA0L1        S                R         S       1982-84=100
#    4:   CUSR0000SA0L12      0000    SA0L12        S                R         S       1982-84=100
#    5:  CUSR0000SA0L12E      0000   SA0L12E        S                R         S       1982-84=100
#    6: CUSR0000SA0L12E4      0000  SA0L12E4        S                R         S       1982-84=100
#    7:   CUSR0000SA0L1E      0000    SA0L1E        S                R         S       1982-84=100
#    8:    CUSR0000SA0L2      0000     SA0L2        S                R         S       1982-84=100
#    9:    CUSR0000SA0L5      0000     SA0L5        S                R         S       1982-84=100
#   10:    CUSR0000SA0LE      0000     SA0LE        S                R         S       1982-84=100
#   ---                                                                                           
# 8090:   CUUSS49GSEHF02      S49G    SEHF02        U                S         S       1982-84=100
# 8091:     CUUSS49GSETA      S49G      SETA        U                S         S DECEMBER 1997=100
# 8092:   CUUSS49GSETA01      S49G    SETA01        U                S         S  JANUARY 1978=100
# 8093:   CUUSS49GSETA02      S49G    SETA02        U                S         S  JANUARY 1978=100
# 8094:     CUUSS49GSETB      S49G      SETB        U                S         S       1982-84=100
# 8095:   CUUSS49GSETB01      S49G    SETB01        U                S         S       1982-84=100
# 8096:     CUUSS49GSETE      S49G      SETE        U                S         S  JANUARY 1978=100
# 8097:  CUUSS49GSS47014      S49G   SS47014        U                S         S       1982-84=100
# 8098:  CUUSS49GSS47015      S49G   SS47015        U                S         S DECEMBER 1993=100
# 8099:  CUUSS49GSS47016      S49G   SS47016        U                S         S       1982-84=100
# 6 variables not shown: [series_title <char>, footnote_codes <lgcl>, begin_year <int>, begin_period <char>, end_year <int>, end_period <char>]
```

Posted by huangapple on May 17, 2023 at 22:43:44
Source: https://go.coder-hub.com/76273357.html