Website forbidden when scraping web data in R but works fine in browser


Question

I'm trying to import the data here:

https://download.bls.gov/pub/time.series/cu/cu.series

But when I run

fread('https://download.bls.gov/pub/time.series/cu/cu.series')

I get:

Error in curl::curl_download(input, tmpFile, mode = "wb", quiet = !showProgress) :
HTTP error 403.

Update: I still get the error even when using a custom user agent in RStudio Cloud.


Answer 1

Score: 1

This worked for me at the BLS time series site: prefix the URL with "https://", and for the user_agent string, supply your email address instead of a browser string. For example:

GET("https://download.bls.gov/...", user_agent("youremail@domain.name"))
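
Putting this answer's two suggestions together, here is a minimal sketch of the whole download-and-parse step. The helper names `with_https()` and `read_bls()` are made up for illustration, and the sketch assumes the `httr` and `data.table` packages are installed. Whether the server accepts a given User-Agent is up to BLS, so treat this as a starting point rather than a guaranteed fix.

```r
# Hypothetical helper: force an https:// scheme on a URL, whether it was
# given without a scheme or with http://.
with_https <- function(url) {
  paste0("https://", sub("^https?://", "", url))
}

# Hypothetical helper: fetch a BLS flat file, identifying the client by a
# contact email in the User-Agent header, and parse it with fread().
read_bls <- function(url, contact_email) {
  resp <- httr::GET(with_https(url), httr::user_agent(contact_email))
  httr::stop_for_status(resp)  # turn a 403 (or any other 4xx/5xx) into an R error
  txt <- httr::content(resp, as = "text", encoding = "UTF-8")
  data.table::fread(text = txt)
}
```

A call such as `read_bls("download.bls.gov/pub/time.series/cu/cu.series", "youremail@domain.name")` would then return a data.table if the request is accepted.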

Answer 2

Score: 0

Two issues are addressed here:

1. For the initial failure, we'll use `httr` and its `user_agent` for the query.
2. For the subsequent `GET(..)` failure, we'll prepend `"https://"` to the URL, since it'll otherwise default to `"http://"` (and not all websites automatically redirect port 80 to port 443 with a scheme upgrade).

```r
library(httr)
library(data.table)  # provides fread()

# Without a scheme, the request goes out over http:// and is rejected.
quux <- GET(url = "download.bls.gov/pub/time.series/cu/cu.series", user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"))
quux
# Response [http://download.bls.gov/pub/time.series/cu/cu.series]
# Date: 2023-05-17 17:23
# Status: 403
# Content-Type: text/html
# Size: 1.32 kB

# With an explicit https:// scheme, the same request succeeds.
quux <- GET(url = "https://download.bls.gov/pub/time.series/cu/cu.series", user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"))
quux
# Response [https://download.bls.gov/pub/time.series/cu/cu.series]
# Date: 2023-05-17 17:23
# Status: 200
# Content-Type: text/plain
# Size: 1.34 MB
# series_id area_code item_code seasonal periodicity_code base_code base_period series_title footnote_cod...
# CUSR0000SA0 0000 SA0 S R S 1982-84=100 All items in U.S. city average, all urban consumers, seasonally ad...
# CUSR0000SA0E 0000 SA0E S R S 1982-84=100 Energy in U.S. city average, all urban consumers, seasonally adju...
# CUSR0000SA0L1 0000 SA0L1 S R S 1982-84=100 All items less food in U.S. city average, all urban consumers, s...
# CUSR0000SA0L12 0000 SA0L12 S R S 1982-84=100 All items less food and shelter in U.S. city average, all urban...
# CUSR0000SA0L12E 0000 SA0L12E S R S 1982-84=100 All items less food, shelter, and energy in U.S. city average,...
# CUSR0000SA0L12E4 0000 SA0L12E4 S R S 1982-84=100 All items less food, shelter, energy, and used cars and truck...
# CUSR0000SA0L1E 0000 SA0L1E S R S 1982-84=100 All items less food and energy in U.S. city average, all urban ...
# CUSR0000SA0L2 0000 SA0L2 S R S 1982-84=100 All items less shelter in U.S. city average, all urban consumers...
# CUSR0000SA0L5 0000 SA0L5 S R S 1982-84=100 All items less medical care in U.S. city average, all urban con...
# ...

# Parse the response body (a single string of text) into a data.table.
fread(content(quux))
# No encoding supplied: defaulting to UTF-8.
# series_id area_code item_code seasonal periodicity_code base_code base_period
# <char> <char> <char> <char> <char> <char> <char>
# 1: CUSR0000SA0 0000 SA0 S R S 1982-84=100
# 2: CUSR0000SA0E 0000 SA0E S R S 1982-84=100
# 3: CUSR0000SA0L1 0000 SA0L1 S R S 1982-84=100
# 4: CUSR0000SA0L12 0000 SA0L12 S R S 1982-84=100
# 5: CUSR0000SA0L12E 0000 SA0L12E S R S 1982-84=100
# 6: CUSR0000SA0L12E4 0000 SA0L12E4 S R S 1982-84=100
# 7: CUSR0000SA0L1E 0000 SA0L1E S R S 1982-84=100
# 8: CUSR0000SA0L2 0000 SA0L2 S R S 1982-84=100
# 9: CUSR0000SA0L5 0000 SA0L5 S R S 1982-84=100
# 10: CUSR0000SA0LE 0000 SA0LE S R S 1982-84=100
# ---
# 8090: CUUSS49GSEHF02 S49G SEHF02 U S S 1982-84=100
# 8091: CUUSS49GSETA S49G SETA U S S DECEMBER 1997=100
# 8092: CUUSS49GSETA01 S49G SETA01 U S S JANUARY 1978=100
# 8093: CUUSS49GSETA02 S49G SETA02 U S S JANUARY 1978=100
# 8094: CUUSS49GSETB S49G SETB U S S 1982-84=100
# 8095: CUUSS49GSETB01 S49G SETB01 U S S 1982-84=100
# 8096: CUUSS49GSETE S49G SETE U S S JANUARY 1978=100
# 8097: CUUSS49GSS47014 S49G SS47014 U S S 1982-84=100
# 8098: CUUSS49GSS47015 S49G SS47015 U S S DECEMBER 1993=100
# 8099: CUUSS49GSS47016 S49G SS47016 U S S 1982-84=100
# 6 variables not shown: [series_title <char>, footnote_codes <lgcl>, begin_year <int>, begin_period <char>, end_year <int>, end_period <char>]
```
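
As a variation on the approach above: the original error came from `curl::curl_download()`, which `fread()` calls when given a URL, so another option is to do that download step by hand with a handle that carries an explicit User-Agent and then read the local file. This is only a sketch. The name `download_with_ua()` is hypothetical, it assumes the `curl` package is installed, and it assumes the server is filtering on the User-Agent header; the question's update suggests requests from some hosted environments may still be refused.

```r
# Hypothetical helper: download `url` to a local file with an explicit
# User-Agent header and return the path to that file.
download_with_ua <- function(url, user_agent, destfile = tempfile(fileext = ".txt")) {
  h <- curl::new_handle(useragent = user_agent)  # sets libcurl's USERAGENT option
  curl::curl_download(url, destfile, mode = "wb", handle = h)  # errors on HTTP 4xx/5xx
  destfile
}
```

The returned path can be passed straight to `fread()`, for example `data.table::fread(download_with_ua("https://download.bls.gov/pub/time.series/cu/cu.series", "youremail@domain.name"))`.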

huangapple
  • Posted on May 17, 2023 at 22:43:44
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76273357.html