Why am I getting a 403 error when connecting to a URL that works?

Question

I am trying to pull the quarter-end dates for a company from the SEC website, but I keep getting a connection error. The code works for my friend in the US, but not for me in Canada. I tried using a VPN and still got the same error.

When I put the URL into Google, it brings me to the page with all the information, so I am not sure why I can't pull it into R. Here are the code and the error I get:

    library(derivmkts)
    library(quantmod)
    library(jsonlite)
    library(tidyverse)

    url <- "https://data.sec.gov/submissions/CIK0000320193.json"
    df <- fromJSON(url, flatten = TRUE)
    #> Error in open.connection(con, "rb") :
    #>   cannot open the connection to 'https://data.sec.gov/submissions/CIK0000320193.json'
    #> In addition: Warning message:
    #> In open.connection(con, "rb") :
    #>   cannot open URL 'https://data.sec.gov/submissions/CIK0000320193.json': HTTP status was '403 Forbidden'

I am not expecting a 403 error when connecting to this URL.

Answer 1

Score: 2

The SEC asks you to declare a user agent in your request headers: https://www.sec.gov/os/accessing-edgar-data

Apparently the user agent given there as an example is also accepted, though you really should provide your own contact details instead.

Here is an example with httr2, which still uses jsonlite to parse JSON responses:

    library(httr2)

    resp <- request("https://data.sec.gov/submissions/CIK0000320193.json") |>
      req_user_agent("Sample Company Name AdminContact@<sample company domain>.com") |>
      # set verbosity level for debugging, 1: show headers
      req_perform(verbosity = 1)
    #> -> GET /submissions/CIK0000320193.json HTTP/1.1
    #> -> Host: data.sec.gov
    #> -> User-Agent: Sample Company Name AdminContact@<sample company domain>.com
    #> -> Accept: */*
    #> -> Accept-Encoding: deflate, gzip
    #> ->
    #> <- HTTP/1.1 200 OK
    #> <- Content-Type: application/json
    #> <- x-amzn-RequestId: c634dcbe-68aa-4777-9f18-4edfae752eb4
    #> <- Access-Control-Allow-Origin: *
    #> <- x-amz-apigw-id: IvJu4HiHIAMFidw=
    #> <- X-Amzn-Trace-Id: Root=1-64c2bcc5-5db9315369e664da512cb6b5
    #> <- Vary: Accept-Encoding
    #> <- Content-Encoding: gzip
    #> <- Expires: Thu, 27 Jul 2023 18:51:49 GMT
    #> <- Cache-Control: max-age=0, no-cache, no-store
    #> <- Pragma: no-cache
    #> <- Date: Thu, 27 Jul 2023 18:51:49 GMT
    #> <- Content-Length: 28594
    #> <- Connection: keep-alive
    #> <- Strict-Transport-Security: max-age=31536000 ; preload
    #> <- Set-Cookie: ak_bmsc=E9...

    resp
    #> <httr2_response>
    #> GET https://data.sec.gov/submissions/CIK0000320193.json
    #> Status: 200 OK
    #> Content-Type: application/json
    #> Body: In memory (157568 bytes)

    # first few keys / values from JSON:
    resp_body_json(resp, simplifyVector = TRUE, flatten = TRUE) |>
      head(n = 10) |>
      str()
    #> List of 10
    #>  $ cik                              : chr "320193"
    #>  $ entityType                       : chr "operating"
    #>  $ sic                              : chr "3571"
    #>  $ sicDescription                   : chr "Electronic Computers"
    #>  $ insiderTransactionForOwnerExists : int 0
    #>  $ insiderTransactionForIssuerExists: int 1
    #>  $ name                             : chr "Apple Inc."
    #>  $ tickers                          : chr "AAPL"
    #>  $ exchanges                        : chr "Nasdaq"
    #>  $ ein                              : chr "942404110"

<sup>Created on 2023-07-27 with reprex v2.0.2</sup>
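
For readers who would rather keep calling `jsonlite::fromJSON()` as in the question, the same user-agent idea can probably be applied through a curl handle. The following is only a minimal sketch, not a tested recipe against the SEC servers: the helper name `fetch_json_with_agent` is hypothetical, the curl and jsonlite packages are assumed to be installed, and the user agent string must be replaced with your own name and contact address.

```r
# Hypothetical helper (not part of jsonlite or curl): fetch a JSON document
# while sending an explicit User-Agent header.
fetch_json_with_agent <- function(url, user_agent) {
  # Fail early on unusable input, before any network access happens.
  stopifnot(
    is.character(url), length(url) == 1L, nzchar(url),
    is.character(user_agent), length(user_agent) == 1L, nzchar(user_agent)
  )

  # A curl handle carries request options; `useragent` sets the User-Agent header.
  handle <- curl::new_handle(useragent = user_agent)

  # curl::curl() returns a connection that fromJSON() opens, reads and closes.
  con <- curl::curl(url, handle = handle)
  jsonlite::fromJSON(con, flatten = TRUE)
}

# Example call (commented out because it needs network access):
# df <- fetch_json_with_agent(
#   "https://data.sec.gov/submissions/CIK0000320193.json",
#   "Your Name your.name@example.com"   # assumption: replace with real contact details
# )
```

If the server still rejects the request, opening the connection raises the same kind of error as in the question, so the call can be wrapped in `tryCatch()` when that matters.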

I'm in the EU and can open that JSON URL in the browser without any issues, but the default jsonlite and httr2 user agents are blocked. Using my browser's user agent with httr2 works only when I also set Accept-Language. When a request does not come from a browser, they seem to check the user agent for some odd pattern: for example, "foo_bar" is rejected, while "foo.bar" is accepted.
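
The observations above can be turned into a small sketch that builds a user agent in the "Name contact@domain" shape from the SEC example and sends it together with an Accept-Language header. Everything named here is an assumption for illustration: `make_user_agent` is a hypothetical helper, its e-mail check is only a rough sanity test and not the rule the SEC actually applies, and the name and address are placeholders. The network request runs only in an interactive session with httr2 installed.

```r
# Hypothetical helper: build a User-Agent string of the form "Name email".
# The regular expression is a rough sanity check, not the SEC's real rule.
make_user_agent <- function(name, email) {
  stopifnot(
    is.character(name), length(name) == 1L, nzchar(name),
    is.character(email), length(email) == 1L
  )
  if (!grepl("^[^@[:space:]]+@[^@[:space:]]+\\.[^@[:space:]]+$", email)) {
    stop("`email` does not look like an e-mail address: ", email)
  }
  paste(name, email)
}

# Placeholder identity; replace with your own name and contact address.
ua <- make_user_agent("Jane Doe", "jane.doe@example.com")

# Perform the request only when run by hand and when httr2 is available.
if (interactive() && requireNamespace("httr2", quietly = TRUE)) {
  resp <- httr2::request("https://data.sec.gov/submissions/CIK0000320193.json") |>
    httr2::req_user_agent(ua) |>
    # Accept-Language was reported as necessary with a browser-style agent;
    # with a declared agent like this one it may not be needed.
    httr2::req_headers(`Accept-Language` = "en-US,en;q=0.9") |>
    httr2::req_perform()
  print(httr2::resp_status(resp))
}
```

Validating the string before sending it keeps an obviously malformed user agent from turning into a harder-to-read 403 later.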

huangapple
  • Posted on 2023-07-28 01:22:26
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76782132.html