Why am I getting a 403 error when connecting to a URL that works?

Question

I am trying to pull the quarter end dates for a company from the SEC government website. For some reason I keep getting a connection error. The code is working for my friend who is in the US, but not for me in Canada. I tried using a VPN, but was still getting the same error. Here is the code and the error that I was getting.

When I paste the URL into Google, it takes me to the page with all the information, so I am not sure why I can't pull it into R.

library(derivmkts)
library(quantmod)
library(jsonlite)
library(tidyverse)

url = "https://data.sec.gov/submissions/CIK0000320193.json"
df <- fromJSON(url, flatten = TRUE)

Error in open.connection(con, "rb") : 
  cannot open the connection to 'https://data.sec.gov/submissions/CIK0000320193.json'
In addition: Warning message:
In open.connection(con, "rb") :
  cannot open URL 'https://data.sec.gov/submissions/CIK0000320193.json': HTTP status was '403 Forbidden'

I am not expecting a 403 error when connecting to this URL.

Answer 1

Score: 2

The SEC asks you to declare a user agent in the request headers: https://www.sec.gov/os/accessing-edgar-data

Apparently the one provided as an example is also accepted, though you really should provide your contact details there.

With httr2, which still uses jsonlite to parse JSON responses:

library(httr2)

resp <- request("https://data.sec.gov/submissions/CIK0000320193.json") |>
  req_user_agent("Sample Company Name AdminContact@<sample company domain>.com") |>
  # set verbosity level for debugging, 1: show headers
  req_perform(verbosity = 1)
#> -> GET /submissions/CIK0000320193.json HTTP/1.1
#> -> Host: data.sec.gov
#> -> User-Agent: Sample Company Name AdminContact@<sample company domain>.com
#> -> Accept: */*
#> -> Accept-Encoding: deflate, gzip
#> -> 
#> <- HTTP/1.1 200 OK
#> <- Content-Type: application/json
#> <- x-amzn-RequestId: c634dcbe-68aa-4777-9f18-4edfae752eb4
#> <- Access-Control-Allow-Origin: *
#> <- x-amz-apigw-id: IvJu4HiHIAMFidw=
#> <- X-Amzn-Trace-Id: Root=1-64c2bcc5-5db9315369e664da512cb6b5
#> <- Vary: Accept-Encoding
#> <- Content-Encoding: gzip
#> <- Expires: Thu, 27 Jul 2023 18:51:49 GMT
#> <- Cache-Control: max-age=0, no-cache, no-store
#> <- Pragma: no-cache
#> <- Date: Thu, 27 Jul 2023 18:51:49 GMT
#> <- Content-Length: 28594
#> <- Connection: keep-alive
#> <- Strict-Transport-Security: max-age=31536000 ; preload
#> <- Set-Cookie: ak_bmsc=E9...

resp
#> <httr2_response>
#> GET https://data.sec.gov/submissions/CIK0000320193.json
#> Status: 200 OK
#> Content-Type: application/json
#> Body: In memory (157568 bytes)

# first few keys / values from JSON:
resp_body_json(resp, simplifyVector = TRUE, flatten = TRUE) |>
  head(n = 10) |>
  str()
#> List of 10
#>  $ cik                              : chr "320193"
#>  $ entityType                       : chr "operating"
#>  $ sic                              : chr "3571"
#>  $ sicDescription                   : chr "Electronic Computers"
#>  $ insiderTransactionForOwnerExists : int 0
#>  $ insiderTransactionForIssuerExists: int 1
#>  $ name                             : chr "Apple Inc."
#>  $ tickers                          : chr "AAPL"
#>  $ exchanges                        : chr "Nasdaq"
#>  $ ein                              : chr "942404110"

Created on 2023-07-27 with reprex v2.0.2
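
Once the request succeeds, the quarter end dates the question asks about can be read from the parsed list. The following is a minimal sketch, assuming the payload exposes parallel `form` and `reportDate` vectors under `filings$recent`. That layout is an assumption about the SEC JSON and is not visible in the output above, so check it with `str()` first. The helper name `quarter_end_dates` and the mock data are hypothetical.

```r
# Sketch: pull period-end dates for periodic reports out of the parsed
# submissions JSON. The field names filings$recent$form and
# filings$recent$reportDate are ASSUMED, not confirmed by the output above.
quarter_end_dates <- function(parsed, forms = c("10-Q", "10-K")) {
  recent <- parsed$filings$recent
  keep <- recent$form %in% forms
  data.frame(
    form = recent$form[keep],
    # An explicit format yields NA (not an error) for blank or odd values.
    report_date = as.Date(recent$reportDate[keep], format = "%Y-%m-%d"),
    stringsAsFactors = FALSE
  )
}

# Mock list with the assumed shape, so the sketch runs without network access.
mock <- list(filings = list(recent = list(
  form = c("10-Q", "8-K", "10-K", "10-Q"),
  reportDate = c("2023-07-01", "2023-05-04", "2022-09-24", "2022-12-31")
)))

quarter_end_dates(mock)
```

With a real response, the same helper would presumably be called as `quarter_end_dates(resp_body_json(resp, simplifyVector = TRUE))`, provided the assumed field names hold.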

I'm in the EU and can open that JSON URL in a browser without any issues, but the default jsonlite and httr2 user agents are blocked. Using my browser's user agent with httr2 works only when I also set Accept-Language. When a request does not come from a browser, they seem to check the user agent for some odd pattern, e.g. "foo_bar" is rejected while "foo.bar" is accepted.
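
The user-agent and Accept-Language points above can be folded into a small helper. This is a sketch only: `sec_submissions_url` and `sec_request` are hypothetical names, the contact string is a placeholder for your own details, and whether Accept-Language is needed probably depends on the user agent you send.

```r
# Hypothetical helpers; the names and defaults are illustrative and are not
# part of httr2 or of any SEC client library.

# Build the submissions URL for a CIK, zero-padded to 10 digits.
sec_submissions_url <- function(cik) {
  sprintf("https://data.sec.gov/submissions/CIK%010.0f.json", as.numeric(cik))
}

# Build (but do not send) a request that identifies the caller.
# `contact` should carry your own name or company and an email address.
sec_request <- function(cik, contact, accept_language = NULL) {
  req <- httr2::request(sec_submissions_url(cik))
  req <- httr2::req_user_agent(req, contact)
  if (!is.null(accept_language)) {
    # Only added on request, e.g. when reusing a browser-style user agent.
    req <- httr2::req_headers(req, `Accept-Language` = accept_language)
  }
  req
}

# Usage (performs a network call, so it is left commented out):
# resp <- httr2::req_perform(sec_request(320193, "Jane Doe jane.doe@example.com"))
# dat  <- httr2::resp_body_json(resp, simplifyVector = TRUE)
```

Keeping request construction separate from `req_perform()` makes it easy to inspect the headers before anything is sent.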

huangapple
  • Published on July 28, 2023, 01:22:26
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76782132.html