Why am I getting a 403 error when connecting to a URL that works?
Question
I am trying to pull the quarter-end dates for a company from the SEC website. For some reason I keep getting a connection error. The code works for my friend in the US, but not for me in Canada. I tried using a VPN, but still got the same error. Here is the code and the error I am getting.
When I put the URL into Google, it brings me to the page with all the information, so I am not sure why I can't pull it into R.
library(derivmkts)
library(quantmod)
library(jsonlite)
library(tidyverse)
url <- "https://data.sec.gov/submissions/CIK0000320193.json"
df <- fromJSON(url, flatten = TRUE)
Error in open.connection(con, "rb") :
cannot open the connection to 'https://data.sec.gov/submissions/CIK0000320193.json'
In addition: Warning message:
In open.connection(con, "rb") :
cannot open URL 'https://data.sec.gov/submissions/CIK0000320193.json': HTTP status was '403 Forbidden'
I do not expect a 403 error when connecting to this URL.
Answer 1
Score: 2
The SEC asks you to declare a user agent in the request headers: https://www.sec.gov/os/accessing-edgar-data
Apparently the user agent given there as an example is also accepted, though you really should provide your own contact details.
With httr2, which still uses jsonlite to parse JSON responses:
library(httr2)
resp <- request("https://data.sec.gov/submissions/CIK0000320193.json") |>
req_user_agent("Sample Company Name AdminContact@<sample company domain>.com") |>
# set verbosity level for debugging, 1: show headers
req_perform(verbosity = 1)
#> -> GET /submissions/CIK0000320193.json HTTP/1.1
#> -> Host: data.sec.gov
#> -> User-Agent: Sample Company Name AdminContact@<sample company domain>.com
#> -> Accept: */*
#> -> Accept-Encoding: deflate, gzip
#> ->
#> <- HTTP/1.1 200 OK
#> <- Content-Type: application/json
#> <- x-amzn-RequestId: c634dcbe-68aa-4777-9f18-4edfae752eb4
#> <- Access-Control-Allow-Origin: *
#> <- x-amz-apigw-id: IvJu4HiHIAMFidw=
#> <- X-Amzn-Trace-Id: Root=1-64c2bcc5-5db9315369e664da512cb6b5
#> <- Vary: Accept-Encoding
#> <- Content-Encoding: gzip
#> <- Expires: Thu, 27 Jul 2023 18:51:49 GMT
#> <- Cache-Control: max-age=0, no-cache, no-store
#> <- Pragma: no-cache
#> <- Date: Thu, 27 Jul 2023 18:51:49 GMT
#> <- Content-Length: 28594
#> <- Connection: keep-alive
#> <- Strict-Transport-Security: max-age=31536000 ; preload
#> <- Set-Cookie: ak_bmsc=E9...
resp
#> <httr2_response>
#> GET https://data.sec.gov/submissions/CIK0000320193.json
#> Status: 200 OK
#> Content-Type: application/json
#> Body: In memory (157568 bytes)
# first few keys / values from JSON:
resp_body_json(resp, simplifyVector = TRUE, flatten = TRUE) |>
head(n = 10) |>
str()
#> List of 10
#> $ cik : chr "320193"
#> $ entityType : chr "operating"
#> $ sic : chr "3571"
#> $ sicDescription : chr "Electronic Computers"
#> $ insiderTransactionForOwnerExists : int 0
#> $ insiderTransactionForIssuerExists: int 1
#> $ name : chr "Apple Inc."
#> $ tickers : chr "AAPL"
#> $ exchanges : chr "Nasdaq"
#> $ ein : chr "942404110"
Created on 2023-07-27 with reprex v2.0.2
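Since the question was about quarter-end dates, here is a minimal sketch of how they might be pulled out of the parsed response. It assumes the period-end dates sit in filings$recent$reportDate alongside filings$recent$form, which is not shown in the output above, so check the names against your own response. The helper name report_dates and the mock data are hypothetical and only there so the example runs offline.

```r
# Assumption: `submissions` is the parsed list, e.g. from
# resp_body_json(resp, simplifyVector = TRUE), and it contains
# filings$recent$form and filings$recent$reportDate as parallel vectors.
report_dates <- function(submissions, forms = c("10-Q", "10-K")) {
  recent <- submissions$filings$recent
  # keep only periodic reports that actually carry a report date
  keep <- recent$form %in% forms & nzchar(recent$reportDate)
  sort(unique(as.Date(recent$reportDate[keep])))
}

# Hand-made stand-in for the real response (hypothetical values).
mock <- list(filings = list(recent = list(
  form       = c("10-Q", "8-K", "10-K", "10-Q"),
  reportDate = c("2023-04-01", "2023-05-04", "2022-09-24", "2022-12-31")
)))

report_dates(mock)
```

With a real response you would call report_dates() on the result of resp_body_json() instead of on mock. Including "10-K" is a design choice: the fiscal year end is normally reported there rather than in a 10-Q, so drop it from forms if you only want the interim quarters.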
I'm from the EU. I can open that JSON URL in the browser without any issues, but the default jsonlite and httr2 user agents are blocked. Using my browser's user agent with httr2 works only when I also set accept-language. They seem to check for some odd patterns in the user agent when the request does not come from a browser, e.g. "foo_bar" is rejected while "foo.bar" is accepted.
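To make the header setup reusable, something along these lines might work. It is only a sketch: the helper name sec_request, the contact string, and the Accept-Language value are made up for illustration, and whether a given combination gets through is up to the SEC's filtering, so treat it as a starting point rather than a guaranteed fix.

```r
library(httr2)

# Hypothetical helper: builds a request with a declared user agent and an
# Accept-Language header. It does not send anything yet.
sec_request <- function(url, contact, language = "en-US,en;q=0.9") {
  request(url) |>
    req_user_agent(contact) |>
    req_headers(`Accept-Language` = language)
}

# Placeholder contact details; replace with your own name and email.
req <- sec_request(
  "https://data.sec.gov/submissions/CIK0000320193.json",
  contact = "Jane Doe jane.doe@example.com"
)

# Sending the request needs network access:
# resp <- req_perform(req)
# resp_status(resp)
```

Keeping the request construction separate from req_perform() lets you inspect the request (for example with req_dry_run()) before it goes out.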