JSON file request from site returns error 403

Question

I'm trying to collect some data from a game box score like this: https://fibalivestats.dcd.shared.geniussports.com/u/LEGBF/2213178/

The data is stored in a JSON file ('data.json'), which I managed to download from the Network tab in Chrome DevTools. I was then able to parse it and get the data I need.
Now I'm trying to pull the JSON directly from the URL (without downloading the file by hand) to automate data gathering from multiple pages of the same kind.
I'm no expert in requesting data from sites, especially when they are not static and the information is loaded dynamically through JSON/JavaScript, so forgive any imprecise phrasing of the concepts.

This is what I've tried so far:

import json
from urllib.request import urlopen

url = "https://fibalivestats.dcd.shared.geniussports.com/u/LEGBF/2213178/"

response = urlopen(url)
data = json.loads(response.read())

# JSON parsing and data gathering from data

which gives the error:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I then tried appending 'data.json' to the URL:

url = "https://fibalivestats.dcd.shared.geniussports.com/u/LEGBF/2213178/data.json"

response = urlopen(url)
data = json.loads(response.read())

# JSON parsing and data gathering from data

which produces:

urllib.error.HTTPError: HTTP Error 403: Forbidden

From what I understand, in the first case the request just comes up empty, while in the second case it is not able to open the JSON file.
I noticed that unless I have manually opened the Chrome DevTools page, the https://.../data.json page returns error 403; however, data.json loads correctly after I reload the page with Ctrl+R on the Network tab.
My understanding is that I need to perform some other action beyond requests.get() or anything similar from urllib in order to pull down the JSON file.
Could someone point me in the right direction?

Answer 1

Score: 0

With the correct URL, your Python script loads the JSON without any extra steps. The confusing part is that the wrong URL returns a 403 rather than a 404.

The 403 is due to the permissions on the S3 bucket, as described in this blog post and in more detail in the AWS docs:

> If you don’t have the s3:ListBucket permission, Amazon S3 will return an HTTP status code 403 (“access denied”) error.

If you look at the headers for the failed request, it reports that it is served by S3.
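
One way to check this from Python is sketched below. It assumes the S3 origin identifies itself in the `Server` response header; the helper names `describe_http_error` and `probe` are made up for illustration.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

# URL from the question that fails with 403
BAD_URL = "https://fibalivestats.dcd.shared.geniussports.com/u/LEGBF/2213178/data.json"


def describe_http_error(err):
    """Return (status code, Server header or None) for an HTTPError."""
    return err.code, err.headers.get("Server")


def probe(url):
    """Request url; return None on success, or the error summary on failure."""
    try:
        with urlopen(url):
            return None
    except HTTPError as err:
        return describe_http_error(err)
```

Calling `probe(BAD_URL)` should return the status code together with whatever the `Server` header contains. No output is shown here because it depends on the live site.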

If you look at the Chrome developer tools while the HTML page loads, the URL for the data is actually:
https://fibalivestats.dcd.shared.geniussports.com/data/2213178/data.json
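
A minimal sketch of fetching that file with the standard library follows. The URL template is an assumption generalized from the single URL above (other match IDs may or may not follow the same pattern), and the helper names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumption: pattern inferred from the one data URL quoted in this answer.
DATA_URL_TEMPLATE = (
    "https://fibalivestats.dcd.shared.geniussports.com/data/{match_id}/data.json"
)


def build_data_url(match_id):
    """Return the data.json URL for one match ID."""
    return DATA_URL_TEMPLATE.format(match_id=match_id)


def parse_box_score(raw):
    """Decode the raw response bytes and parse them as JSON."""
    return json.loads(raw.decode("utf-8"))


def fetch_box_score(match_id):
    """Download and parse data.json for one match (needs network access)."""
    with urlopen(build_data_url(match_id)) as response:
        return parse_box_score(response.read())
```

To collect several games, something like `[fetch_box_score(m) for m in match_ids]` would work, where `match_ids` is a list of IDs taken from the box score URLs.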

Answer 2

Score: 0

You can use Selenium. For example, I scraped the players' names; you can extend the code to collect whatever else you need.

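from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Box score page from the question
url = "https://fibalivestats.dcd.shared.geniussports.com/u/LEGBF/2213178/"

# Selenium 4 takes the chromedriver path through a Service object
service = Service(r'C:\Users\Krieg\Downloads\chromedriver_win32\chromedriver.exe')
driver = webdriver.Chrome(service=service)
driver.get(url)

# Collect the cells that match the player-name selector and print their text
players = driver.find_elements(By.CSS_SELECTOR, 'td.player-name.team-0-summary-leaders')
for player in players:
    print(player.text)

driver.quit()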

huangapple
  • Posted on March 7, 2023, 18:12:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75660582.html