JSON file request from site returns error 403


I'm trying to collect some data from a game box score like this: https://fibalivestats.dcd.shared.geniussports.com/u/LEGBF/2213178/

The data is stored in a JSON file ('data.json'), which I managed to download from the Network tab in Chrome DevTools. I've then been able to parse it and get the data I need.
Now I'm trying to pull the JSON directly from the URL (without downloading the file) to automate my data gathering from multiple pages of the same kind.
I'm no expert in requesting data from sites, especially when they are not static and the information is fetched dynamically via JSON/JavaScript, so forgive any bad phrasing of the concepts.

This is what I've tried so far:

from urllib.request import urlopen
import json

url = "https://fibalivestats.dcd.shared.geniussports.com/u/LEGBF/2213178/"

response = urlopen(url)
data = json.loads(response.read())

# JSON parsing and data gathering from data

which gives the error:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I then tried adding the 'data.json' at the end of the url:

url = "https://fibalivestats.dcd.shared.geniussports.com/u/LEGBF/2213178/data.json"

response = urlopen(url)
data = json.loads(response.read())

# JSON parsing and data gathering from data

which produces:

urllib.error.HTTPError: HTTP Error 403: Forbidden

From what I understand, in the first case the request just comes up empty, while in the second case it is not able to open the JSON file.
I noticed that if I haven't manually opened Chrome DevTools, the https://.../data.json page returns a 403 error; however, it loads data.json correctly after I reload the page with Ctrl+R while the Network tab is open.
My understanding is that I need to perform some action beyond requests.get() or anything similar from urllib in order to pull down the JSON file.
Could someone point me in the right direction?

Answer 1 (score: 0)


Using the correct URL in your Python script loads the JSON correctly. The confusion is that you get a 403 code rather than a 404.

The 403 code is due to the permissions on the S3 bucket, as described in this blog post and in more detail in the AWS docs:

> If you don’t have the s3:ListBucket permission, Amazon S3 will return an HTTP status code 403 (“access denied”) error.

If you look at the headers for the failed request, it reports that it is served by S3.

If you look at the Chrome developer tools while the HTML page loads, the actual URL for the data is:
https://fibalivestats.dcd.shared.geniussports.com/data/2213178/data.json
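A minimal sketch of fetching that endpoint directly. It assumes the /data/&lt;match_id&gt;/data.json pattern holds for other games; the helper name below is my own, not from the answer:

```python
import json
from urllib.request import urlopen

def data_url(match_id: str) -> str:
    # The box-score page lives at /u/LEGBF/<match_id>/, but the JSON
    # itself is served from /data/<match_id>/data.json
    return f"https://fibalivestats.dcd.shared.geniussports.com/data/{match_id}/data.json"

url = data_url("2213178")
# Uncomment to fetch over the network:
# data = json.loads(urlopen(url).read())
```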

Answer 2 (score: 0)


You can use Selenium. For example, I scraped the names of the players; you can build on this code and add whatever you want.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Assumes chromedriver is on your PATH (Selenium 4 no longer takes
# the driver path as a positional argument)
driver = webdriver.Chrome()
driver.get("https://fibalivestats.dcd.shared.geniussports.com/u/LEGBF/2213178/")

players = driver.find_elements(By.CSS_SELECTOR, 'td.player-name.team-0-summary-leaders')
for player in players:
    print(player.text)

driver.quit()
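
Whichever way the JSON is obtained, parsing it follows the same pattern as in the question. A small self-contained sketch; the payload below is a made-up stand-in, since the real data.json schema should be inspected in DevTools first:

```python
import json

# Hypothetical payload, loosely shaped like a box score; the real
# data.json uses its own field names.
raw = '{"tm": {"1": {"name": "Home"}, "2": {"name": "Away"}}}'
data = json.loads(raw)

names = [team["name"] for team in data["tm"].values()]
print(names)  # -> ['Home', 'Away']
```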

huangapple
  • Published 2023-03-07 18:12:07
  • Please keep this link when reposting: https://go.coder-hub.com/75660582.html