Why did my web-scraping method stop working on one particular site?
Question
Several months ago, I regularly used a Python script to scrape and parse basketball odds from a particular website. After a couple of months without using it, I tried to run the same script, only to find that it now throws an error.
I'm looking for 1) the reason the script now fails, and 2) a functioning workaround.
The line of code that is the source of the error is below. I use this method to scrape other websites without issue.
```py
source = requests.get('https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball').json()
```
Previously, the above command would acquire usable source data. Now it raises:

```
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```
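The error itself only says the response body did not start with valid JSON; a blocked or bot-filtered request commonly gets an HTML page back instead of JSON, which fails to decode at the very first character. A minimal stdlib reproduction (the HTML body is a hypothetical stand-in for whatever the server actually returned):

```python
import json

# Hypothetical body a bot-filter might return instead of JSON
html_body = '<html><body>Access denied</body></html>'
try:
    data = json.loads(html_body)
except json.JSONDecodeError as err:
    # The first character '<' is not a valid start of a JSON value
    error_message = str(err)
    print(error_message)  # Expecting value: line 1 column 1 (char 0)
```

Inspecting `response.status_code` and `response.text` before calling `.json()` shows what the server really sent.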
I tried an alternate scraping method for the same target site. Interestingly, when I enter the commands below line by line, I can successfully acquire the data. When I run the code as a script, no data is acquired.
```py
browser = webdriver.Chrome()
browser.get('https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball')
page_source = browser.page_source
```
Is this specific target site somehow protected against automated scraping? Are there any workarounds?
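One possible explanation for the line-by-line vs. script difference (an assumption, not verified): typed interactively, the page has several seconds to finish loading before `page_source` is read, while a script reads it immediately. Selenium's `WebDriverWait` is the idiomatic fix; the idea reduces to a generic polling helper, sketched here stdlib-only with a hypothetical `wait_for` name:

```python
import time

def wait_for(condition, timeout=10.0, interval=0.5):
    """Poll `condition` until it returns a truthy value, or raise on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        value = condition()
        if value:
            return value
        time.sleep(interval)
    raise TimeoutError('condition not met within timeout')

result = wait_for(lambda: 42, timeout=1.0)
print(result)  # 42
```

In the Selenium script one would poll something like `lambda: '"path"' in browser.page_source and browser.page_source` instead of reading `page_source` right after `get()`.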
Answer 1
Score: 1
I was able to get a correct response from the server by setting the `User-Agent` header and forcing the cache to be bypassed with a dummy URL parameter. E.g.:
```py
import time
import requests

api_url = 'https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
}
source = requests.get(api_url, headers=headers, params={'_t': int(time.time())})
print(source.json())
```
Prints:

```
[
    {
        "path": [
            {
                "id": "11344232",
                "link": "/basketball/nba-futures/nba-championship-2023-24",
                "description": "NBA Championship 2023/24",
                "type": "LEAGUE",
                "sportCode": "BASK",
                "order": 9223372036854775807,
                ...
```
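The dummy `_t` parameter works because it makes every request URL unique, so no intermediate cache can serve a stale (possibly blocked) response. A stdlib-only sketch of the URL this produces:

```python
import time
from urllib.parse import urlencode

api_url = 'https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball'
# A timestamp query string yields a URL no cache has seen before
cache_bust = urlencode({'_t': int(time.time())})
url = f'{api_url}?{cache_bust}'
print(url)
```

Passing `params={'_t': int(time.time())}` to `requests.get` appends the same query string automatically.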
Answer 2
Score: 0
It works when 1) a valid `User-Agent` is set and 2) a `requests.Session` is used to get the homepage first (maybe it sets some cookie).
```py
import requests
from pprint import pp

base_url = 'https://www.bovada.lv'
url = (
    'https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball'
)
session, timeout = requests.Session(), 3.05
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:104.0) Gecko/20100101 Firefox/104.0'
})
session.mount(base_url, requests.adapters.HTTPAdapter())
response = session.get(base_url, timeout=timeout)  # visit the homepage first
response = session.get(url, timeout=timeout)
pp(response.json())
```
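Why fetching the homepage first can help (if the cookie theory is right): `requests.Session` stores any `Set-Cookie` values from the first response and resends them, together with the session-level headers, on subsequent requests. A sketch showing the merging without touching the network; the cookie name `visited` is made up:

```python
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'})
# Simulate a cookie a homepage visit might have set (name/value are made up)
session.cookies.set('visited', '1', domain='www.bovada.lv')

# prepare_request merges session headers and cookies into the outgoing
# request without sending anything
prepared = session.prepare_request(requests.Request(
    'GET',
    'https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball',
))
print(prepared.headers['User-Agent'])
print(prepared.headers.get('Cookie'))  # visited=1
```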
Comments