Why did my web-scraping method stop working on one particular site?


Question

Several months ago I regularly used a Python script to scrape and parse basketball odds from a particular website. After a couple of months without using it, I tried to run the same script, only to find that it now throws an error.

I'm looking for 1) the reason the script now fails, and 2) a functioning workaround.

The line of code that is the source of the error is below. I use this method to scrape other websites without issue.

```py
source = requests.get('https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball').json()
```

Previously, the above command would acquire usable source data. Now it fails with:

```
JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```
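That error means the response body is not JSON at all (often an empty body or an HTML block page), so `.json()` fails at the very first character. A minimal diagnostic sketch, assuming the same endpoint, is to inspect the status code and raw body before decoding:

```py
import requests

url = 'https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball'
response = requests.get(url)

# "Expecting value: line 1 column 1 (char 0)" from .json() means the body
# didn't start with JSON; check what the server actually returned.
print(response.status_code)
print(response.headers.get('Content-Type'))
print(response.text[:500])
```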

I tried an alternate scraping method for the same target site. Interestingly, when I enter the commands below line by line, I can successfully acquire the data. When I run the same code as a script, no data is acquired.

```py
browser = webdriver.Chrome()
browser.get('https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball')
page_source = browser.page_source
```
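One plausible explanation for the interactive-versus-scripted difference is timing: typed line by line, the page has several seconds to finish loading before `page_source` is read, while a script reads it immediately after `get()` returns. A hedged sketch, assuming the missing step is simply waiting for the document to finish loading:

```py
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

browser = webdriver.Chrome()
browser.get('https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball')

# Wait up to 10 seconds for the browser to report the page fully loaded,
# mimicking the pause that happens naturally when typing commands by hand.
WebDriverWait(browser, 10).until(
    lambda driver: driver.execute_script('return document.readyState') == 'complete'
)
page_source = browser.page_source
browser.quit()
```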

Is this specific target site somehow protected against automated scraping? Are there any workarounds?


Answer 1

Score: 1

I was able to get a correct response from the server by setting the `User-Agent` header and disabling caching with a dummy URL parameter. E.g.:

```py
import time
import requests

api_url = 'https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
}

source = requests.get(api_url, headers=headers, params={'_t': int(time.time())})
print(source.json())
```

Prints:

```
[
    {
        "path": [
            {
                "id": "11344232",
                "link": "/basketball/nba-futures/nba-championship-2023-24",
                "description": "NBA Championship 2023/24",
                "type": "LEAGUE",
                "sportCode": "BASK",
                "order": 9223372036854775807,

...
```
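As for why this works: the default `requests` User-Agent (`python-requests/x.y`) is easy for a server or CDN to fingerprint and block, so presenting a browser User-Agent likely restores normal handling; the throwaway `_t` timestamp parameter makes each request URL unique, so no intermediate cache can serve a stale non-JSON response.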

Answer 2

Score: 0

This works when 1) a valid User-Agent is set, and 2) a `requests.Session` is used to fetch the homepage first (it may set a cookie the API endpoint requires).

```py
import requests
from pprint import pp

base_url = 'https://www.bovada.lv'
url = 'https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball'

session, timeout = requests.Session(), 3.05
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:104.0) Gecko/20100101 Firefox/104.0'
})
session.mount(base_url, requests.adapters.HTTPAdapter())

# Fetch the homepage first so the session picks up any cookies,
# then request the JSON API endpoint with the same session.
response = session.get(base_url, timeout=timeout)
response = session.get(url, timeout=timeout)
pp(response.json())
```
