Why did my web-scraping method stop working on one particular site?


Question

Several months ago I regularly used a Python script to scrape and parse basketball odds from a particular website. After a couple of months without using it, I tried to run the same script, only to find that it now throws an error.

I'm looking for 1) the reason the script now fails, and 2) a functioning workaround.

The line of code that is the source of the error is below. I use this method to scrape other websites without issue.

```py
source = requests.get('https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball').json()
```

Previously, the above command would acquire usable source data. Now it fails with:

```
JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```
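That error means the response body is not JSON at all (often an empty body or an HTML block page), so `.json()` fails at the very first character. A minimal diagnostic sketch, assuming the same endpoint, is to inspect the status code and raw body before decoding:

```py
import requests

url = 'https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball'
response = requests.get(url)

# "Expecting value: line 1 column 1 (char 0)" from .json() means the body
# didn't start with JSON; check what the server actually returned.
print(response.status_code)
print(response.headers.get('Content-Type'))
print(response.text[:500])
```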

I tried an alternate scraping method for the same target site. Interestingly, when I enter the commands below line by line, I can successfully acquire the data. When I run the same code as a script, no data is acquired.

```py
browser = webdriver.Chrome()
browser.get('https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball')
page_source = browser.page_source
```
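One plausible explanation for the interactive-versus-scripted difference is timing: typed line by line, the page has several seconds to finish loading before `page_source` is read, while a script reads it immediately after `get()` returns. A hedged sketch, assuming the missing step is simply waiting for the document to finish loading:

```py
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

browser = webdriver.Chrome()
browser.get('https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball')

# Wait up to 10 seconds for the browser to report the page fully loaded,
# mimicking the pause that happens naturally when typing commands by hand.
WebDriverWait(browser, 10).until(
    lambda driver: driver.execute_script('return document.readyState') == 'complete'
)
page_source = browser.page_source
browser.quit()
```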

Is this specific target site somehow protected against automated scraping? Are there any workarounds?


Answer 1

Score: 1

I was able to get a correct response from the server by setting the `User-Agent` header and disabling caching with a dummy URL parameter. E.g.:

```py
import time
import requests

api_url = 'https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
}

source = requests.get(api_url, headers=headers, params={'_t': int(time.time())})
print(source.json())
```

Prints:

```
[
    {
        "path": [
            {
                "id": "11344232",
                "link": "/basketball/nba-futures/nba-championship-2023-24",
                "description": "NBA Championship 2023/24",
                "type": "LEAGUE",
                "sportCode": "BASK",
                "order": 9223372036854775807,

...
```
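As for why this works: the default `requests` User-Agent (`python-requests/x.y`) is easy for a server or CDN to fingerprint and block, so presenting a browser User-Agent likely restores normal handling; the throwaway `_t` timestamp parameter makes each request URL unique, so no intermediate cache can serve a stale non-JSON response.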

Answer 2

Score: 0

This works when 1) a valid User-Agent is set, and 2) a `requests.Session` is used to fetch the homepage first (it may set a cookie the API endpoint requires).

```py
import requests
from pprint import pp

base_url = 'https://www.bovada.lv'
url = 'https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball'

session, timeout = requests.Session(), 3.05
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:104.0) Gecko/20100101 Firefox/104.0'
})
session.mount(base_url, requests.adapters.HTTPAdapter())

# Fetch the homepage first so the session picks up any cookies,
# then request the JSON API endpoint with the same session.
response = session.get(base_url, timeout=timeout)
response = session.get(url, timeout=timeout)
pp(response.json())
```
