Web Scraping Python / Error 506 Invalid Request

Question

I am trying to scrape "https://www.ticketweb.com/search?q=". I can see the HTML elements in the browser inspector and download the page there, but when I request it via Python I only get the error below.

Here is what I have in my script:
```python
import requests

url_path = r'https://www.ticketweb.com/search?q='

HEADERS = {
    "Accept": "*/*",
    "Accept-Encoding": "utf-8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
}

response = requests.get(url_path, headers=HEADERS)

content = response.text

print(content)
```

Here is the response:

```html
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
  <head>
    <title>506 Invalid request</title>
  </head>
  <body>
    <h1>Error 506 Invalid request</h1>
    <p>Invalid request</p>
    <h3>Error 54113</h3>
    <p>Details: cache-dfw-kdfw8210093-DFW 1678372070 120734701</p>
    <hr>
    <p>Varnish cache server</p>
  </body>
</html>
```

Answer 1

Score: 4

Whenever you see a 506, rest assured the issue comes from the client you are using: the server is unable to handle your request. Since you are using requests, which clearly sends a native HTTP request, while the server end serves requests based on a specific TLS and JA3 pattern, you have to sort that out.

For instance, calling https://tls.browserleaks.com/json will give a different JA3 from Selenium than from requests; the reason behind that is TLS.
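A minimal sketch to check the fingerprint your own client presents (the ja3_hash and ja3_text field names are assumptions about the current browserleaks JSON output):

```python
# Sketch: print the JA3 fingerprint that plain requests presents.
# "ja3_hash" / "ja3_text" are assumed field names in the browserleaks JSON;
# compare them with what a real browser shows on the same endpoint.
import requests

info = requests.get("https://tls.browserleaks.com/json").json()
print(info.get("ja3_hash"), info.get("ja3_text"))
```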

You have to use a TLS client for that, since JA3 comes into play within the cipher suites; alternatively, you can pin requests to TLS v1.2 and make some cipher modifications.

In addition, you can use curl-cffi as well: https://pypi.org/project/curl-cffi/
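A minimal curl_cffi sketch, assuming the chrome110 impersonation target is available in your installed version:

```python
# Sketch: curl_cffi's requests-compatible API impersonates a real
# browser's TLS/JA3 fingerprint. "chrome110" is an assumed target name;
# use whichever targets your curl_cffi version supports.
from curl_cffi import requests

r = requests.get("https://www.ticketweb.com/search?q=", impersonate="chrome110")
print(r.status_code)
```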

And here is an example using the tls-client package:

```python
import tls_client

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.5',
}

def main():
    req = tls_client.Session(client_identifier="firefox113")
    req.headers.update(headers)
    params = {
        'page': 1
    }
    r = req.get("https://www.ticketweb.com/venue/the-new-parish-oakland-ca/428995", params=params)
    print(r)

if __name__ == "__main__":
    main()
```

Output:

```
200
```

You should have the full response.

The same works with requests as well:

```python
import ssl
import requests
from requests.adapters import HTTPAdapter
from urllib3.poolmanager import PoolManager
from urllib3.util.ssl_ import create_urllib3_context

CIPHERS = "ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384"

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.5',
}

class TlsAdapter(HTTPAdapter):
    def __init__(self, ssl_options=0, **kwargs):
        self.ssl_options = ssl_options
        super(TlsAdapter, self).__init__(**kwargs)

    def init_poolmanager(self, *pool_args, **pool_kwargs):
        ctx = create_urllib3_context(
            ciphers=CIPHERS, cert_reqs=ssl.CERT_REQUIRED, options=self.ssl_options)
        self.poolmanager = PoolManager(
            *pool_args, ssl_context=ctx, **pool_kwargs)

def main():
    adapter = TlsAdapter(ssl.OP_NO_TLSv1 | ssl.OP_NO_TLSv1_1)
    with requests.session() as req:
        req.mount("https://", adapter)
        req.headers.update(headers)
        params = {
            'page': 1
        }
        r = req.get("https://www.ticketweb.com/venue/the-new-parish-oakland-ca/428995", params=params)
        print(r)

if __name__ == "__main__":
    main()
```

Output:

```
200
```

Answer 2

Score: 3

It seems that the request headers are being critically scrutinized. I played around with them a bit, and this, for example, was a successful request at the time of writing this answer:

```python
import requests

url_path = r'https://www.ticketweb.com/search?q='

HEADERS = {
    "Accept-Language": "en-US,en",
    "Accept": "*/*;q=0.9",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
}

response = requests.get(url_path, headers=HEADERS)
response.raise_for_status()
print(response.text)
```

Here is a good explanation of the q parameter in request headers. tl;dr (as far as I understand it): it signals that a directive does not have to be handled strictly, which you as the requester accept.
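For illustration, a minimal header sketch with arbitrary q values; higher q means higher preference, and the server may pick any listed type:

```python
# q values rank preferences between 0 and 1 (default 1.0 when omitted).
# The numbers here are illustrative assumptions, not tuned values.
HEADERS = {
    "Accept": "text/html;q=1.0, application/xml;q=0.9, */*;q=0.8",
}
```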

I arrived at this solution by copying the complete request header from a Firefox request and minimizing it as far as I could, while also playing with the q parameter a bit, as mentioned.

EDIT: In the meantime, this request no longer works.

Important note

If you read the terms of use on the page, you will see something like this:

> [...] you agree that you will not:
> - Use any robot, spider [...]
> - Use any automated software or computer system to search for [...]

So it is very likely that the site owners analyze certain criteria to see whether a request was made by a browser or by a machine. If they assume a computer program is accessing the site, they can block or manipulate the response (e.g. return an empty result, or return an arbitrary status code such as 506, or even 418 if they feel like it).

That means web scraping can fail at any time, especially if the site owners don't want you to download their content automatically: site operators can always come up with new ways to prevent automated access.

If you are allowed to download the content, you will have to do more work, e.g. use the Selenium web driver, take cookies into account, humanize the request timing (sketched below), perhaps avoid always using the same IP address for automated access, use the site's caches, and so on.
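As a minimal sketch of the timing idea (the delay bounds and URLs are placeholders, not recommendations):

```python
# Sketch: space out requests with randomized, human-ish delays.
import random
import time

import requests

urls = [
    "https://www.ticketweb.com/search?q=a",  # placeholder URLs
    "https://www.ticketweb.com/search?q=b",
]

with requests.Session() as session:
    for url in urls:
        response = session.get(url)
        print(url, response.status_code)
        time.sleep(random.uniform(2.0, 8.0))  # arbitrary human-ish pause
```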

This is hard to do with the requests library alone or with plain curl. So instead of faking a human request, why not use a browser and let it make the request for you?

Here is an example of how to make the request via Selenium's browser. This should work for the URL https://www.ticketweb.com/search?q=taylor+swift together with driver.find_element(by=By.TAG_NAME, value="body"). The browser can also run headless by adding --headless to the browser options, so there is no need to see the browser UI during the process.
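A minimal sketch of that approach (the explicit wait and Chrome options are assumptions, not the original answer's code):

```python
# Sketch: fetch the page in a real (headless) browser and read the body.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("--headless")  # no visible browser UI

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.ticketweb.com/search?q=taylor+swift")
    # Wait until the body element exists, then dump the rendered HTML.
    body = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "body"))
    )
    print(body.get_attribute("innerHTML"))
finally:
    driver.quit()
```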

But again: web scraping can fail at any time, and please read the terms of use carefully to see whether you are allowed to read the page automatically at all.

BTW: utf-8 is not listed among the valid Accept-Encoding values here. But it seems you don't need that header anyway.


Answer 3

Score: 3

You can use the Google cache URL to get to the site:

https://webcache.googleusercontent.com/search?q=cache:

Also, the data for the venue and its events sits in a single <script> element and can easily be parsed.

The link I've used: https://www.ticketweb.com/venue/the-new-parish-oakland-ca/428995?page=1

For example:

```python
import json
from datetime import datetime

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.50"
}

google_cache_url = "https://webcache.googleusercontent.com/search?q=cache:"


def parse_date(event_date: str) -> str:
    return (
        datetime
        .strptime(event_date, "%Y-%m-%dT%H:%M")
        .strftime("%Y-%m-%d at %H:%M")
    )


def show_performers(performers: list) -> str:
    return ", ".join([performer["name"] for performer in performers])


def parse_event(script_element: str) -> list:
    venue_events = json.loads(script_element)
    parsed = []
    for event in venue_events:
        parsed.append(
            [
                event["name"],
                show_performers(event["performer"]),
                parse_date(event["startDate"]),
                event["offers"]["availability"],
                event["url"],
            ]
        )
    return parsed


if __name__ == "__main__":
    ticket_web_url = "https://www.ticketweb.com/venue/the-new-parish-oakland-ca/428995?page=1"

    response = requests.get(google_cache_url + ticket_web_url, headers=HEADERS)
    script = (
        BeautifulSoup(response.text, "html.parser")
        .select_one("script[type='application/ld+json']")
        .string
    )

    venue_table = pd.DataFrame(
        parse_event(script),
        columns=["Event", "Performers", "When", "Status", "URL"],
    )
    print(tabulate(venue_table, headers="keys", tablefmt="psql", showindex=False))
```

Prints:

```

+-----------------------------------------------------+------------------------------------------------+---------------------+----------+-----------------------------------------------------------------------------------------------------+
| Event                                               | Performers                                     | When                | Status   | URL                                                                                                 |
|-----------------------------------------------------+------------------------------------------------+---------------------+----------+-----------------------------------------------------------------------------------------------------|
| Homixide Gang: Snot of Not Tour                     | Homixide Gang,  Sid Shyne, Biggaveli, Lil He77 | 2023-05-25 at 20:00 | SoldOut  | https://www.ticketweb.com/event/homixide-gang-snot-of-not-the-new-parish-tickets/13096395           |
| La Sonora Dinamita, Suenatron, El Dusty             | LA Sonora Dinamita, Suenatron, El Dusty        | 2023-05-27 at 21:00 | InStock  | https://www.ticketweb.com/event/la-sonora-dinamita-suenatron-el-the-new-parish-tickets/13220068     |
| Reggae Gold XL presents: The Give Thankz Reunion    | Reggae Gold XL                                 | 2023-05-28 at 21:00 | InStock  | https://www.ticketweb.com/event/reggae-gold-xl-presents-the-the-new-parish-tickets/13228028         |
| THE OFFICIAL OAKLAND CARNIVAL AFTER-PARTY           | Oakland Carnival, SambaFunk, Kenny Mann        | 2023-06-03 at 22:00 | InStock  | https://www.ticketweb.com/event/the-official-oakland-carnival-after-the-new-parish-tickets/13236848 |
| WARD DAVIS                                          | Ward Davis                                     | 2023-06-08 at 20:00 | InStock  | https://www.ticketweb.com/event/ward-davis-the-new-parish-tickets/13127855                          |
| Casey Veggies                                       | Casey Veggies                                  | 2023-06-09 at 20:30 | InStock  | https://www.ticketweb.com/event/casey-veggies-the-new-parish-tickets/13151618                       |
| Casey Veggies                                       | Casey Veggies                                  | 2023-06-09 at 21:00 | InStock  | https://www.ticketweb.com/event/casey-veggies-the-new-parish-tickets/13160998                       |
| Mortified presents: Morti-Pride!                    | MORTIFIED                                      | 2023-06-10 at 19:30 | InStock  | https://www.ticketweb.com/event/mortified-presents-morti-pride-the-new-parish-tickets/13126705      |
| ZelooperZ: Traptastic Tour                          | ZelooperZ                                      | 2023-06-13 at 20:00 | InStock  | https://www.ticketweb.com/event/zelooperz-traptastic-tour-the-new-parish-tickets/13205488           |
| Ab-Soul: The Intelligent Movement Tour              | Ab-Soul                                        | 2023-06-14 at 20:00 | InStock  | https://www.ticketweb.com/event/ab-soul-the-intelligent-movement-the-new-parish-tickets/13156258    |
| Ab-Soul: The Intelligent Movement Tour              | Ab-Soul                                        | 2023-06-15 at 20:00 | SoldOut  | https://www.ticketweb.com/event/ab-soul-the-intelligent-movement-the-new-parish-tickets/13108785    |
| THE COLORS OF BANG YONG GUK: THE US TOUR 2023       | BANG YONGGUK                                   | 2023-06-16 at 19:00 | InStock  | https://www.ticketweb.com/event/the-colors-of-bang-yong-the-new-parish-tickets/13115845             |
| THE COLORS OF BANG YONG GUK: THE US TOUR 2023       | Bang Yongguk                                   | 2023-06-16 at 19:00 | InStock  | https://www.ticketweb.com/event/the-colors-of-bang-yong-the-new-parish-tickets/13115565             |
| BashfortheWorld                                     | BashfortheWorld                                | 2023-06-17 at 21:00 | SoldOut  | https://www.ticketweb.com/event/bashfortheworld-the-new-parish-tickets/13116985                     |
| Frank Zappa Tribute with The Stinkfoot Orchestra    | The Stinkfoot Orchestra                        | 2023-06-23 at 20:00 | InStock  | https://www.ticketweb.com/event/frank-zappa-tribute-with-the-the-new-parish-tickets/13198478        |
| Hip Hop For The People's Health And Wellness Summit | Inspectah Deck                                 | 2023-06-25 at 21:00 | InStock  | https://www.ticketweb.com/event/hip-hop-for-the-peoples-the-new-parish-tickets/13161098             |
| 03 Greedo                                           | 03 Greedo                                      | 2023-06-28 at 20:00 | InStock  | https://www.ticketweb.com/event/03-greedo-the-new-parish-tickets/13155688                           |
| 03 Greedo                                           | 03 Greedo                                      | 2023-06-29 at 20:00 | InStock  | https://www.ticketweb.com/event/03-greedo-the-new-parish-tickets/13175908                           |
| K-Pop Mixtape Party                                 | Alawn                                          | 2023-07-01 at 20:30 | InStock  | https://www.ticketweb.com/event/k-pop-mixtape-party-the-new-parish-tickets/13241338                 |
| LOJAY - GANGSTER ROMANTIC                           | Lojay                                          | 2023-07-05 at 20:00 | InStock  | https://www.ticketweb.com/event/lojay-gangster-romantic-the-new-parish-tickets/13234278             |
+-----------------------------------------------------+------------------------------------------------+---------------------+----------+-----------------------------------------------------------------------------------------------------+
```
