Web Scraping Python / Error 506 Invalid Request

Question

I am trying to scrape the website "https://www.ticketweb.com/search?q=". I can see the HTML elements in the browser's inspector and can download the page there, but when I request it via Python I only get the error below.

Here is what I have in my script:
```python
import requests

url_path = r'https://www.ticketweb.com/search?q='

HEADERS = {
    "Accept": "*/*",
    "Accept-Encoding": "utf-8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
}

response = requests.get(url_path, headers=HEADERS)

content = response.text

print(content)
```

Here is the response:

```html
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
  <head>
    <title>506 Invalid request</title>
  </head>
  <body>
    <h1>Error 506 Invalid request</h1>
    <p>Invalid request</p>
    <h3>Error 54113</h3>
    <p>Details: cache-dfw-kdfw8210093-DFW 1678372070 120734701</p>
    <hr>
    <p>Varnish cache server</p>
  </body>
</html>
```

Answer 1

Score: 4

Whenever you see a 506 like this, rest assured the issue is with the client you are using: the server refuses to handle your request. You are using requests, which sends a plain native HTTP request, while the server end screens requests against specific TLS and JA3 fingerprint patterns, so that is what you have to sort out.

For instance, calling https://tls.browserleaks.com/json will give a different JA3 for Selenium than for requests; the reason behind that is TLS.
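
You can check this yourself with a minimal sketch like the following (the exact keys in the response body, such as ja3_hash, are assumptions about what browserleaks currently returns):

```python
# Sketch: print the JA3 fingerprint that plain requests presents to servers.
import requests

data = requests.get("https://tls.browserleaks.com/json").json()
# "ja3_hash" / "ja3_text" are assumed field names in browserleaks' JSON.
print(data.get("ja3_hash"))
print(data.get("ja3_text"))
```

Running the same check from a real browser (or through Selenium) should show a different hash; that difference is the fingerprinting described above.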

You have to use a TLS client here, since the JA3 fingerprint is derived from the offered ciphers; alternatively, you can patch requests to negotiate TLS v1.2 with a modified cipher list.

In addition, you can use curl-cffi as well: https://pypi.org/project/curl-cffi/
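
A minimal curl-cffi sketch (the impersonate target is an example; which browser profiles are available depends on the installed curl-cffi version):

```python
# Sketch: curl_cffi ships browser-like TLS fingerprints via `impersonate`.
from curl_cffi import requests as curl_requests

r = curl_requests.get(
    "https://www.ticketweb.com/search?q=",
    impersonate="chrome110",  # mimic Chrome 110's TLS/JA3 fingerprint
)
print(r.status_code)
```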

Here is an example using the tls-client package:

```python
import tls_client

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.5',
}

def main():
    # Impersonate Firefox 113's TLS fingerprint instead of Python's default.
    req = tls_client.Session(client_identifier="firefox113")
    req.headers.update(headers)
    params = {
        'page': 1
    }
    r = req.get("https://www.ticketweb.com/venue/the-new-parish-oakland-ca/428995", params=params)
    print(r)

if __name__ == "__main__":
    main()
```

Output:

200

You should have the full response.

The same works for plain requests as well:

```python
import ssl
import requests
from requests.adapters import HTTPAdapter
from urllib3.poolmanager import PoolManager
from urllib3.util.ssl_ import create_urllib3_context

# Restricting the cipher list changes the JA3 fingerprint requests presents.
CIPHERS = "ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384"

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.5',
}

class TlsAdapter(HTTPAdapter):
    """Transport adapter that pins the SSL context to the ciphers above."""

    def __init__(self, ssl_options=0, **kwargs):
        self.ssl_options = ssl_options
        super(TlsAdapter, self).__init__(**kwargs)

    def init_poolmanager(self, *pool_args, **pool_kwargs):
        ctx = create_urllib3_context(
            ciphers=CIPHERS, cert_reqs=ssl.CERT_REQUIRED, options=self.ssl_options)
        self.poolmanager = PoolManager(
            *pool_args, ssl_context=ctx, **pool_kwargs)

def main():
    # Disable TLS 1.0/1.1 so the handshake negotiates TLS 1.2+.
    adapter = TlsAdapter(ssl.OP_NO_TLSv1 | ssl.OP_NO_TLSv1_1)
    with requests.session() as req:
        req.mount("https://", adapter)
        req.headers.update(headers)
        params = {
            'page': 1
        }
        r = req.get("https://www.ticketweb.com/venue/the-new-parish-oakland-ca/428995", params=params)
        print(r)

if __name__ == "__main__":
    main()
```

Output:

200

Answer 2

Score: 3

It seems that the request headers are being critically scrutinized. I played around with them a bit, and this, for example, was a successful request at the time of writing this answer:

```python
import requests

url_path = r'https://www.ticketweb.com/search?q='

HEADERS = {
    "Accept-Language": "en-US,en",
    "Accept": "*/*;q=0.9",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
}

response = requests.get(url_path, headers=HEADERS)
response.raise_for_status()
print(response.text)
```

There is a good explanation of the q parameter in request headers. tl;dr (as far as I understand it): it indicates that the directive need not be handled strictly, which you as the requester accept; see the small sketch below.
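
To make the weighting concrete, here is a small hypothetical parser (not from any library) that ranks the media types in an Accept header by their q weights:

```python
# Hypothetical illustration: rank the media types in an Accept header
# by their q weights, highest preference first.
def rank_accept(accept: str) -> list:
    ranked = []
    for part in accept.split(","):
        media, _, params = part.strip().partition(";")
        q = 1.0  # per the HTTP spec, q defaults to 1.0 when omitted
        for param in params.split(";"):
            if param.strip().startswith("q="):
                q = float(param.strip()[2:])
        ranked.append((q, media.strip()))
    return sorted(ranked, reverse=True)

print(rank_accept("text/html,application/xml;q=0.9,*/*;q=0.8"))
# -> [(1.0, 'text/html'), (0.9, 'application/xml'), (0.8, '*/*')]
```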

I arrived at this solution by copying the complete request headers from a Firefox request and minimizing them as far as I could, while also experimenting a bit with the q parameter, as mentioned.

EDIT: In the meantime, this request no longer works.

Important note

If you read the terms of use on the page, you will see something like this:

> [...] you agree that you will not:
> - Use any robot, spider [...]
> - Use any automated software or computer system to search for [...]

So it is very likely that the site owners analyze certain criteria to determine whether a request comes from a browser or from a machine. If they conclude that a computer program is accessing the site, they can block or manipulate the response (e.g. return an empty result, or an arbitrary status code such as 506, or even 418 if they feel like it).

That means: web scraping can fail at any time, especially if the site owners don't want you to download their content automatically, because site operators can always come up with new ways to prevent automated access.

If you are allowed to download the content, you will have to do more work: e.g. use the Selenium web driver, take cookies into account, humanize the request timing, perhaps avoid always using the same IP address for automated access, use the site's caches, etc.

This is hard to do with the requests library alone or with plain curl. So instead of faking a human request, why not use a browser and let it perform the request for you?

Here is an example of how to make the request via Selenium's browser; see the sketch below. This should work for the URL https://www.ticketweb.com/search?q=taylor+swift together with driver.find_element(by=By.TAG_NAME, value="body"). The browser can also run headless by adding --headless to the browser options, so you don't need to see the browser UI during the process.
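
A minimal sketch of that approach (assumes Chrome and a matching driver are installed; Selenium 4 syntax):

```python
# Sketch: let a real browser perform the request, then read the rendered body.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")  # no visible browser window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.ticketweb.com/search?q=taylor+swift")
    body = driver.find_element(by=By.TAG_NAME, value="body")
    print(body.text)
finally:
    driver.quit()
```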

But again: web scraping can fail at any time. Please read the terms of use carefully to see whether you are allowed to read the page automatically at all.

BTW: utf-8 is not a valid Accept-Encoding value (those are content codings like gzip, deflate, and br). But it seems you don't need that header anyway.


Answer 3

Score: 3

You can use the Google cache URL to get to the site:

https://webcache.googleusercontent.com/search?q=cache:

Also, the data for the venue and its events sits in a single <script type="application/ld+json"> element and can easily be parsed.

The link I've used: https://www.ticketweb.com/venue/the-new-parish-oakland-ca/428995?page=1

For example:

```python
import json
from datetime import datetime

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.50"
}

# Prefix a URL with this to fetch Google's cached copy instead of the live site.
google_cache_url = "https://webcache.googleusercontent.com/search?q=cache:"


def parse_date(event_date: str) -> str:
    return (
        datetime
        .strptime(event_date, "%Y-%m-%dT%H:%M")
        .strftime("%Y-%m-%d at %H:%M")
    )


def show_performers(performers: list) -> str:
    return ", ".join([performer["name"] for performer in performers])


def parse_event(script_element: str) -> list:
    venue_events = json.loads(script_element)
    parsed = []
    for event in venue_events:
        parsed.append(
            [
                event["name"],
                show_performers(event["performer"]),
                parse_date(event["startDate"]),
                event["offers"]["availability"],
                event["url"],
            ]
        )
    return parsed


if __name__ == "__main__":
    ticket_web_url = "https://www.ticketweb.com/venue/the-new-parish-oakland-ca/428995?page=1"

    response = requests.get(google_cache_url + ticket_web_url, headers=HEADERS)
    # The venue's events are embedded as JSON-LD in a single <script> tag.
    script = (
        BeautifulSoup(response.text, "html.parser")
        .select_one("script[type='application/ld+json']")
        .string
    )

    venue_table = pd.DataFrame(
        parse_event(script),
        columns=["Event", "Performers", "When", "Status", "URL"],
    )
    print(tabulate(venue_table, headers="keys", tablefmt="psql", showindex=False))
```
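
For reference, parse_event expects the JSON-LD payload to be a list of event objects shaped roughly like this (an illustration of the keys the code reads; the values are made up):

```python
# Illustrative only: the minimal structure parse_event() relies on.
sample_events = [
    {
        "name": "Example Show",
        "performer": [{"name": "Example Artist"}],
        "startDate": "2023-06-01T20:00",
        "offers": {"availability": "InStock"},
        "url": "https://www.ticketweb.com/event/example",
    }
]
```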


Running the script prints:

```text
+-----------------------------------------------------+------------------------------------------------+---------------------+----------+-----------------------------------------------------------------------------------------------------+
| Event                                               | Performers                                     | When                | Status   | URL                                                                                                 |
|-----------------------------------------------------+------------------------------------------------+---------------------+----------+-----------------------------------------------------------------------------------------------------|
| Homixide Gang: Snot of Not Tour                     | Homixide Gang,  Sid Shyne, Biggaveli, Lil He77 | 2023-05-25 at 20:00 | SoldOut  | https://www.ticketweb.com/event/homixide-gang-snot-of-not-the-new-parish-tickets/13096395           |
| La Sonora Dinamita, Suenatron, El Dusty             | LA Sonora Dinamita, Suenatron, El Dusty        | 2023-05-27 at 21:00 | InStock  | https://www.ticketweb.com/event/la-sonora-dinamita-suenatron-el-the-new-parish-tickets/13220068     |
| Reggae Gold XL presents: The Give Thankz Reunion    | Reggae Gold XL                                 | 2023-05-28 at 21:00 | InStock  | https://www.ticketweb.com/event/reggae-gold-xl-presents-the-the-new-parish-tickets/13228028         |
| THE OFFICIAL OAKLAND CARNIVAL AFTER-PARTY           | Oakland Carnival, SambaFunk, Kenny Mann        | 2023-06-03 at 22:00 | InStock  | https://www.ticketweb.com/event/the-official-oakland-carnival-after-the-new-parish-tickets/13236848 |
| WARD DAVIS                                          | Ward Davis                                     | 2023-06-08 at 20:00 | InStock  | https://www.ticketweb.com/event/ward-davis-the-new-parish-tickets/13127855                          |
| Casey Veggies                                       | Casey Veggies                                  | 2023-06-09 at 20:30 | InStock  | https://www.ticketweb.com/event/casey-veggies-the-new-parish-tickets/13151618                       |
| Casey Veggies                                       | Casey Veggies                                  | 2023-06-09 at 21:00 | InStock  | https://www.ticketweb.com/event/casey-veggies-the-new-parish-tickets/13160998                       |
| Mortified presents: Morti-Pride!                    | MORTIFIED                                      | 2023-06-10 at 19:30 | InStock  | https://www.ticketweb.com/event/mortified-presents-morti-pride-the-new-parish-tickets/13126705      |
| ZelooperZ: Traptastic Tour                          | ZelooperZ                                      | 2023-06-13 at 20:00 | InStock  | https://www.ticketweb.com/event/zelooperz-traptastic-tour-the-new-parish-tickets/13205488           |
| Ab-Soul: The Intelligent Movement Tour              | Ab-Soul                                        | 2023-06-14 at 20:00 | InStock  | https://www.ticketweb.com/event/ab-soul-the-intelligent-movement-the-new-parish-tickets/13156258    |
| Ab-Soul: The Intelligent Movement Tour              | Ab-Soul                                        | 2023-06-15 at 20:00 | SoldOut  | https://www.ticketweb.com/event/ab-soul-the-intelligent-movement-the-new-parish-tickets/13108785    |
| THE COLORS OF BANG YONG GUK: THE US TOUR 2023       | BANG YONGGUK                                   | 2023-06-16 at 19:00 | InStock  | https://www.ticketweb.com/event/the-colors-of-bang-yong-the-new-parish-tickets/13115845             |
| THE COLORS OF BANG YONG GUK: THE US TOUR 2023       | Bang Yongguk                                   | 2023-06-16 at 19:00 | InStock  | https://www.ticketweb.com/event/the-colors-of-bang-yong-the-new-parish-tickets/13115565             |
| BashfortheWorld                                     | BashfortheWorld                                | 2023-06-17 at 21:00 | SoldOut  | https://www.ticketweb.com/event/bashfortheworld-the-new-parish-tickets/13116985                     |
| Frank Zappa Tribute with The Stinkfoot Orchestra    | The Stinkfoot Orchestra                        | 2023-06-23 at 20:00 | InStock  | https://www.ticketweb.com/event/frank-zappa-tribute-with-the-the-new-parish-tickets/13198478        |
| Hip Hop For The People's Health And Wellness Summit | Inspectah Deck                                 | 2023-06-25 at 21:00 | InStock  | https://www.ticketweb.com/event/hip-hop-for-the-peoples-the-new-parish-tickets/13161098             |
| 03 Greedo                                           | 03 Greedo                                      | 2023-06-28 at 20:00 | InStock  | https://www.ticketweb.com/event/03-greedo-the-new-parish-tickets/13155688                           |
| 03 Greedo                                           | 03 Greedo                                      | 2023-06-29 at 20:00 | InStock  | https://www.ticketweb.com/event/03-greedo-the-new-parish-tickets/13175908                           |
| K-Pop Mixtape Party                                 | Alawn                                          | 2023-07-01 at 20:30 | InStock  | https://www.ticketweb.com/event/k-pop-mixtape-party-the-new-parish-tickets/13241338                 |
| LOJAY - GANGSTER ROMANTIC                           | Lojay                                          | 2023-07-05 at 20:00 | InStock  | https://www.ticketweb.com/event/lojay-gangster-romantic-the-new-parish-tickets/13234278             |
+-----------------------------------------------------+------------------------------------------------+---------------------+----------+-----------------------------------------------------------------------------------------------------+
```
