How to make Selenium-Wire perform the indirect GraphQL AJAX request I expect and need?

Question

Background: I need to obtain the handles of the Twitter users tagged in a Tweet's attached media. Unfortunately, there is currently no API method for this (see https://twittercommunity.com/t/how-to-get-tags-of-a-media-in-a-tweet/185614 and https://github.com/twitterdev/open-evolution/issues/34).
I have no choice but to scrape. An example URL is https://twitter.com/justinwood_/status/1626275168157851650/media_tags. This is the page that pops up when you click the tags link under the media of the parent Tweet: https://twitter.com/justinwood_/status/1626275168157851650/

The React-generated DOM is deep and ugly but would be scrapeable; however, I do not want to log in with any account and risk getting banned. Unfortunately, when you visit https://twitter.com/justinwood_/status/1626275168157851650/media_tags in an Incognito window, the popup is completely empty. However, when I dig into the network requests, the response from the /TweetDetail GraphQL endpoint is full of messages about the anonymous page visit, yet it still contains the list of handles I need.

So what I need is a scraper that can process JavaScript and capture the response to that specific GraphQL call. Selenium drives a headless Chrome here, so it can process JavaScript, and Selenium-Wire adds the ability to capture the response.

Unfortunately, my Selenium-Wire script only captures the TweetResultByRestId and UsersByRestId GraphQL requests; TweetDetail is missing. I don't know what to tweak to make all the requests happen. I have iterated over a ton of Chrome options. Here is one variation of my script:

from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless") # for Jenkins
chrome_options.add_argument("--disable-dev-shm-usage") # Jenkins
chrome_options.add_argument('--start-maximized')
chrome_options.add_argument('--window-size=1900,1080')
chrome_options.add_argument('--ignore-certificate-errors-spki-list')
chrome_options.add_argument('--ignore-ssl-errors')

selenium_options = {
    'request_storage_base_dir': '/tmp', # Use /tmp to store captured data
    'exclude_hosts': ''
}

ser = Service('/usr/bin/chromedriver')
ser.service_args=["--verbose", "--log-path=test.log"]

driver = webdriver.Chrome(service=ser, options=chrome_options, seleniumwire_options=selenium_options)

tweet_id = "1626275168157851650"
twitter_media_url = f"https://twitter.com/justinwood_/status/{tweet_id}/media_tags"
driver.get(twitter_media_url)
driver.wait_for_request("/TweetDetail", timeout=10)

Any ideas?
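
One way to check which GraphQL operations the browser actually issued is to list the operation names from the captured request URLs. The sketch below assumes that each GraphQL URL contains a `/graphql/` path segment and ends with the operation name (for example `.../graphql/<id>/TweetDetail`); that URL shape is an assumption, and `graphql_operation_names` is a hypothetical helper, not part of Selenium-Wire.

```python
from urllib.parse import urlparse


def graphql_operation_names(urls):
    """Return the last path segment of every URL whose path contains /graphql/.

    Assumption: the operation name is the final path segment of the URL.
    """
    names = []
    for url in urls:
        path = urlparse(url).path
        if "/graphql/" in path:
            names.append(path.rstrip("/").rsplit("/", 1)[-1])
    return names


# With the Selenium-Wire driver from the script above (after driver.get):
# print(graphql_operation_names(r.url for r in driver.requests))
```

If TweetDetail is absent from the printed list, the page never made that call, so no amount of waiting in `wait_for_request` will help.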

Answer 1

Score: 0

Apparently I need to scrape the parent Tweet URL https://twitter.com/justinwood_/status/1626275168157851650/ instead; with that URL, the GraphQL call I was after does seem to happen. I probably got confused while trying 100 combinations.
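
A minimal sketch of that approach follows: load the parent Tweet URL, wait for the TweetDetail request, decode the response body, and collect the handles. The helper names (`parent_tweet_url`, `collect_screen_names`, `fetch_tweet_detail`) are hypothetical, and the assumption that handles appear under `screen_name` keys somewhere in the JSON is not confirmed above; the real response layout may differ.

```python
import json
import re


def parent_tweet_url(media_tags_url):
    """Turn a .../media_tags URL into the parent Tweet URL."""
    return re.sub(r"/media_tags/?$", "/", media_tags_url)


def collect_screen_names(node, found=None):
    """Walk nested JSON and collect unique string values stored under 'screen_name'.

    Assumption: user handles are stored under that key in the response.
    """
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "screen_name" and isinstance(value, str):
                if value not in found:
                    found.append(value)
            else:
                collect_screen_names(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_screen_names(item, found)
    return found


def fetch_tweet_detail(url, timeout=10):
    """Load url in headless Chrome and return the parsed TweetDetail response."""
    # Imported here so the pure helpers above work without a browser installed.
    from seleniumwire import webdriver
    from seleniumwire.utils import decode

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--window-size=1900,1080")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # The pattern is matched against the request URL.
        request = driver.wait_for_request("/TweetDetail", timeout=timeout)
        response = request.response
        if response is None:
            raise RuntimeError("TweetDetail request has no response yet")
        encoding = response.headers.get("Content-Encoding", "identity")
        return json.loads(decode(response.body, encoding).decode("utf-8"))
    finally:
        driver.quit()


# Example usage (needs Chrome, chromedriver, and network access):
# url = parent_tweet_url(
#     "https://twitter.com/justinwood_/status/1626275168157851650/media_tags")
# print(collect_screen_names(fetch_tweet_detail(url)))
```

Note that `collect_screen_names` gathers every handle in the response, which would likely include the Tweet's author as well as the tagged users, so the result may need filtering once the actual JSON layout is known.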

Posted by huangapple on February 19, 2023, 05:06:07. Original link: https://go.coder-hub.com/75496395.html