How can I make Selenium-Wire perform the indirect GraphQL AJAX request I expect and need?
Question
Background: I need to obtain the handles of the Twitter users tagged in media attached to a Tweet. Unfortunately, there is currently no API method for this (see https://twittercommunity.com/t/how-to-get-tags-of-a-media-in-a-tweet/185614 and https://github.com/twitterdev/open-evolution/issues/34).
I have no choice but to scrape. Here is an example URL: https://twitter.com/justinwood_/status/1626275168157851650/media_tags. This is the page that pops up when you click the tags link under the media of the parent Tweet: https://twitter.com/justinwood_/status/1626275168157851650/
The React-generated DOM is deep and ugly but would be scrapeable; however, I do not want to log in with any account and risk getting it banned. Unfortunately, when you visit https://twitter.com/justinwood_/status/1626275168157851650/media_tags in an Incognito window, the popup is completely empty. However, when I dig into the network requests, the response of the /TweetDetail GraphQL endpoint is full of messages about the anonymous page visit, yet it still contains the list of handles I need.
So what I need is a scraper that can execute JavaScript and capture the response to that specific GraphQL call. Selenium drives a headless Chrome here, so it can execute JavaScript, and Selenium-Wire adds the ability to capture responses.
Unfortunately, my Selenium-Wire script only captures the TweetResultByRestId and UsersByRestId GraphQL requests; the TweetDetail request is missing. I don't know what to tweak to make all the requests happen. I have iterated over a ton of Chrome options. Here is one variation of my script:
from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")  # for Jenkins
chrome_options.add_argument("--disable-dev-shm-usage")  # Jenkins
chrome_options.add_argument('--start-maximized')
chrome_options.add_argument('--window-size=1900,1080')
chrome_options.add_argument('--ignore-certificate-errors-spki-list')
chrome_options.add_argument('--ignore-ssl-errors')

selenium_options = {
    'request_storage_base_dir': '/tmp',  # Use /tmp to store captured data
    'exclude_hosts': [],  # No hosts bypass Selenium-Wire
}

# chromedriver binary, with verbose logging written to test.log
ser = Service('/usr/bin/chromedriver', service_args=["--verbose", "--log-path=test.log"])
driver = webdriver.Chrome(service=ser, options=chrome_options, seleniumwire_options=selenium_options)

tweet_id = "1626275168157851650"
twitter_media_url = f"https://twitter.com/justinwood_/status/{tweet_id}/media_tags"
driver.get(twitter_media_url)
# Wait until a request whose URL matches /TweetDetail has been captured
driver.wait_for_request("/TweetDetail", timeout=10)
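To check which GraphQL operations a run like this actually captured, something along the following lines could help. It is only a sketch: the helper names graphql_operation and captured_operations are hypothetical, and it assumes (the post does not confirm this) that the request paths look like .../graphql/<query id>/<OperationName>.

```python
from urllib.parse import urlparse


def graphql_operation(url):
    """Return the GraphQL operation name from a request URL, or None.

    Assumption: the path has the form .../graphql/<query id>/<OperationName>.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if "graphql" in parts:
        idx = parts.index("graphql")
        if len(parts) > idx + 2:
            return parts[idx + 2]
    return None


def captured_operations(requests):
    """List operation names of captured requests that received a response.

    `requests` is expected to be an iterable of Selenium-Wire request
    objects (for example driver.requests), each with .url and .response.
    """
    names = []
    for request in requests:
        name = graphql_operation(request.url)
        if name and request.response is not None:
            names.append(name)
    return names
```

Calling print(captured_operations(driver.requests)) after driver.get(...) would then show whether TweetDetail is among the captured operations.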
Any ideas?
Answer 1
Score: 0
Apparently I should scrape the parent Tweet URL https://twitter.com/justinwood_/status/1626275168157851650/ instead; with that URL, the GraphQL call I was after does seem to happen. I probably got confused while trying 100 combinations.
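A minimal sketch of how the captured TweetDetail response could then be read, assuming a driver created as in the question. The helper names fetch_tweet_detail and collect_values are hypothetical, and the "screen_name" key used in the usage note is an assumption about the response JSON, not something stated above.

```python
import json


def collect_values(node, key):
    """Recursively collect every value stored under `key` in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            else:
                found.extend(collect_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found


def fetch_tweet_detail(driver, tweet_url, timeout=10):
    """Load the Tweet page and return the parsed TweetDetail response."""
    # Imported here so collect_values stays usable without Selenium-Wire.
    from seleniumwire.utils import decode

    driver.get(tweet_url)
    # wait_for_request takes a regex that is matched against the request URL.
    request = driver.wait_for_request("/TweetDetail", timeout=timeout)
    response = request.response
    if response is None:
        raise RuntimeError("TweetDetail request has no captured response")
    # The body may be compressed; decode it according to Content-Encoding.
    body = decode(response.body, response.headers.get("Content-Encoding", "identity"))
    return json.loads(body.decode("utf-8"))
```

With the parent Tweet URL, collect_values(fetch_tweet_detail(driver, "https://twitter.com/justinwood_/status/1626275168157851650/"), "screen_name") would gather every value stored under that key. If the assumption about the key holds, the result probably also includes the Tweet author and other users in the response, so it would still need filtering down to the tagged users.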