February 19, 2023, 05:06:07

How to make Selenium-Wire perform an indirect GraphQL AJAX request I expect and need?

Question

Background: I need to get the handles of the Twitter users tagged in media attached to a Tweet. Unfortunately, there is currently no API method for this (see https://twittercommunity.com/t/how-to-get-tags-of-a-media-in-a-tweet/185614 and https://github.com/twitterdev/open-evolution/issues/34).
I have no choice but to scrape. An example URL is https://twitter.com/justinwood_/status/1626275168157851650/media_tags. This is the page that pops up when you click the tags link under the media of the parent Tweet: https://twitter.com/justinwood_/status/1626275168157851650/

The React-generated DOM is deep and ugly but would be scrapeable. However, I do not want to log in with any account and risk getting it banned. Unfortunately, when you visit https://twitter.com/justinwood_/status/1626275168157851650/media_tags in an Incognito window, the popup is completely empty. When I dig into the network requests, though, the response of the /TweetDetail GraphQL endpoint is full of messages about the anonymous page visit, yet fortunately it still contains the list of handles I need.

So what I need is a scraper that can process JavaScript and capture the response of that specific GraphQL call. Selenium drives a headless Chrome under the hood, so it is able to process JavaScript, and Selenium-Wire offers the ability to capture the response.
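
Once the response body of the GraphQL call is captured, the handles still have to be pulled out of the JSON. Below is a minimal sketch of that step. It assumes the handles are stored under a key named `screen_name` somewhere in the nested payload; the helper names `iter_values` and `extract_handles` are made up for illustration, and the real layout of the TweetDetail response should be checked in the browser's network tab.

```python
import json
from typing import Any, Iterator, List


def iter_values(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` anywhere in a nested JSON structure."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: the same key may appear deeper in the tree.
            yield from iter_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_values(item, key)


def extract_handles(body: bytes, key: str = "screen_name") -> List[str]:
    """Return the unique string values found under `key`, in first-seen order.

    Assumption: `body` is the already-decompressed UTF-8 JSON body of the
    captured response, and handles live under `key`.
    """
    payload = json.loads(body.decode("utf-8"))
    handles: List[str] = []
    for value in iter_values(payload, key):
        if isinstance(value, str) and value not in handles:
            handles.append(value)
    return handles
```

Note that this collects every `screen_name` in the payload, which would likely include the Tweet author as well as the tagged users, so some filtering may still be needed.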

Unfortunately, my Selenium-Wire script only captures the TweetResultByRestId and UsersByRestId GraphQL requests; the TweetDetail request is missing. I don't know what to tweak to make all the requests happen. I have iterated over a ton of Chrome options. Here is one variation of my script:

from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless") # for Jenkins
chrome_options.add_argument("--disable-dev-shm-usage") # Jenkins
chrome_options.add_argument('--start-maximized')
chrome_options.add_argument('--window-size=1900,1080')
chrome_options.add_argument('--ignore-certificate-errors-spki-list')
chrome_options.add_argument('--ignore-ssl-errors')
selenium_options = {
    'request_storage_base_dir': '/tmp', # Use /tmp to store captured data
    'exclude_hosts': ''
}
ser = Service('/usr/bin/chromedriver')
ser.service_args=["--verbose", "--log-path=test.log"]
driver = webdriver.Chrome(service=ser, options=chrome_options, seleniumwire_options=selenium_options)
tweet_id = "1626275168157851650"
twitter_media_url = f"https://twitter.com/justinwood_/status/{tweet_id}/media_tags"
driver.get(twitter_media_url)
driver.wait_for_request("/TweetDetail", timeout=10)

Any ideas?

Answer 1

Score: 0

Apparently I need to scrape the parent Tweet URL https://twitter.com/justinwood_/status/1626275168157851650/ instead, and with that URL the GraphQL call I was after does seem to happen. I probably got confused while trying 100 combinations.
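
A minimal sketch of that change, assuming the only difference from the script in the question is the URL being loaded. The helper names `parent_tweet_url` and `capture_tweet_detail` are hypothetical, the Chrome options are trimmed to a bare minimum, and it is assumed that a matching chromedriver can be found by Selenium. Whether Twitter still serves /TweetDetail to anonymous visitors is not something this sketch can guarantee.

```python
from typing import Optional


def parent_tweet_url(url: str) -> str:
    """Drop a trailing /media_tags segment so the parent Tweet page is loaded."""
    trimmed = url.rstrip("/")
    suffix = "/media_tags"
    if trimmed.endswith(suffix):
        trimmed = trimmed[: -len(suffix)]
    return trimmed + "/"


def capture_tweet_detail(url: str, timeout: int = 10) -> Optional[bytes]:
    """Load the parent Tweet page and return the decoded /TweetDetail body."""
    # Imported lazily so parent_tweet_url stays usable without Selenium-Wire.
    from seleniumwire import webdriver
    from seleniumwire.utils import decode

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--window-size=1900,1080")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(parent_tweet_url(url))
        # Raises a timeout exception if no matching request is seen in time.
        request = driver.wait_for_request("/TweetDetail", timeout=timeout)
        response = request.response
        if response is None:
            return None
        # Bodies are often compressed; decode according to the response header.
        encoding = response.headers.get("Content-Encoding", "identity")
        return decode(response.body, encoding)
    finally:
        driver.quit()
```

Calling `capture_tweet_detail` with either the /media_tags URL or the parent Tweet URL loads the same parent page, since the suffix is stripped first.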
