Why does this Twitter login UI script only work in my development environment and not on the server?

Question

Twitter made a fundamental change not so long ago: you cannot view a user's timeline any more if you are not logged in. That broke all my scraping code. The code below works in my development environment, but not on the server. In my development environment it is also headless. As you can see, I sprinkled in sleeps to rule out the script simply moving too fast.

The script fails by not finding the password input. That input only comes into view after filling in the username/email input and pressing the Next button. After filling in the password, I then need to click the "Log in" button.
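As an aside, the password step described above could also be expressed as an explicit wait instead of fixed sleeps. The sketch below is only an illustration, assuming the same input[name='password'] selector and the stock expected_conditions helpers; it is not part of the code that currently runs.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def wait_for_password_input(driver, timeout=20):
    # Block until the password field that appears after pressing "Next"
    # is actually clickable, instead of sleeping for a fixed interval.
    return WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='password']"))
    )

The full code as it currently stands: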

import time
import traceback

# Selenium 3-style imports used by the snippet below.
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.support.ui import WebDriverWait

# The methods below are excerpted from a larger scraper class;
# cls.log_response is defined elsewhere in that class.

TWITTER_URL_BASE = "https://twitter.com/"
SELENIUM_INSTANCE_WAIT = 1

@classmethod
def _get_driver(cls):
    driver = None

    firefox_options = FirefoxOptions()
    firefox_options.headless = True  # Selenium 3-style headless switch
    # Firefox expects the dashed form for the window size; the Chromium-style
    # window-size / disable-gpu arguments below are effectively no-ops here.
    firefox_options.add_argument("--width=1920")
    firefox_options.add_argument("--height=1080")
    firefox_options.add_argument("window-size=1920,1080")
    firefox_options.add_argument("disable-gpu")
    # https://stackoverflow.com/questions/24653127/selenium-error-no-display-specified
    # export MOZ_HEADLESS=1
    firefox_options.binary_location = "/usr/bin/firefox"
    # firefox_options.set_preference("extensions.enabledScopes", 0)
    # firefox_options.set_preference("gfx.webrender.all", False)
    # firefox_options.set_preference("layers.acceleration.disabled", True)
    firefox_binary = FirefoxBinary("/usr/bin/firefox")
    firefox_profile = FirefoxProfile()
    firefox_options.binary = "/usr/bin/firefox"  # firefox_binary
    firefox_options.profile = firefox_profile
    capabilities = DesiredCapabilities.FIREFOX.copy()
    capabilities["pageLoadStrategy"] = "normal"  # wait for full page loads
    firefox_options._caps = capabilities  # pokes Selenium's private attribute
    try:
        driver = webdriver.Firefox(
            firefox_profile=firefox_profile,
            firefox_binary=firefox_binary,
            options=firefox_options,
            desired_capabilities=capabilities,
        )
    except Exception as e:
        cls.log_response("_get_driver", 500, "Crash: {}".format(e))
        cls.log_response("_get_driver", 500, traceback.format_exc())

    return driver

@classmethod
def _login_scraper_user(cls, driver, scraper_account):
    driver.implicitly_wait(5)
    driver.get(TWITTER_URL_BASE)
    WebDriverWait(driver, 10).until(
        lambda dr: dr.execute_script("return document.readyState") == "complete"
    )
    time.sleep(SELENIUM_INSTANCE_WAIT)
    username_inputs = driver.find_elements_by_css_selector("input[name='text']")
    if not username_inputs:
        return False

    username_input_parent = (
        username_inputs[0].find_element_by_xpath("..").find_element_by_xpath("..")
    )
    username_input_parent.click()
    time.sleep(SELENIUM_INSTANCE_WAIT)
    username_inputs[0].click()
    time.sleep(SELENIUM_INSTANCE_WAIT)
    username_inputs[0].send_keys(scraper_account["username"])
    time.sleep(SELENIUM_INSTANCE_WAIT)
    next_buttons = driver.find_elements_by_xpath('//span[text()="Next"]')
    if not next_buttons:
        return False

    next_buttons[0].click()
    time.sleep(SELENIUM_INSTANCE_WAIT)

    # The password field only renders after "Next" is accepted; this is the
    # lookup that comes back empty on the server.
    password_inputs = driver.find_elements_by_css_selector("input[name='password']")
    if not password_inputs:
        return False

    password_input_parent = (
        password_inputs[0].find_element_by_xpath("..").find_element_by_xpath("..")
    )
    password_input_parent.click()
    time.sleep(SELENIUM_INSTANCE_WAIT)
    password_inputs[0].click()
    time.sleep(SELENIUM_INSTANCE_WAIT)
    password_inputs[0].send_keys(scraper_account["password"])
    time.sleep(SELENIUM_INSTANCE_WAIT)
    login_buttons = driver.find_elements_by_xpath('//span[text()="Log in"]')
    if not login_buttons:
        return False

    login_buttons[0].click()
    time.sleep(SELENIUM_INSTANCE_WAIT)

    if driver.find_elements_by_xpath(
        '//span[text()="Boost your account security"]'
    ):
        close_buttons = driver.find_elements_by_css_selector(
            "div[data-testid='app-bar-close']"
        )
        if not close_buttons:
            return False

        close_buttons[0].click()

    driver.implicitly_wait(0)
    return True
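
One way to see what the headless Firefox on the server actually rendered at the point of failure is to dump a screenshot and the page source right before the early returns. This is a minimal sketch; the /tmp output paths are arbitrary placeholders:

import time

def dump_debug_state(driver, label):
    # Persist what the headless browser rendered so a server-side failure
    # can be inspected after the fact.
    stamp = "{}-{}".format(label, int(time.time()))
    driver.save_screenshot("/tmp/{}.png".format(stamp))
    with open("/tmp/{}.html".format(stamp), "w") as handle:
        handle.write(driver.page_source)

Calling dump_debug_state(driver, "password-missing") just before the return False branches shows whether Twitter served the login form at all or an interstitial page instead.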

This is an old version of Selenium because the server lags behind due to technical debt (it's an IaaS). I'm using the same ancient Selenium; my Firefox, however, is up to date.
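
Since a new Firefox driven by an old Selenium/geckodriver pairing is one obvious difference between the two machines, it can also be worth logging what each side actually negotiated. A small sketch, assuming geckodriver and using only the standard capability keys (missing keys simply print as None):

def print_browser_versions(driver):
    # The remote end reports what it actually launched; comparing these
    # values between the dev machine and the server narrows down mismatches.
    caps = driver.capabilities
    print("browser:", caps.get("browserName"), caps.get("browserVersion"))
    print("geckodriver:", caps.get("moz:geckodriverVersion"))
    print("headless:", caps.get("moz:headless"))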


Just a little follow-up:
The whole API pricing -> scraping (which I predicted on the 1st of May) -> scrape-prevention fight has unnecessarily reached the user level: https://www.cnbc.com/2023/07/03/users-flock-to-twitter-competitor-bluesky-after-elon-musk-imposes-rate-limits.html
Congratulations!

Answer 1

Score: 0

After analyzing the situation, it seems that the likely reason is that the scraper accounts get flagged:

> In order to protect your account from suspicious activity, we've sent a confirmation code to tw************@g****.***. Enter it below to sign in.
> Why am I being asked for this information?

I tried to apply some anti-scrape-prevention measures, but ultimately the account requirement makes it super simple for Twitter to tell scrapers apart. I won't go to the length of farming email accounts and registering a new Twitter account for every scrape. It's probably game over, you guys.
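
In case it helps anyone, the login helper can at least detect this challenge and bail out cleanly instead of failing later on the missing password field. This is only a rough sketch; the text match is an assumption based on the prompt quoted above, not a stable selector:

def hit_verification_challenge(driver):
    # On flagged accounts Twitter swaps the password step for a
    # confirmation-code prompt; look for the prompt text quoted above.
    markers = driver.find_elements_by_xpath(
        '//span[contains(text(), "sent a confirmation code")]'
    )
    return bool(markers)

Checking this right after clicking "Next" makes the flagged-account case distinguishable from a genuine timing problem.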

Twitter's new API plan costs up to $2.5 million per year

I mean even Microsoft decided not to pay: Microsoft’s refusal to pay for Twitter’s API has outraged Elon Musk

Probably the new $5000/mo tier won't help either.


Update: this whole shenanigan has gotten to the point that it even affects regular users. I predicted that people would scrape, and I also predicted anti-scraping measures and anti-scrape-prevention measures. But I NEVER thought this whole issue would reach end users. This is pathetic! Facepalm!
https://www.cnbc.com/2023/07/03/users-flock-to-twitter-competitor-bluesky-after-elon-musk-imposes-rate-limits.html
