Posted: 2023-07-11 01:38:25

Unable to scrape a hidden phone number from a webpage using the requests module

Question

I'm trying to use the `requests` module to scrape a phone number located in the middle of a [webpage](https://www.olx.ro/d/oferta/trepte-glafuri-plinte-IDh12rV.html). The number is revealed only after clicking the button right next to it, which is labeled `To Miss` in English or `Arata` in Romanian.

Normally, when I run the script as is, it throws a `KeyError` on this line: `params['context'] = resp.json()['context']`.

However, when I open that link in Chrome, manually click the button next to the hidden phone number to reveal it, and then run the script, it fetches the phone number flawlessly.

How can I use the script to scrape the phone number without manual intervention?

```python
import requests
from bs4 import BeautifulSoup

link = 'https://www.olx.ro/d/oferta/trepte-glafuri-plinte-IDh12rV.html'
phone_link_base = 'https://www.olx.ro/api/v1/offers/{}/limited-phones/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9,bn;q=0.8',
    'Origin': 'https://www.olx.ro',
    'Referer': 'https://www.olx.ro/',
    'Authorization': 'Bearer d4305f7b7375e182187d3d4bc5226b688c9ea131',
}

payload = {"action": "reveal_phone_number", "aud": "atlas", "actor": {"username": "700663656"}, "scene": {"origin": "www.olx.ro", "sitecode": "olxro", "ad_id": ""}}
params = {"context": "", "response": ""}
ad_id = '251445460'

with requests.Session() as s:
    s.headers.update(headers)
    payload['scene']['ad_id'] = ad_id
    resp = s.post('https://friction.olxgroup.com/challenge', json=payload)
    params['context'] = resp.json()['context']
    res = s.post('https://friction.olxgroup.com/exchange', json=params)
    headers['Friction-Token'] = res.json()['token']
    s.headers.update(headers)
    response = s.get(phone_link_base.format(ad_id))
    print(response.json()['data']['phones'])
```

Here is another link with ad id number for your test:

```python
another_link = 'https://www.olx.ro/d/oferta/echipa-meseriasi-constructi-execut-case-la-rosu-finisaje-interioare-IDdT0s8.html'
ad_id = '205202152'
```

Answer 1

Score: 3

To expand on the comments: `requests` is not a browser and will not execute JavaScript, as mentioned in psf/requests issue 6040.
When you click the "<kbd>Arata</kbd>" button manually, the browser sends an AJAX request to the server to retrieve the phone number, and your current `requests` approach does not replicate that.

It is also possible OLX utilizes complex measures to prevent web scraping, including dynamic content loading and anti-bot mechanisms.
For instance, the URL you are trying to scrape (https://www.olx.ro/api/v1/offers/{}/limited-phones/) is under the "Disallow: /api/" path, which according to the robots.txt file, should not be accessed by robots (which includes web scrapers).
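
To illustrate the robots.txt point, here is a minimal sketch using the standard library's `urllib.robotparser`. It parses a two-line excerpt in memory rather than downloading anything. The `Disallow: /api/` rule is the one quoted above; the `User-agent: *` line and the helper name `is_allowed` are assumptions for the example, and the real file may contain more rules.

```python
from urllib.robotparser import RobotFileParser

# Assumed excerpt: only the "Disallow: /api/" rule is quoted in this answer;
# the "User-agent: *" line is an assumption made for this sketch.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /api/",
]


def is_allowed(url: str, user_agent: str = "*") -> bool:
    """Return True if the excerpt above lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_LINES)
    return parser.can_fetch(user_agent, url)


phone_api = "https://www.olx.ro/api/v1/offers/251445460/limited-phones/"
ad_page = "https://www.olx.ro/d/oferta/trepte-glafuri-plinte-IDh12rV.html"

print("phone API allowed:", is_allowed(phone_api))
print("ad page allowed:", is_allowed(ad_page))
```

Under that excerpt, the phone endpoint falls under `/api/` while the ad page itself does not.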

Back to your code: using static headers and payload, especially using a hard-coded authorization token, is not good practice and will not work most of the time, as they are subject to change.
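
One plausible reading of the `KeyError` is that the challenge endpoint answered with a body that has no `context` field, for example because the hard-coded token was rejected. That is an assumption, not something confirmed here. A small sketch of a helper (the name `require_key` and both sample bodies are hypothetical) that surfaces what the server actually returned instead of failing on a bare lookup:

```python
def require_key(body: dict, key: str, step: str):
    """Return body[key], or raise an error that shows what was received."""
    if key not in body:
        raise RuntimeError(f"{step}: no {key!r} in response, got {body!r}")
    return body[key]


# Hypothetical response bodies, used only to illustrate both outcomes.
ok_body = {"context": "abc123"}
rejected_body = {"error": "unauthorized"}

print(require_key(ok_body, "context", "challenge"))

try:
    require_key(rejected_body, "context", "challenge")
except RuntimeError as exc:
    print(exc)
```

In the question's script this would wrap `resp.json()` before reading `'context'`, so the failing step and the returned body are visible in the error message.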

So, as commented, Selenium would automate a browser and allow you to interact with JavaScript-based websites, clicking on buttons, and waiting for responses, provided you are logged on first.

For that, you need to understand the structure of the login page. In general, you should find the elements for the username and password inputs, fill them with your credentials, and then click the login button. After login, you can extract the token from the cookies (or just continue using Selenium, which should use it automatically in subsequent URL queries).

Here is a general example of how to log in using Selenium:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

# Replace with the path to your own chromedriver.
# Selenium 4 takes the driver path through a Service object.
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))

# Go to the login page
driver.get('https://www.olx.ro/account/?origin=header&backUrl=%2F')

# Enter username and password
username = driver.find_element(By.ID, 'userEmail')
password = driver.find_element(By.ID, 'userPass')
username.send_keys('your_username')
password.send_keys('your_password')

# Click the login button
login_button = driver.find_element(By.ID, 'se_userLogin')
login_button.click()

# Wait for login to complete
time.sleep(5)

# Assuming the cookie name is 'access_token'
cookies = driver.get_cookies()
for cookie in cookies:
    if cookie['name'] == 'access_token':
        print(cookie['value'])

driver.quit()
```

Replace `'your_username'` and `'your_password'` with your actual username and password. You also need to check that `'userEmail'`, `'userPass'` and `'se_userLogin'` match the actual IDs of those elements on the login page, which you can find by inspecting the webpage. The same goes for `'access_token'`: replace it with the actual name of the cookie containing the access token.
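
If you want to reuse the token outside the browser, the cookie list can be handled as plain data. The sketch below assumes the list-of-dicts shape that `driver.get_cookies()` returns, with `name` and `value` keys. The helper names `find_cookie` and `cookie_header` and the sample cookies are made up for illustration, and `access_token` remains the assumed cookie name from above.

```python
from typing import Dict, List, Optional


def find_cookie(cookies: List[Dict[str, str]], name: str) -> Optional[str]:
    """Return the value of the first cookie called `name`, or None."""
    for cookie in cookies:
        if cookie.get("name") == name:
            return cookie.get("value")
    return None


def cookie_header(cookies: List[Dict[str, str]]) -> str:
    """Join cookies into a single Cookie request-header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


# Made-up sample in the shape of driver.get_cookies() output.
sample_cookies = [
    {"name": "access_token", "value": "abc123"},
    {"name": "lang", "value": "ro"},
]

print(find_cookie(sample_cookies, "access_token"))
print(find_cookie(sample_cookies, "missing"))
print(cookie_header(sample_cookies))
```

A lookup that returns `None` is a hint that the cookie name guessed above is not the one the site actually uses.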

However, be aware that automating the process of logging into a user account may violate the website's terms of service.
If a website is using such tokens to prevent scraping, it is clear they do not want their site to be scraped. Not only may it be against their terms of service, but it could also be illegal, depending on your jurisdiction.

OLX's ToS (Terms of Service) include:

> Any aggregation and processing of data and other information available on the Website for the purpose of further distributing them to third parties on other websites and outside the Internet is prohibited. Also, the use of the Website and OLX signs, including graphic elements characteristic, without the express and prior consent of OLX is prohibited.

(Or so I believe: I am not from Romania, and "I Am Not A Lawyer")

But if you have made sure that you are allowed to automate the login part, you can add the following to the script:

```python
# Place this before driver.quit() in the login script above.

# Go to the specific ad page
driver.get('https://www.olx.ro/d/oferta/trepte-glafuri-plinte-IDh12rV.html')

# Wait for the page to load
time.sleep(5)

# Click the "Arata" button
show_phone_button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-action='show-phone']"))
)
show_phone_button.click()

# Wait for the phone number to load
time.sleep(5)

# Extract the phone number
phone_number = driver.find_element(By.CSS_SELECTOR, "span[class='xxxx-large margintop7']")
print('Phone number:', phone_number.text)
```
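
The fixed `time.sleep(5)` calls either wait too long or not long enough. `WebDriverWait` avoids that by polling a condition until it holds or a timeout expires. The standard-library sketch below shows that polling idea in isolation. The helper `wait_until` and the fake probe are hypothetical and do not touch a browser.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(probe: Callable[[], Optional[T]],
               timeout: float = 10.0,
               interval: float = 0.5) -> T:
    """Call `probe` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Fake probe standing in for "is the phone number visible yet?":
# it reports nothing on the first two calls and a value on the third.
attempts = {"count": 0}


def fake_probe() -> Optional[str]:
    attempts["count"] += 1
    return "loaded" if attempts["count"] >= 3 else None


print(wait_until(fake_probe, timeout=2.0, interval=0.01))
print("probe calls:", attempts["count"])
```

With Selenium itself, the same effect would normally come from `WebDriverWait(driver, 10).until(...)` with a suitable expected condition on the phone-number element, as already done for the button.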

Again, make sure this is for one-off, local, and limited use.
