Selenium: unable to obtain all price and date information for a product

Question

The following part of the code addresses the empty price and date lists:

# Get the prices
prices = []
dates = []

# Wait for the initial price element to appear
wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='price-area']//strong[@itemprop='price']")))

while len(prices) < months:
    price_elems = driver.find_elements(By.XPATH, "//div[@class='price-area']//strong[@itemprop='price']")
    date_elems = driver.find_elements(By.XPATH, "//div[@class='product-info']//span[@class='product-info-date']")

    for price_elem, date_elem in zip(price_elems, date_elems):
        price = float(price_elem.text.replace('.', '').replace(',', '.'))
        date = pd.to_datetime(date_elem.text, format='%d %B %Y, %H:%M')
        prices.append(price)
        dates.append(date)

    next_button = driver.find_element(By.XPATH, "//a[@class='page-next']")
    if 'disabled' in next_button.get_attribute('class'):
        break
    else:
        driver.execute_script("arguments[0].click();", next_button)

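This code addresses problems that can occur while collecting the price and date information. After waiting for the initial price element to appear, it loops over the price and date elements and appends their values to the corresponding lists. It also handles pagination so that more price and date data can be collected.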
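The text-to-value conversion inside that loop can be checked in isolation, without a browser. The sketch below assumes the page shows prices in Turkish notation (`.` as thousands separator, `,` as decimal separator) and timestamps with English month names; the helper names `parse_tr_price` and `parse_timestamp` are hypothetical and not part of the original code.

```python
from datetime import datetime


def parse_tr_price(text: str) -> float:
    """Convert a Turkish-formatted price such as '1.234,56 TL' to a float."""
    cleaned = text.replace("TL", "").strip()
    # '.' is the thousands separator and ',' is the decimal separator.
    return float(cleaned.replace(".", "").replace(",", "."))


def parse_timestamp(text: str) -> datetime:
    """Parse a timestamp such as '03 March 2023, 22:50'.

    Assumption: month names are in English. '%B' follows the active locale,
    so a page that shows Turkish month names would need different handling.
    """
    return datetime.strptime(text, "%d %B %Y, %H:%M")


if __name__ == "__main__":
    print(parse_tr_price("1.234,56 TL"))
    print(parse_timestamp("03 March 2023, 22:50"))
```

Running these helpers on text copied by hand from the product page is a quick way to tell a parsing problem apart from a locator problem.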

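Note that the rest of the code selects the data for the last X months and plots the price changes. Those parts appear to be fine, provided the price and date data were collected without errors. If the problem persists, check for other issues related to the page structure or the element locators.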

The code below tracks the price of the cheapest product returned by a search on the site, from the same seller, at the same time of day over a given period:

import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import matplotlib.pyplot as plt
from time import sleep
import inspect
import os
from bs4 import BeautifulSoup
import requests


# Get the search term and tracking period from the user
search_term = input("Please enter the name of the product you want to search: ")
months = input("Please enter the number of months you want to track the product: ")

# Ensure that the user enters a whole number
while not months.isdigit():
    print("Warning: Please enter a valid integer value for the number of months.")
    months = input("Please enter the number of months you want to track the product: ")
months = int(months)


# Start the web driver and go to the Hepsiburada homepage
options = uc.ChromeOptions()
options.add_argument('--blink-settings=imagesEnabled=false') # disable images so the page loads faster
options.add_argument('--disable-notifications')
prefs = {"profile.default_content_setting_values.notifications" : 2}
options.add_experimental_option("prefs",prefs)
driver = uc.Chrome(options=options)

url = 'https://www.hepsiburada.com/'
driver.get(url)
wait = WebDriverWait(driver, 15)

# close cookies bar
wait.until(EC.element_to_be_clickable((By.ID, 'onetrust-accept-btn-handler'))).click()

# Enter the search term in the search box and press Enter
search_box = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'theme-IYtZzqYPto8PhOx3ku3c')))
search_box.send_keys(search_term + Keys.RETURN)



# load all products
number_of_products = int(wait.until(EC.visibility_of_all_elements_located((By.CLASS_NAME, 'searchResultSummaryBar-AVnHBWRNB0_veFy34hco')))[1].text)
### visibility_of_all_elements_located is a wait strategy in Selenium that checks if all elements of a certain type are visible on the page and waits until they become visible before continuing.


number_of_loaded_products = 0
while number_of_loaded_products < number_of_products:
    loaded_products = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'li[class*=productListContent][id]')))
    number_of_loaded_products = len(loaded_products)
    driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', loaded_products[-1])

# Get the link, name, price and seller of all the products
product = {key:[] for key in ['name','price','seller','url']}
product['name']  = [h3.text for h3 in driver.find_elements(By.CSS_SELECTOR, 'h3[data-test-id=product-card-name]')]
product['url']   = [a.get_attribute('href') for a in driver.find_elements(By.CSS_SELECTOR, 'a[class*=ProductCard]')]
product['price'] = [float(div.text.replace('TL','').replace(',','.')) for div in driver.find_elements(By.CSS_SELECTOR, 'div[data-test-id=price-current-price]')]
for i,url in enumerate(product['url']):
    print(f'Search seller names {i+1}/{number_of_loaded_products}', end='\r')
    driver.get(url)
    product['seller'] += [wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.seller a'))).text]
    product['url'][i] = driver.current_url # useful to replace some long urls

# Sort by price in ascending order
product_list = pd.DataFrame(product).sort_values(by='price').to_dict('list')

print(f"\nThe product selected from the search results is:"+
      f"\nname:   {product_list['name'][0]}"+
      f"\nprice:  {product_list['price'][0]}"+
      f"\nseller: {product_list['seller'][0]}"+
      f"\nurl:    {product_list['url'][0]}")
    
# Go to the page of the selected product
driver.get(product_list['url'][0])

# Get the prices
prices = []
dates = []
while len(prices) < months:
    price_elems = driver.find_elements(By.XPATH, "//div[@class='price-area']//strong[@itemprop='price']")
    print(price_elems)
    date_elems = driver.find_elements(By.XPATH, "//div[@class='product-info']//span[@class='product-info-date']")
    print(date_elems)
    for price_elem, date_elem in zip(price_elems, date_elems):
        price = float(price_elem.text.replace('.', '').replace(',', '.'))
        date = pd.to_datetime(date_elem.text, format='%d %B %Y, %H:%M')
        prices.append(price)
        dates.append(date)

    next_button = driver.find_element(By.XPATH, "//a[@class='page-next']")
    if 'disabled' in next_button.get_attribute('class'):
        break
    else:
        driver.execute_script("arguments[0].click();", next_button)
        

# Create a DataFrame and select the data for the last X months
df = pd.DataFrame({'Date': dates, 'Price': prices})
df['Hour'] = df['Date'].dt.hour
df = df.groupby(['Date', 'Hour']).mean().reset_index()
start_date = pd.Timestamp.today() - pd.DateOffset(months=months)
end_date = pd.Timestamp.today()
df = df.loc[(df['Date'] >= start_date) & (df['Date'] <= end_date)]

# Create the plot
plt.plot(df['Date'], df['Price'])
plt.title('Price Changes of {} in the Last {} Months'.format(product_list['name'][0], months))
plt.xlabel('Date')
plt.ylabel('Price (TL)')
plt.show()

I am trying to plot the price of the cheapest product found by a search on the website Hepsiburada.com.tr, for the same seller, over a given number of months, sampled at the same time of day (for instance, say the product whose price we follow is "pınar süt 1lt"). However, I cannot draw the graph because I cannot obtain the "prices" and "dates" information: both lists are empty. How can I obtain this graph?

Focus: the piece of code under the '# Get the prices' comment works incorrectly. The code up to that point works properly.

Answer 1

Score: 1

I'll try to point you towards a solution. As I understand it, you need to track a product's price changes and process them in some way. You can do that by periodically running a script that collects the product's price and stores it somewhere for later analysis.
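As a rough illustration of the "collect periodically and store" idea, here is a minimal sketch that appends one sample per run to a CSV file. The file layout and the function names `append_price_sample` and `load_price_history` are assumptions made for this example, not part of the question's code.

```python
import csv
from datetime import datetime
from pathlib import Path


def append_price_sample(path, price, when=None):
    """Append one (timestamp, price) row, writing a header if the file is new."""
    path = Path(path)
    when = when or datetime.now()
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(["timestamp", "price"])
        writer.writerow([when.isoformat(timespec="seconds"), f"{price:.2f}"])


def load_price_history(path):
    """Read the CSV back as a list of (datetime, float) tuples."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return [
            (datetime.fromisoformat(row["timestamp"]), float(row["price"]))
            for row in csv.DictReader(handle)
        ]
```

A scheduler (cron, Task Scheduler, or similar) could run the scraping script once a day at a fixed hour and call `append_price_sample` with the scraped price; the plot can then be drawn from `load_price_history` instead of from a single browsing session.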

I see that you currently use WebDriver to grab the price data. My first suggestion is to try a web API instead. Communicating with a website over HTTP is usually much faster and more stable than driving its UI. Automating a UI means dealing with inaccessible elements, tricky waits, and unexpected overlapping controls. A web API gives you a clear interface for requesting and receiving the data you need, without the overhead of handling the UI. In short: the UI is for humans, the API is for machines. Use an API if one is available.
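To make the API suggestion concrete, here is a hedged sketch using only the standard library. Whether the site exposes such an endpoint is not stated in this thread; the URL passed to `fetch_json` and the payload shape (`{"history": [{"date": ..., "price": ...}]}`) are purely hypothetical placeholders.

```python
import json
from datetime import datetime
from urllib.request import Request, urlopen


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body (the endpoint itself is hypothetical)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


def extract_price_points(payload):
    """Turn an assumed payload shape into sorted (datetime, price) tuples."""
    points = [
        (datetime.fromisoformat(entry["date"]), float(entry["price"]))
        for entry in payload.get("history", [])
    ]
    return sorted(points)
```

Keeping the network call and the parsing in separate functions means the parsing can be tested against a saved response, which is harder to do with a UI-driven scraper.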

If you need to stay with WebDriver, check the following to address the data extraction issues:

  • make sure the required data is displayed on the page before extraction

Make sure the steps leading to the target data page complete successfully. Some automated interactions can fail silently, so the needed data is never shown. Button clicks may be skipped, or element-state waits may be wrong, letting the script try to extract data that is not yet displayed.

Watch your script run in real time and make sure it passes every step before data extraction. If some steps fail, add sleeps at first just to verify that it is a page-state issue, then replace them with custom waits if so. If a click fails, try a different click method.
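A custom wait of the kind mentioned above can be as small as a polling loop. The following is a generic sketch rather than Selenium's own implementation; the name `wait_until` and its default timings are assumptions chosen for this example.

```python
import time


def wait_until(condition, timeout=15.0, poll_interval=0.5):
    """Call condition() until it returns something truthy, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Possible use with the question's driver (not executed here):
# price_elems = wait_until(
#     lambda: driver.find_elements(By.XPATH, "//div[@class='price-area']//strong[@itemprop='price']")
# )
```

Because `find_elements` returns an empty list when nothing matches, the lambda stays falsy until at least one element exists. Selenium's `WebDriverWait(driver, 15).until(...)` accepts a similar callable that receives the driver as its argument, so the same condition can be expressed there without a separate helper.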

  • make sure the required data is in view, not outside the visible page area

Some controls need to be in view before data can be extracted from them. Scroll the page if needed.

  • select the parent iframe before interacting with a child element

Some UI controls may be placed inside iframes. If interaction with a control fails, check whether it is inside an iframe. WebDriver needs to be switched to the parent iframe before it can interact with the controls inside. Use the browser's web inspector to find out whether a given UI control is inside an iframe.
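The iframe check can also be scripted. The sketch below looks for a locator in the top-level document and then in each first-level iframe, using WebDriver's `switch_to.frame` and `switch_to.default_content`. The helper name `find_in_any_frame` is hypothetical, nested iframes are not handled, and the locator strategy is passed as a plain string (`"tag name"` is the value behind Selenium's `By.TAG_NAME`) so the function has no import of its own.

```python
def find_in_any_frame(driver, by, value):
    """Search the top-level page, then each first-level iframe.

    Returns (frame, elements). frame is None when the match is in the
    top-level document or when nothing was found. On a match inside an
    iframe, the driver is left switched into that iframe.
    """
    driver.switch_to.default_content()
    elements = driver.find_elements(by, value)
    if elements:
        return None, elements

    for frame in driver.find_elements("tag name", "iframe"):
        driver.switch_to.frame(frame)
        elements = driver.find_elements(by, value)
        if elements:
            return frame, elements
        # Nothing here: go back to the top before trying the next iframe.
        driver.switch_to.default_content()

    return None, []
```

If this returns a non-`None` frame for the price locator, the original `find_elements` calls were searching the wrong document, which would explain an empty result.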

  • try running the script in another browser

Ideally, a WebDriver script should work the same in all browsers, but in practice issues do happen, and a button click that fails in one browser may work in another. If your script looks perfect but element interaction still fails, try another browser.

huangapple
  • This article was posted on 2023-03-03 22:50:35
  • If you repost it, please keep the link to this article: https://go.coder-hub.com/75628559.html