Selenium WebDriverWait TimeoutException When Trying to Fetch Data into Pandas DataFrame


Question


Hello Stack Overflow community,

I'm currently working on a Python script that involves fetching data from a webpage and storing it in a pandas DataFrame. However, I'm encountering an issue where the DataFrame is returning as null. I'm unable to fetch the records as expected.

Here's the code I'm working with:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    from webdriver_manager.chrome import ChromeDriverManager
    from bs4 import BeautifulSoup
    import pandas as pd

    def extract_data_from_table(table):
        countries = []
        regions_states = []
        start_dates = []
        end_dates = []
        if table is not None:
            for row in table.find_all('tr')[1:]:
                columns = row.find_all('td')
                if len(columns) >= 4:
                    countries.append(columns[0].text.strip())
                    regions_states.append(columns[1].text.strip())
                    start_dates.append(columns[2].text.strip())
                    end_dates.append(columns[3].text.strip())
            return pd.DataFrame({
                'Country': countries,
                'Regions/States': regions_states,
                'DST Start Date': start_dates,
                'DST End Date': end_dates
            })
        else:
            return None

    url = "https://www.timeanddate.com/time/dst/2023.html"

    # Create an instance of Chrome Options
    options = Options()
    options.add_argument("start-maximized")
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

    # Set up a WebDriverWait that will wait up to 1000 seconds for the table to appear
    wait = WebDriverWait(driver, 1000)
    driver.get(url)

    # Wait for the table to appear
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'table table--inner-borders-all table--left table--striped table--hover')))

    # Get the page source and parse it with BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    table = soup.find('table', class_='table table--inner-borders-all table--left table--striped table--hover')
    df = extract_data_from_table(table)
    driver.quit()

    if df is not None:
        print(df)

When I run this code, I expect to see a DataFrame filled with the records I'm trying to fetch. However, what I'm getting instead is a null DataFrame. I've tried to debug this issue by checking the source of the records and ensuring that the data is indeed there, but I'm still unable to populate the DataFrame.

Here's the error message I'm receiving:

    Traceback (most recent call last):
      File "/Users/rajeevranjanpandey/test.py", line 51, in <module>
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'table table--inner-borders-all table--left table--striped table--hover')))
      File "/Users/rajeevranjanpandey/Library/Python/3.9/lib/python/site-packages/selenium/webdriver/support/wait.py", line 95, in until
        raise TimeoutException(message, screen, stacktrace)

Here are the steps I've taken to try to solve this problem:

  • Checked the URL to ensure it's correct and the webpage is accessible.
  • Verified that the table I'm trying to scrape exists on the webpage.
  • Checked the class name of the table in the webpage's HTML to ensure it matches the one in my code.
  • Increased the WebDriverWait timeout to see if the table just needed more time to load.

Despite these steps, I'm still encountering the same issue. I'm relatively new to Python, Selenium, and pandas, so I'm not sure what I'm doing wrong. Could anyone suggest what might be the issue here? Any help would be greatly appreciated.

Thank you in advance!

Answer 1

Score: 0


By.CLASS_NAME accepts only a single class value, not a space-separated list of classes, so your compound locator never matches anything and the wait eventually times out. Use By.CSS_SELECTOR instead:

    # Wait for the table to appear
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.table.table--inner-borders-all.table--left.table--striped.table--hover')))
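
As an aside, if you would rather keep By.CLASS_NAME, a minimal sketch is to pass just one of the classes, reusing the wait, By and EC objects already set up in your script. This assumes table--striped appears only on the target table, which you would need to verify on the page:

    # Assumption: 'table--striped' uniquely identifies the DST table on this page
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'table--striped')))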

    # Get the HTML of the table element
    tableContent = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.table.table--inner-borders-all.table--left.table--striped.table--hover'))).get_attribute("outerHTML")
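
Since until() returns the element it located, the two waits can also be collapsed into one; a small sketch of the same idea:

    # presence_of_element_located returns the WebElement once it is found
    table_el = wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, '.table.table--inner-borders-all.table--left.table--striped.table--hover')))
    tableContent = table_el.get_attribute("outerHTML")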

    # Use pandas' built-in parser to build the DataFrame; no need for BeautifulSoup
    df = pd.read_html(tableContent)[0]
    print(df)
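
One caveat, assuming you are on a recent pandas (2.1 or newer, where passing a literal HTML string to read_html is deprecated): wrapping the string in StringIO keeps the same snippet warning-free.

    from io import StringIO

    # Newer pandas versions prefer a file-like object over a raw HTML string
    df = pd.read_html(StringIO(tableContent))[0]
    print(df)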

Alternatively, you can do it in just two lines of code; Selenium isn't even needed:

    df = pd.read_html("https://www.timeanddate.com/time/dst/2023.html")
    print(df[0])
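
read_html returns every table it finds on the page, so indexing with [0] relies on the DST table coming first. A hedged alternative is the match argument, which keeps only tables whose text matches the given string (this assumes "Country" occurs only in the target table):

    # Keep only tables containing the text "Country"
    df = pd.read_html("https://www.timeanddate.com/time/dst/2023.html", match="Country")[0]
    print(df)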

Snapshot:

[Screenshot of the resulting DataFrame: https://i.stack.imgur.com/aGwg6.png]

Answer 2

Score: 0


Here's the complete solution as per your parsing logic.

Without using Selenium (with requests+BeautifulSoup):

    from bs4 import BeautifulSoup
    import pandas as pd
    import requests

    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)

    def extract_data_from_table(table):
        countries = []
        regions_states = []
        start_dates = []
        end_dates = []
        prev_country = None
        if table:
            for row in table.find('tbody').find_all('tr'):
                # The country name lives in a <th>; rows for additional regions
                # omit it, so fall back to the last country seen (prev_country)
                try:
                    country_col = row.find('th').text.strip()
                    prev_country = country_col
                except AttributeError:
                    pass
                other_col = row.find_all('td')
                if len(other_col) > 2:
                    countries.append(prev_country)
                    regions_states.append(other_col[0].text.strip())
                    start_dates.append(other_col[1].text.strip())
                    end_dates.append(other_col[2].text.strip())
            return pd.DataFrame({
                'Country': countries,
                'Regions/States': regions_states,
                'DST Start Date': start_dates,
                'DST End Date': end_dates
            })
        else:
            return None

    data = requests.get('https://www.timeanddate.com/time/dst/2023.html', headers={"Accept-Language": "en"})
    soup = BeautifulSoup(data.text, 'html.parser')
    table = soup.find('table', class_='table table--inner-borders-all table--left table--striped table--hover')
    df = extract_data_from_table(table)
    if df is not None:
        print(df)

Output:

    Country Regions/States DST Start Date DST End Date
    0 Åland Islands All locations Sunday, 26 March Sunday, 29 October
    1 Albania All locations Sunday, 26 March Sunday, 29 October
    2 Andorra All locations Sunday, 26 March Sunday, 29 October
    3 Antarctica Some locations Sunday, 24 September Sunday, 2 April
    4 Antarctica Troll Station Sunday, 19 March Sunday, 29 October
    5 Australia Most locations Sunday, 1 October Sunday, 2 April
    6 Australia Lord Howe Island Sunday, 1 October Sunday, 2 April
    7 Austria All locations Sunday, 26 March Sunday, 29 October
    8 Belgium All locations Sunday, 26 March Sunday, 29 October
    9 Bermuda All locations Sunday, 12 March Sunday, 5 November
    10 Bosnia and Herzegovina All locations Sunday, 26 March Sunday, 29 October
    11 Bulgaria All locations Sunday, 26 March Sunday, 29 October
    12 Canada Most locations Sunday, 12 March Sunday, 5 November
    13 Chile Most locations Sunday, 3 September Sunday, 2 April
    14 Chile Easter Island Saturday, 2 September Saturday, 1 April
    15 Croatia All locations Sunday, 26 March Sunday, 29 October
    16 Cuba All locations Sunday, 12 March Sunday, 5 November
    17 Cyprus All locations Sunday, 26 March Sunday, 29 October
    18 Czechia All locations Sunday, 26 March Sunday, 29 October
    19 Denmark All locations Sunday, 26 March Sunday, 29 October
    20 Egypt All locations Friday, 28 April Friday, 27 October
    21 Estonia All locations Sunday, 26 March Sunday, 29 October
    22 Faroe Islands All locations Sunday, 26 March Sunday, 29 October
    23 Fiji All locations Sunday, 12 November Does not end this year
    24 Finland All locations Sunday, 26 March Sunday, 29 October
    25 France Most locations Sunday, 26 March Sunday, 29 October
    26 Germany All locations Sunday, 26 March Sunday, 29 October
    27 Gibraltar All locations Sunday, 26 March Sunday, 29 October
    28 Greece All locations Sunday, 26 March Sunday, 29 October
    29 Greenland Most locations Saturday, 25 March Saturday, 28 October
    30 Greenland Ittoqqortoormiit Sunday, 26 March Sunday, 29 October
    31 Greenland Thule Air Base Sunday, 12 March Sunday, 5 November
    32 Guernsey All locations Sunday, 26 March Sunday, 29 October
    33 Haiti All locations Sunday, 12 March Sunday, 5 November
    34 Hungary All locations Sunday, 26 March Sunday, 29 October
    35 Ireland All locations Sunday, 26 March Sunday, 29 October
    36 Isle of Man All locations Sunday, 26 March Sunday, 29 October
    37 Israel All locations Friday, 24 March Sunday, 29 October
    38 Italy All locations Sunday, 26 March Sunday, 29 October
    39 Jersey All locations Sunday, 26 March Sunday, 29 October
    40 Kosovo All locations Sunday, 26 March Sunday, 29 October
    41 Latvia All locations Sunday, 26 March Sunday, 29 October
    42 Lebanon All locations Thursday, 30 March Sunday, 29 October
    43 Liechtenstein All locations Sunday, 26 March Sunday, 29 October
    44 Lithuania All locations Sunday, 26 March Sunday, 29 October
    45 Luxembourg All locations Sunday, 26 March Sunday, 29 October
    46 Malta All locations Sunday, 26 March Sunday, 29 October
    47 Mexico Baja California, much of Chihuahua, much of Ta... Sunday, 12 March Sunday, 5 November
    48 Moldova All locations Sunday, 26 March Sunday, 29 October
    49 Monaco All locations Sunday, 26 March Sunday, 29 October
    50 Montenegro All locations Sunday, 26 March Sunday, 29 October
    51 Morocco All locations Sunday, 23 April Sunday, 19 March
    52 Netherlands Most locations Sunday, 26 March Sunday, 29 October
    53 New Zealand All locations Sunday, 24 September Sunday, 2 April
    54 Norfolk Island All locations Sunday, 1 October Sunday, 2 April
    55 North Macedonia All locations Sunday, 26 March Sunday, 29 October
    56 Norway All locations Sunday, 26 March Sunday, 29 October
    57 Palestine All locations Saturday, 29 April Saturday, 28 October
    58 Paraguay All locations Sunday, 1 October Sunday, 26 March
    59 Poland All locations Sunday, 26 March Sunday, 29 October
    60 Portugal All locations Sunday, 26 March Sunday, 29 October
    61 Romania All locations Sunday, 26 March Sunday, 29 October
    62 Saint Pierre and Miquelon All locations Sunday, 12 March Sunday, 5 November
    63 San Marino All locations Sunday, 26 March Sunday, 29 October
    64 Serbia All locations Sunday, 26 March Sunday, 29 October
    65 Slovakia All locations Sunday, 26 March Sunday, 29 October
    66 Slovenia All locations Sunday, 26 March Sunday, 29 October
    67 Spain All locations Sunday, 26 March Sunday, 29 October
    68 Sweden All locations Sunday, 26 March Sunday, 29 October
    69 Switzerland All locations Sunday, 26 March Sunday, 29 October
    70 The Bahamas All locations Sunday, 12 March Sunday, 5 November
    71 Turks and Caicos Islands All locations Sunday, 12 March Sunday, 5 November
    72 Ukraine Most locations Sunday, 26 March Sunday, 29 October
    73 United Kingdom All locations Sunday, 26 March Sunday, 29 October
    74 United States Most locations Sunday, 12 March Sunday, 5 November
    75 Vatican City (Holy See) All locations Sunday, 26 March Sunday, 29 October
    76 Western Sahara All locations Sunday, 23 April Sunday, 19 March
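
Once the DataFrame looks right, persisting it is a one-liner; a usage sketch (the file name is arbitrary):

    # Write the scraped table to CSV without the index column
    df.to_csv('dst_2023.csv', index=False)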

To make sure the result is in English, pass the Accept-Language header along with the request: headers={"Accept-Language": "en"}.

[UPDATE]: To answer your second question from the comments:

    # Make the URL dynamic by detecting the current year
    from datetime import datetime

    data = requests.get(f'https://www.timeanddate.com/time/dst/{datetime.now().year}.html', headers={"Accept-Language": "en"})

You can simply pass datetime.now().year to get the current year. If you run it next year, the URL will become https://www.timeanddate.com/time/dst/2024.html, and so on.
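
To tie it together, a minimal sketch of a helper that fetches the DST page for an arbitrary year (the function name and default are illustrative, not part of the original answer):

    from datetime import datetime
    import requests

    def fetch_dst_page(year=None):
        # Default to the current year when no year is given
        year = year or datetime.now().year
        url = f"https://www.timeanddate.com/time/dst/{year}.html"
        return requests.get(url, headers={"Accept-Language": "en"})

    # Example: current year's page, ready to be parsed or fed to pd.read_html
    resp = fetch_dst_page()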
