Selenium WebDriverWait TimeoutException When Trying to Fetch Data into Pandas DataFrame
Question
Hello Stack Overflow community,
I'm currently working on a Python script that involves fetching data from a webpage and storing it in a pandas DataFrame. However, I'm encountering an issue where the DataFrame is returning as null. I'm unable to fetch the records as expected.
Here's the code I'm working with:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import pandas as pd

def extract_data_from_table(table):
    countries = []
    regions_states = []
    start_dates = []
    end_dates = []
    if table is not None:
        for row in table.find_all('tr')[1:]:
            columns = row.find_all('td')
            if len(columns) >= 4:
                countries.append(columns[0].text.strip())
                regions_states.append(columns[1].text.strip())
                start_dates.append(columns[2].text.strip())
                end_dates.append(columns[3].text.strip())
        return pd.DataFrame({
            'Country': countries,
            'Regions/States': regions_states,
            'DST Start Date': start_dates,
            'DST End Date': end_dates
        })
    else:
        return None

url = "https://www.timeanddate.com/time/dst/2023.html"

# Create an instance of Chrome Options
options = Options()
options.add_argument("start-maximized")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

# Set up a WebDriverWait that will wait up to 1000 seconds for the table to appear
wait = WebDriverWait(driver, 1000)
driver.get(url)

# Wait for the table to appear
wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'table table--inner-borders-all table--left table--striped table--hover')))

# Get the page source and parse it with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.find('table', class_='table table--inner-borders-all table--left table--striped table--hover')
df = extract_data_from_table(table)
driver.quit()

if df is not None:
    print(df)
When I run this code, I expect to see a DataFrame filled with the records I'm trying to fetch. However, what I'm getting instead is a null DataFrame. I've tried to debug this issue by checking the source of the records and ensuring that the data is indeed there, but I'm still unable to populate the DataFrame.
Here's the error message I'm receiving:
Traceback (most recent call last):
  File "/Users/rajeevranjanpandey/test.py", line 51, in <module>
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'table table--inner-borders-all table--left table--striped table--hover')))
  File "/Users/rajeevranjanpandey/Library/Python/3.9/lib/python/site-packages/selenium/webdriver/support/wait.py", line 95, in until
    raise TimeoutException(message, screen, stacktrace)
I'm relatively new to Python, Selenium, and pandas, so I'm not sure what I'm doing wrong. Could anyone suggest what might be the issue here? Any help would be greatly appreciated.
Thank you in advance!
Here are the steps I've taken to try to solve this problem:
- Checked the URL to ensure it's correct and the webpage is accessible.
- Verified that the table I'm trying to scrape exists on the webpage.
- Checked the class name of the table in the webpage's HTML to ensure it matches the one in my code.
- Increased the WebDriverWait timeout to see if the table just needed more time to load.
Despite these steps, I'm still encountering the same issue.
Answer 1
Score: 0
By.CLASS_NAME accepts only a single class name, not several space-separated class names. Use By.CSS_SELECTOR instead.
from io import StringIO

# Wait for the table to appear and keep a reference to the element
table_selector = '.table.table--inner-borders-all.table--left.table--striped.table--hover'
table_element = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, table_selector)))
# Get the HTML of the table element (the property name is case-sensitive: "outerHTML")
table_content = table_element.get_attribute("outerHTML")
# Use the built-in pandas reader to get the DataFrame; no need for BeautifulSoup or manual parsing.
# Wrapping the markup in StringIO avoids the deprecation warning newer pandas versions emit for literal HTML strings.
df = pd.read_html(StringIO(table_content))[0]
print(df)
Alternatively, you can do it in two lines of code, without Selenium at all.
tables = pd.read_html("https://www.timeanddate.com/time/dst/2023.html")
print(tables[0])
Snapshot:
[![enter image description here][1]][1]
[1]: https://i.stack.imgur.com/aGwg6.png
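To illustrate the point about By.CLASS_NAME: a space-separated class attribute has to become a compound CSS selector in which every class is prefixed with a dot. The following is only a minimal sketch; the helper name `classes_to_css_selector` is hypothetical and not part of Selenium, and it assumes the class names contain no characters that would need CSS escaping.

```python
def classes_to_css_selector(class_attr: str, tag: str = "") -> str:
    """Turn a space-separated class attribute into a compound CSS selector.

    Hypothetical helper for illustration; it does no CSS escaping.
    """
    classes = class_attr.split()
    if not classes:
        raise ValueError("class_attr must contain at least one class name")
    # Each class gets a leading dot; an optional tag name goes in front.
    return tag + "".join(f".{name}" for name in classes)


selector = classes_to_css_selector(
    "table table--inner-borders-all table--left table--striped table--hover",
    tag="table",
)
print(selector)
```

The resulting string could then be passed as `(By.CSS_SELECTOR, selector)` to `EC.presence_of_element_located`.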
Answer 2
Score: 0
Here's the complete solution as per your parsing logic.
Without using Selenium (with requests+BeautifulSoup):
from bs4 import BeautifulSoup
import pandas as pd
import requests

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

def extract_data_from_table(table):
    countries = []
    regions_states = []
    start_dates = []
    end_dates = []
    prev_country = None
    if table:
        for row in table.find('tbody').find_all('tr'):
            # The country name sits in a <th>; rows without one reuse the previous country
            try:
                country_col = row.find('th').text.strip()
                prev_country = country_col
            except AttributeError:
                pass
            other_col = row.find_all('td')
            if len(other_col) > 2:
                countries.append(prev_country)
                regions_states.append(other_col[0].text.strip())
                start_dates.append(other_col[1].text.strip())
                end_dates.append(other_col[2].text.strip())
        return pd.DataFrame({
            'Country': countries,
            'Regions/States': regions_states,
            'DST Start Date': start_dates,
            'DST End Date': end_dates
        })
    else:
        return None

data = requests.get('https://www.timeanddate.com/time/dst/2023.html', headers={"Accept-Language": "en"})
soup = BeautifulSoup(data.text, 'html.parser')
table = soup.find('table', class_='table table--inner-borders-all table--left table--striped table--hover')
df = extract_data_from_table(table)

if df is not None:
    print(df)
Output:
    Country                    Regions/States                                     DST Start Date         DST End Date
0   Åland Islands              All locations                                      Sunday, 26 March       Sunday, 29 October
1   Albania                    All locations                                      Sunday, 26 March       Sunday, 29 October
2   Andorra                    All locations                                      Sunday, 26 March       Sunday, 29 October
3   Antarctica                 Some locations                                     Sunday, 24 September   Sunday, 2 April
4   Antarctica                 Troll Station                                      Sunday, 19 March       Sunday, 29 October
5   Australia                  Most locations                                     Sunday, 1 October      Sunday, 2 April
6   Australia                  Lord Howe Island                                   Sunday, 1 October      Sunday, 2 April
7   Austria                    All locations                                      Sunday, 26 March       Sunday, 29 October
8   Belgium                    All locations                                      Sunday, 26 March       Sunday, 29 October
9   Bermuda                    All locations                                      Sunday, 12 March       Sunday, 5 November
10  Bosnia and Herzegovina     All locations                                      Sunday, 26 March       Sunday, 29 October
11  Bulgaria                   All locations                                      Sunday, 26 March       Sunday, 29 October
12  Canada                     Most locations                                     Sunday, 12 March       Sunday, 5 November
13  Chile                      Most locations                                     Sunday, 3 September    Sunday, 2 April
14  Chile                      Easter Island                                      Saturday, 2 September  Saturday, 1 April
15  Croatia                    All locations                                      Sunday, 26 March       Sunday, 29 October
16  Cuba                       All locations                                      Sunday, 12 March       Sunday, 5 November
17  Cyprus                     All locations                                      Sunday, 26 March       Sunday, 29 October
18  Czechia                    All locations                                      Sunday, 26 March       Sunday, 29 October
19  Denmark                    All locations                                      Sunday, 26 March       Sunday, 29 October
20  Egypt                      All locations                                      Friday, 28 April       Friday, 27 October
21  Estonia                    All locations                                      Sunday, 26 March       Sunday, 29 October
22  Faroe Islands              All locations                                      Sunday, 26 March       Sunday, 29 October
23  Fiji                       All locations                                      Sunday, 12 November    Does not end this year
24  Finland                    All locations                                      Sunday, 26 March       Sunday, 29 October
25  France                     Most locations                                     Sunday, 26 March       Sunday, 29 October
26  Germany                    All locations                                      Sunday, 26 March       Sunday, 29 October
27  Gibraltar                  All locations                                      Sunday, 26 March       Sunday, 29 October
28  Greece                     All locations                                      Sunday, 26 March       Sunday, 29 October
29  Greenland                  Most locations                                     Saturday, 25 March     Saturday, 28 October
30  Greenland                  Ittoqqortoormiit                                   Sunday, 26 March       Sunday, 29 October
31  Greenland                  Thule Air Base                                     Sunday, 12 March       Sunday, 5 November
32  Guernsey                   All locations                                      Sunday, 26 March       Sunday, 29 October
33  Haiti                      All locations                                      Sunday, 12 March       Sunday, 5 November
34  Hungary                    All locations                                      Sunday, 26 March       Sunday, 29 October
35  Ireland                    All locations                                      Sunday, 26 March       Sunday, 29 October
36  Isle of Man                All locations                                      Sunday, 26 March       Sunday, 29 October
37  Israel                     All locations                                      Friday, 24 March       Sunday, 29 October
38  Italy                      All locations                                      Sunday, 26 March       Sunday, 29 October
39  Jersey                     All locations                                      Sunday, 26 March       Sunday, 29 October
40  Kosovo                     All locations                                      Sunday, 26 March       Sunday, 29 October
41  Latvia                     All locations                                      Sunday, 26 March       Sunday, 29 October
42  Lebanon                    All locations                                      Thursday, 30 March     Sunday, 29 October
43  Liechtenstein              All locations                                      Sunday, 26 March       Sunday, 29 October
44  Lithuania                  All locations                                      Sunday, 26 March       Sunday, 29 October
45  Luxembourg                 All locations                                      Sunday, 26 March       Sunday, 29 October
46  Malta                      All locations                                      Sunday, 26 March       Sunday, 29 October
47  Mexico                     Baja California, much of Chihuahua, much of Ta...  Sunday, 12 March       Sunday, 5 November
48  Moldova                    All locations                                      Sunday, 26 March       Sunday, 29 October
49  Monaco                     All locations                                      Sunday, 26 March       Sunday, 29 October
50  Montenegro                 All locations                                      Sunday, 26 March       Sunday, 29 October
51  Morocco                    All locations                                      Sunday, 23 April       Sunday, 19 March
52  Netherlands                Most locations                                     Sunday, 26 March       Sunday, 29 October
53  New Zealand                All locations                                      Sunday, 24 September   Sunday, 2 April
54  Norfolk Island             All locations                                      Sunday, 1 October      Sunday, 2 April
55  North Macedonia            All locations                                      Sunday, 26 March       Sunday, 29 October
56  Norway                     All locations                                      Sunday, 26 March       Sunday, 29 October
57  Palestine                  All locations                                      Saturday, 29 April     Saturday, 28 October
58  Paraguay                   All locations                                      Sunday, 1 October      Sunday, 26 March
59  Poland                     All locations                                      Sunday, 26 March       Sunday, 29 October
60  Portugal                   All locations                                      Sunday, 26 March       Sunday, 29 October
61  Romania                    All locations                                      Sunday, 26 March       Sunday, 29 October
62  Saint Pierre and Miquelon  All locations                                      Sunday, 12 March       Sunday, 5 November
63  San Marino                 All locations                                      Sunday, 26 March       Sunday, 29 October
64  Serbia                     All locations                                      Sunday, 26 March       Sunday, 29 October
65  Slovakia                   All locations                                      Sunday, 26 March       Sunday, 29 October
66  Slovenia                   All locations                                      Sunday, 26 March       Sunday, 29 October
67  Spain                      All locations                                      Sunday, 26 March       Sunday, 29 October
68  Sweden                     All locations                                      Sunday, 26 March       Sunday, 29 October
69  Switzerland                All locations                                      Sunday, 26 March       Sunday, 29 October
70  The Bahamas                All locations                                      Sunday, 12 March       Sunday, 5 November
71  Turks and Caicos Islands   All locations                                      Sunday, 12 March       Sunday, 5 November
72  Ukraine                    Most locations                                     Sunday, 26 March       Sunday, 29 October
73  United Kingdom             All locations                                      Sunday, 26 March       Sunday, 29 October
74  United States              Most locations                                     Sunday, 12 March       Sunday, 5 November
75  Vatican City (Holy See)    All locations                                      Sunday, 26 March       Sunday, 29 October
76  Western Sahara             All locations                                      Sunday, 23 April       Sunday, 19 March
To make sure that the result is in English, pass the Accept-Language header along with the request: headers={"Accept-Language": "en"}
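The row handling in extract_data_from_table can be seen in isolation with a small standard-library sketch: the country is read from a `<th>` cell and carried forward to following rows that have none. The sample markup below is an assumption made for illustration (including the `rowspan` attribute); the real page's markup may differ, and the class name `DstTableParser` is hypothetical. The cell values are taken from the output shown above.

```python
from html.parser import HTMLParser


class DstTableParser(HTMLParser):
    """Collect (country, region, start, end) rows, carrying the last <th> text forward."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows as tuples
        self._country = None   # text of the most recent <th>
        self._cells = []       # <td> texts of the row being read
        self._current = None   # "th" or "td" while inside a cell
        self._text = []        # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag in ("th", "td"):
            self._current = tag
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._current == tag:
            text = "".join(self._text).strip()
            if tag == "th":
                self._country = text
            else:
                self._cells.append(text)
            self._current = None
        elif tag == "tr" and len(self._cells) >= 3:
            # Rows without their own <th> reuse the previously seen country.
            self.rows.append((self._country, *self._cells[:3]))


# Hypothetical markup, assumed for illustration only.
SAMPLE = """
<table><tbody>
<tr><th rowspan="2">Antarctica</th><td>Some locations</td><td>Sunday, 24 September</td><td>Sunday, 2 April</td></tr>
<tr><td>Troll Station</td><td>Sunday, 19 March</td><td>Sunday, 29 October</td></tr>
<tr><th>Austria</th><td>All locations</td><td>Sunday, 26 March</td><td>Sunday, 29 October</td></tr>
</tbody></table>
"""

parser = DstTableParser()
parser.feed(SAMPLE)
parser.close()
for row in parser.rows:
    print(row)
```

This also suggests why reading four `<td>` cells per row, as in the question's code, would likely produce no rows here: under this assumption each data row has only three `<td>` cells, with the country in a `<th>`.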
[UPDATE]: To answer your second question from the comments:
# Build the URL dynamically from the current year
from datetime import datetime
data = requests.get(f'https://www.timeanddate.com/time/dst/{datetime.now().year}.html', headers={"Accept-Language": "en"})
You can simply use datetime.now().year to get the current year. If you run the script in 2024, the URL will be https://www.timeanddate.com/time/dst/2024.html, and so on.
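The URL construction can also be pulled into a small function so that a specific year can be requested or the current one used by default. This is only a sketch: the names `BASE_URL` and `dst_url` are hypothetical, and it assumes the site keeps the same URL pattern for other years.

```python
from datetime import date
from typing import Optional

# URL pattern taken from the answer above; other years are assumed to follow it.
BASE_URL = "https://www.timeanddate.com/time/dst/{year}.html"


def dst_url(year: Optional[int] = None) -> str:
    """Return the DST overview URL for the given year (current year by default)."""
    if year is None:
        year = date.today().year
    return BASE_URL.format(year=year)


print(dst_url(2023))  # fixed year
print(dst_url())      # whatever the current year is when the script runs
```

The returned string can be passed to requests.get together with the Accept-Language header, exactly as in the code above.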