Selenium WebDriverWait TimeoutException When Trying to Fetch Data into Pandas DataFrame


Question


Hello Stack Overflow community,

I'm currently working on a Python script that involves fetching data from a webpage and storing it in a pandas DataFrame. However, I'm encountering an issue where the DataFrame is returning as null. I'm unable to fetch the records as expected.

Here's the code I'm working with:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    from webdriver_manager.chrome import ChromeDriverManager
    from bs4 import BeautifulSoup
    import pandas as pd

    def extract_data_from_table(table):
        countries = []
        regions_states = []
        start_dates = []
        end_dates = []
        if table is not None:
            for row in table.find_all('tr')[1:]:
                columns = row.find_all('td')
                if len(columns) >= 4:
                    countries.append(columns[0].text.strip())
                    regions_states.append(columns[1].text.strip())
                    start_dates.append(columns[2].text.strip())
                    end_dates.append(columns[3].text.strip())
            return pd.DataFrame({
                'Country': countries,
                'Regions/States': regions_states,
                'DST Start Date': start_dates,
                'DST End Date': end_dates
            })
        else:
            return None

    url = "https://www.timeanddate.com/time/dst/2023.html"

    # Create an instance of Chrome Options
    options = Options()
    options.add_argument("start-maximized")
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

    # Set up a WebDriverWait that will wait up to 1000 seconds for the table to appear
    wait = WebDriverWait(driver, 1000)
    driver.get(url)

    # Wait for the table to appear
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'table table--inner-borders-all table--left table--striped table--hover')))

    # Get the page source and parse it with BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    table = soup.find('table', class_='table table--inner-borders-all table--left table--striped table--hover')
    df = extract_data_from_table(table)
    driver.quit()

    if df is not None:
        print(df)

When I run this code, I expect to see a DataFrame filled with the records I'm trying to fetch. However, what I'm getting instead is a null DataFrame. I've tried to debug this issue by checking the source of the records and ensuring that the data is indeed there, but I'm still unable to populate the DataFrame.

Here's the error message I'm receiving:

    Traceback (most recent call last):
      File "/Users/rajeevranjanpandey/test.py", line 51, in <module>
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'table table--inner-borders-all table--left table--striped table--hover')))
      File "/Users/rajeevranjanpandey/Library/Python/3.9/lib/python/site-packages/selenium/webdriver/support/wait.py", line 95, in until
        raise TimeoutException(message, screen, stacktrace)

Here are the steps I've taken to try to solve this problem:

  • Checked the URL to ensure it's correct and the webpage is accessible.
  • Verified that the table I'm trying to scrape exists on the webpage.
  • Checked the class name of the table in the webpage's HTML to ensure it matches the one in my code.
  • Increased the WebDriverWait timeout to see if the table just needed more time to load.

Despite these steps, I'm still encountering the same issue. I'm relatively new to Python, Selenium, and pandas, so I'm not sure what I'm doing wrong. Could anyone suggest what might be the issue here? Any help would be greatly appreciated.

Thank you in advance!

Answer 1

Score: 0


By.CLASS_NAME accepts only a single class value, not a space-separated list of classes, so your compound locator never matches anything and the wait eventually times out. Use By.CSS_SELECTOR instead:

    # Wait for the table to appear
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.table.table--inner-borders-all.table--left.table--striped.table--hover')))
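
As an aside, if you would rather keep By.CLASS_NAME, a minimal sketch is to pass just one of the classes, reusing the wait, By and EC objects already set up in your script. This assumes table--striped appears only on the target table, which you would need to verify on the page:

    # Assumption: 'table--striped' uniquely identifies the DST table on this page
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'table--striped')))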

    # Get the HTML of the table element
    tableContent = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.table.table--inner-borders-all.table--left.table--striped.table--hover'))).get_attribute("outerHTML")
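
Since until() returns the element it located, the two waits can also be collapsed into one; a small sketch of the same idea:

    # presence_of_element_located returns the WebElement once it is found
    table_el = wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, '.table.table--inner-borders-all.table--left.table--striped.table--hover')))
    tableContent = table_el.get_attribute("outerHTML")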

    # Use pandas' built-in parser to build the DataFrame; no need for BeautifulSoup
    df = pd.read_html(tableContent)[0]
    print(df)
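
One caveat, assuming you are on a recent pandas (2.1 or newer, where passing a literal HTML string to read_html is deprecated): wrapping the string in StringIO keeps the same snippet warning-free.

    from io import StringIO

    # Newer pandas versions prefer a file-like object over a raw HTML string
    df = pd.read_html(StringIO(tableContent))[0]
    print(df)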

Alternatively, you can do it in just two lines of code; Selenium isn't even needed:

    df = pd.read_html("https://www.timeanddate.com/time/dst/2023.html")
    print(df[0])
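
read_html returns every table it finds on the page, so indexing with [0] relies on the DST table coming first. A hedged alternative is the match argument, which keeps only tables whose text matches the given string (this assumes "Country" occurs only in the target table):

    # Keep only tables containing the text "Country"
    df = pd.read_html("https://www.timeanddate.com/time/dst/2023.html", match="Country")[0]
    print(df)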

Snapshot:

[Screenshot of the resulting DataFrame: https://i.stack.imgur.com/aGwg6.png]

Answer 2

Score: 0


Here's the complete solution as per your parsing logic.

Without using Selenium (with requests+BeautifulSoup):

    from bs4 import BeautifulSoup
    import pandas as pd
    import requests

    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)

    def extract_data_from_table(table):
        countries = []
        regions_states = []
        start_dates = []
        end_dates = []
        prev_country = None
        if table:
            for row in table.find('tbody').find_all('tr'):
                # The country name lives in a <th>; rows for additional regions
                # omit it, so fall back to the last country seen (prev_country)
                try:
                    country_col = row.find('th').text.strip()
                    prev_country = country_col
                except AttributeError:
                    pass
                other_col = row.find_all('td')
                if len(other_col) > 2:
                    countries.append(prev_country)
                    regions_states.append(other_col[0].text.strip())
                    start_dates.append(other_col[1].text.strip())
                    end_dates.append(other_col[2].text.strip())
            return pd.DataFrame({
                'Country': countries,
                'Regions/States': regions_states,
                'DST Start Date': start_dates,
                'DST End Date': end_dates
            })
        else:
            return None

    data = requests.get('https://www.timeanddate.com/time/dst/2023.html', headers={"Accept-Language": "en"})
    soup = BeautifulSoup(data.text, 'html.parser')
    table = soup.find('table', class_='table table--inner-borders-all table--left table--striped table--hover')
    df = extract_data_from_table(table)
    if df is not None:
        print(df)

Output:

    Country Regions/States DST Start Date DST End Date
    0 Åland Islands All locations Sunday, 26 March Sunday, 29 October
    1 Albania All locations Sunday, 26 March Sunday, 29 October
    2 Andorra All locations Sunday, 26 March Sunday, 29 October
    3 Antarctica Some locations Sunday, 24 September Sunday, 2 April
    4 Antarctica Troll Station Sunday, 19 March Sunday, 29 October
    5 Australia Most locations Sunday, 1 October Sunday, 2 April
    6 Australia Lord Howe Island Sunday, 1 October Sunday, 2 April
    7 Austria All locations Sunday, 26 March Sunday, 29 October
    8 Belgium All locations Sunday, 26 March Sunday, 29 October
    9 Bermuda All locations Sunday, 12 March Sunday, 5 November
    10 Bosnia and Herzegovina All locations Sunday, 26 March Sunday, 29 October
    11 Bulgaria All locations Sunday, 26 March Sunday, 29 October
    12 Canada Most locations Sunday, 12 March Sunday, 5 November
    13 Chile Most locations Sunday, 3 September Sunday, 2 April
    14 Chile Easter Island Saturday, 2 September Saturday, 1 April
    15 Croatia All locations Sunday, 26 March Sunday, 29 October
    16 Cuba All locations Sunday, 12 March Sunday, 5 November
    17 Cyprus All locations Sunday, 26 March Sunday, 29 October
    18 Czechia All locations Sunday, 26 March Sunday, 29 October
    19 Denmark All locations Sunday, 26 March Sunday, 29 October
    20 Egypt All locations Friday, 28 April Friday, 27 October
    21 Estonia All locations Sunday, 26 March Sunday, 29 October
    22 Faroe Islands All locations Sunday, 26 March Sunday, 29 October
    23 Fiji All locations Sunday, 12 November Does not end this year
    24 Finland All locations Sunday, 26 March Sunday, 29 October
    25 France Most locations Sunday, 26 March Sunday, 29 October
    26 Germany All locations Sunday, 26 March Sunday, 29 October
    27 Gibraltar All locations Sunday, 26 March Sunday, 29 October
    28 Greece All locations Sunday, 26 March Sunday, 29 October
    29 Greenland Most locations Saturday, 25 March Saturday, 28 October
    30 Greenland Ittoqqortoormiit Sunday, 26 March Sunday, 29 October
    31 Greenland Thule Air Base Sunday, 12 March Sunday, 5 November
    32 Guernsey All locations Sunday, 26 March Sunday, 29 October
    33 Haiti All locations Sunday, 12 March Sunday, 5 November
    34 Hungary All locations Sunday, 26 March Sunday, 29 October
    35 Ireland All locations Sunday, 26 March Sunday, 29 October
    36 Isle of Man All locations Sunday, 26 March Sunday, 29 October
    37 Israel All locations Friday, 24 March Sunday, 29 October
    38 Italy All locations Sunday, 26 March Sunday, 29 October
    39 Jersey All locations Sunday, 26 March Sunday, 29 October
    40 Kosovo All locations Sunday, 26 March Sunday, 29 October
    41 Latvia All locations Sunday, 26 March Sunday, 29 October
    42 Lebanon All locations Thursday, 30 March Sunday, 29 October
    43 Liechtenstein All locations Sunday, 26 March Sunday, 29 October
    44 Lithuania All locations Sunday, 26 March Sunday, 29 October
    45 Luxembourg All locations Sunday, 26 March Sunday, 29 October
    46 Malta All locations Sunday, 26 March Sunday, 29 October
    47 Mexico Baja California, much of Chihuahua, much of Ta... Sunday, 12 March Sunday, 5 November
    48 Moldova All locations Sunday, 26 March Sunday, 29 October
    49 Monaco All locations Sunday, 26 March Sunday, 29 October
    50 Montenegro All locations Sunday, 26 March Sunday, 29 October
    51 Morocco All locations Sunday, 23 April Sunday, 19 March
    52 Netherlands Most locations Sunday, 26 March Sunday, 29 October
    53 New Zealand All locations Sunday, 24 September Sunday, 2 April
    54 Norfolk Island All locations Sunday, 1 October Sunday, 2 April
    55 North Macedonia All locations Sunday, 26 March Sunday, 29 October
    56 Norway All locations Sunday, 26 March Sunday, 29 October
    57 Palestine All locations Saturday, 29 April Saturday, 28 October
    58 Paraguay All locations Sunday, 1 October Sunday, 26 March
    59 Poland All locations Sunday, 26 March Sunday, 29 October
    60 Portugal All locations Sunday, 26 March Sunday, 29 October
    61 Romania All locations Sunday, 26 March Sunday, 29 October
    62 Saint Pierre and Miquelon All locations Sunday, 12 March Sunday, 5 November
    63 San Marino All locations Sunday, 26 March Sunday, 29 October
    64 Serbia All locations Sunday, 26 March Sunday, 29 October
    65 Slovakia All locations Sunday, 26 March Sunday, 29 October
    66 Slovenia All locations Sunday, 26 March Sunday, 29 October
    67 Spain All locations Sunday, 26 March Sunday, 29 October
    68 Sweden All locations Sunday, 26 March Sunday, 29 October
    69 Switzerland All locations Sunday, 26 March Sunday, 29 October
    70 The Bahamas All locations Sunday, 12 March Sunday, 5 November
    71 Turks and Caicos Islands All locations Sunday, 12 March Sunday, 5 November
    72 Ukraine Most locations Sunday, 26 March Sunday, 29 October
    73 United Kingdom All locations Sunday, 26 March Sunday, 29 October
    74 United States Most locations Sunday, 12 March Sunday, 5 November
    75 Vatican City (Holy See) All locations Sunday, 26 March Sunday, 29 October
    76 Western Sahara All locations Sunday, 23 April Sunday, 19 March
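
Once the DataFrame looks right, persisting it is a one-liner; a usage sketch (the file name is arbitrary):

    # Write the scraped table to CSV without the index column
    df.to_csv('dst_2023.csv', index=False)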

To make sure the result is in English, pass the Accept-Language header along with the request: headers={"Accept-Language": "en"}.

[UPDATE]: To answer your second question from the comments:

    # Make the URL dynamic by detecting the current year
    from datetime import datetime

    data = requests.get(f'https://www.timeanddate.com/time/dst/{datetime.now().year}.html', headers={"Accept-Language": "en"})

You can simply pass datetime.now().year to get the current year. If you run it next year, the URL will become https://www.timeanddate.com/time/dst/2024.html, and so on.
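
To tie it together, a minimal sketch of a helper that fetches the DST page for an arbitrary year (the function name and default are illustrative, not part of the original answer):

    from datetime import datetime
    import requests

    def fetch_dst_page(year=None):
        # Default to the current year when no year is given
        year = year or datetime.now().year
        url = f"https://www.timeanddate.com/time/dst/{year}.html"
        return requests.get(url, headers={"Accept-Language": "en"})

    # Example: current year's page, ready to be parsed or fed to pd.read_html
    resp = fetch_dst_page()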
