Python web scraping with selenium returns empty list
Question
I am trying to scrape the table from a website as a trial, and since I am not very good at scraping, I tried to use code from other websites and do some customization.
There are two main problems:
- it seems there are no results scraped
- when I try to save the result, it tells me that 'df' is not defined, even though it is defined in the code
I have tried displaying the scraped table with another piece of code, which worked, but I just can't save it into a CSV with the append() function.
Major goal: scrape the table and save it to a CSV.
Minor goal: write a loop that gives proper names to the month columns, so that the code can be shortened a bit (see the sketch after the first code block below).
Any help is appreciated!! Thanks!!
This is the code that returns an empty list and raises the error stating 'df' is not defined in the final step:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

#driver
driver = webdriver.Chrome(executable_path="path/chromedriver.exe")
url = 'https://www.hko.gov.hk/en/cis/awsDailyElement.htm?stn=WB8&ele=PREV_DIR&y=2022'
driver.get(url)

#creating empty list
r = 1
templist = []

#start scraping and parsing
while(1):
    try:
        day = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[1]").text
        Jan = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[2]").text
        Feb = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[3]").text
        Mar = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[4]").text
        Apr = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[5]").text
        May = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[6]").text
        Jun = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[7]").text
        Jul = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[8]").text
        Aug = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[9]").text
        Sep = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[10]").text
        Oct = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[11]").text
        Nov = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[12]").text
        Dec = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[13]").text
        Table_dict = {' ': day,
                      'Jan': Jan,
                      'Feb': Feb,
                      'Mar': Mar,
                      'Apr': Apr,
                      'May': May,
                      'Jun': Jun,
                      'Jul': Jul,
                      'Aug': Aug,
                      'Sep': Sep,
                      'Oct': Oct,
                      'Nov': Nov,
                      'Dec': Dec}
        print(Feb)
        templist.append(Table_dict)
        df = pd.DataFrame(templist)
        r += 1
        print(df)
    # if there are no more table data to scrape
    except NoSuchElementException:
        break

# saving the dataframe to a csv
df.to_csv('D:/FYP/WIND_Prev_dir/WB8.csv')
driver.close()
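As a side note on the minor goal: the thirteen near-identical find_element calls inside the try block could be collapsed into a loop over the month columns. This is only a minimal sketch, assuming the same XPath layout and the driver, r, and templist variables from the code above; it would replace the body of the try block:

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
base = "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr[{}]/td[{}]"

# the first cell of the row holds the day; the next twelve hold the months
Table_dict = {' ': driver.find_element(By.XPATH, base.format(r, 1)).text}
for col, month in enumerate(months, start=2):
    Table_dict[month] = driver.find_element(By.XPATH, base.format(r, col)).text
templist.append(Table_dict)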
This is the code I used for displaying the scraped table, but I am not sure how to turn it into a CSV (a sketch of one possibility follows the block):
# Obtain the number of rows in body
rows = len(driver.find_elements(By.XPATH,
    "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr"))

# Obtain the number of columns in table
cols = len(driver.find_elements(By.XPATH,
    "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr[1]/th"))

# Print rows and columns
print(rows)
print(cols)

# Printing the table headers
print(" " + "Jan" + " " + "Feb" + " " + "Mar" + " " + "Apr" + " " + "May" + " " + "Jun" + " " + "Jul" + " " + "Aug" + " " + "Sep" + " " + "Oct" + " " + "Nov" + " " + "Dec")

# Printing the data of the table
for r in range(2, rows + 1):
    for p in range(1, cols + 1):
        # obtaining the text from each column of the table
        value = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td["+str(p)+"]").text
        if value == " ":
            print("N.A.", end=' ')
        else:
            print(value, end=' ')
        #df.append(value)
    print()
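For reference, a minimal sketch of how this display loop could collect the values into a list of rows and then save them. It assumes the same XPaths, that rows and cols have been computed as above, and that the table has the day column plus twelve month columns:

data = []
for r in range(2, rows + 1):
    row_values = []
    for p in range(1, cols + 1):
        value = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td["+str(p)+"]").text
        # substitute a placeholder for blank cells, as in the print version
        row_values.append(value if value.strip() else "N.A.")
    data.append(row_values)

# first column is the day of month, followed by the twelve months
df = pd.DataFrame(data, columns=[' ', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                                 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
df.to_csv('WB8.csv', index=False)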
Answer 1
Score: 0
Here's how you can scrape the whole table's data and save it into a CSV file.
import pandas as pd
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
import selenium.webdriver.support.expected_conditions as EC

driver = Chrome()
url = 'https://www.hko.gov.hk/en/cis/awsDailyElement.htm?stn=WB8&ele=PREV_DIR&y=2022'
driver.get(url)

table = WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'table[id="t1"] > tr')))
columns = [i.text for i in table[0].find_elements(By.TAG_NAME, 'th')]
table_dict = {col: [] for col in columns}

for row in table[1:]:
    for data in zip(columns, [i.text for i in row.find_elements(By.TAG_NAME, 'td')]):
        table_dict[data[0]].append(data[1])

driver.close()

df = pd.DataFrame(table_dict)
# saving the dataframe to a csv
df.to_csv('data.csv', index=False)
A few things to note:
- After hitting the URL, we need to wait for the table to become visible on the page; we then collect all the table rows (tr), the first of which holds the table's column headers.
- The variable columns is a list that holds the table column names (the first row's data, table[0]).
- Next, we initialize a variable table_dict and use the columns as the keys of this dict, each with an empty list as its value.
- After that, we iterate over the remaining rows of the table, zip the list of columns with the row data, and iterate over the pairs to append each value to its column.
- Finally, we create a dataframe from table_dict and save it into a CSV file, data.csv.
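As a design note, once Selenium has rendered the page, pandas can often parse the table straight from the page source, which avoids the per-cell loop entirely. This is a sketch of that alternative, not the answer's method; it assumes the same table id t1 and that lxml or html5lib is installed for pandas.read_html:

import pandas as pd
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
import selenium.webdriver.support.expected_conditions as EC

driver = Chrome()
driver.get('https://www.hko.gov.hk/en/cis/awsDailyElement.htm?stn=WB8&ele=PREV_DIR&y=2022')
# wait until the table has rendered before grabbing the page source
WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'table#t1')))
# read_html parses every <table> in the HTML; attrs={'id': 't1'} narrows it to this one
df = pd.read_html(driver.page_source, attrs={'id': 't1'})[0]
driver.close()
df.to_csv('data.csv', index=False)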