
Python web scraping with Selenium returns an empty list

Question


I am trying to scrape a table from a website as a trial. Since I am not very good at scraping, I borrowed code from other sites and customised it.

There are two main problems:

  1. it seems no results are scraped
  2. when I try to save the result, it tells me that 'df' is not defined, even though df is assigned in the code

I have tried displaying the scraped table with another piece of code (shown further below); that worked, but I just can't save it into a CSV with the append() function.

Major goal: scrape the table and save it to a CSV file.
Minor goal: write a loop that gives proper names to the month columns, so that the code can be shortened a bit (a sketch of this is included after the code below).

Any help is appreciated!! Thanks!!

This is the code that returns an empty list and raises the 'df' is not defined error in the final step. Note that df is only assigned inside the try block, so if the very first find_element call raises NoSuchElementException (for example, because the table has not finished loading, or because row 1 of the table contains th header cells rather than td cells), the loop exits with an empty templist and df is never created.

# assumed imports (not shown in the original post)
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

# driver
driver = webdriver.Chrome(executable_path="path/chromedriver.exe")

url = 'https://www.hko.gov.hk/en/cis/awsDailyElement.htm?stn=WB8&ele=PREV_DIR&y=2022'

driver.get(url)

#creating empty list
r = 1
templist = []

#start scraping and parsing

while(1):
    try:
        day = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[1]").text
        Jan = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[2]").text
        Feb = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[3]").text
        Mar = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[4]").text
        Apr = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[5]").text
        May = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[6]").text
        Jun = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[7]").text
        Jul = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[8]").text
        Aug = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[9]").text
        Sep = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[10]").text
        Oct = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[11]").text
        Nov = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[12]").text
        Dec = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[13]").text
        
        Table_dict = {' ': day, 
                      'Jan': Jan,
                      'Feb': Feb,
                      'Mar': Mar,
                      'Apr': Apr,
                      'May': May,
                      'Jun': Jun,
                      'Jul': Jul,
                      'Aug': Aug,
                      'Sep': Sep,
                      'Oct': Oct,
                      'Nov': Nov,
                      'Dec': Dec}

        print (Feb)
        templist.append(Table_dict)
        df = pd.DataFrame(templist)
 
        r += 1
        print (df)

    # if there are no more table data to scrape
    except NoSuchElementException:
        break
 
# saving the dataframe to a csv
df.to_csv('D:/FYP/WIND_Prev_dir/WB8.csv')
driver.close()
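
Addressing the minor goal, here is a minimal sketch that replaces the thirteen near-identical find_element calls with a loop over the month names, starts at row 2 (row 1 of this table holds the th headers, which is why td[1] is missing in tr[1]), and builds the DataFrame once after the loop so that df is always defined. It assumes the same XPath layout, imports and driver set-up as above; as in the answer below, a wait may still be needed for the table to finish loading.

# sketch of the month loop; same XPath layout as above is assumed
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
base = "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr[{}]/td[{}]"

r = 2  # row 1 holds the th headers, so the data starts at row 2
templist = []
while True:
    try:
        # td[1] is the day of the month; td[2]..td[13] are Jan..Dec
        row = {' ': driver.find_element(By.XPATH, base.format(r, 1)).text}
        for c, month in enumerate(months, start=2):
            row[month] = driver.find_element(By.XPATH, base.format(r, c)).text
        templist.append(row)
        r += 1
    except NoSuchElementException:
        break

df = pd.DataFrame(templist)  # created after the loop, so it always exists
df.to_csv('D:/FYP/WIND_Prev_dir/WB8.csv')
driver.close()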

This is the code I used for displaying the scraped table, but I am not sure how to turn its output into a CSV file (a sketch of one way to do that follows the code):

# Obtain the number of rows in body
rows = len(driver.find_elements(By.XPATH,
    "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr"))
 
# Obtain the number of columns in table
cols = len(driver.find_elements(By.XPATH,
    "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr[1]/th"))
 
# Print rows and columns
print(rows)
print(cols)
 
# Printing the table headers
print("        " + "       ".join(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]))

# Printing the data of the table
for r in range(2, rows+1):
    for p in range(1, cols+1):
        # obtaining the text from each column of the table
        value = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td["+str(p)+"]").text
        if value == " ":
            print ("N.A.", end = '       ')
        else:
            print(value, end='       ')
        #df.append(value)
    print()
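
One way to save what this loop prints is to collect the cells into a list of rows instead of printing them, then write the rows out with the csv module. This is a minimal sketch assuming the driver, rows and cols variables from above; the 'Day' label for the first column is an assumed name, not taken from the page.

import csv

# sketch: gather the table into a list of rows, then write it as CSV
# ('Day' is an assumed label for the first column)
header = ['Day', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
data = []
for r in range(2, rows + 1):
    row = []
    for p in range(1, cols + 1):
        value = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td["+str(p)+"]").text
        row.append("N.A." if value == " " else value)
    data.append(row)

with open('D:/FYP/WIND_Prev_dir/WB8.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(data)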

Answer 1

Score: 0


Here's how you can scrape the whole table and save it into a CSV file.

import pandas as pd
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
import selenium.webdriver.support.expected_conditions as EC

driver = Chrome()

url = 'https://www.hko.gov.hk/en/cis/awsDailyElement.htm?stn=WB8&ele=PREV_DIR&y=2022'
driver.get(url)

# wait until every row of the table (id="t1") is visible, then collect the rows
table = WebDriverWait(driver, 5).until(
    EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'table[id="t1"] > tr')))

# the first row holds the column headers
columns = [i.text for i in table[0].find_elements(By.TAG_NAME, 'th')]
table_dict = {col: [] for col in columns}

# pair each cell of the remaining rows with its column name
for row in table[1:]:
    for data in zip(columns, [i.text for i in row.find_elements(By.TAG_NAME, 'td')]):
        table_dict[data[0]].append(data[1])

driver.close()

df = pd.DataFrame(table_dict)
# saving the dataframe to a csv
df.to_csv('data.csv', index=False)

A few things to note:

  1. After hitting the URL, we wait for the table to become visible on the page, and collect all of its rows (tr), the first of which holds the table's column headers.
  2. The variable columns is a list holding the table's column names (the first row's data, table[0]).
  3. Next, we initialise table_dict, using the column names as its keys, each mapped to an empty list.
  4. After that, we iterate over the remaining rows of the table, zip the list of columns with the row's cell data, and append each cell to its column's list.
  5. Finally, we create a dataframe from table_dict and save it into a CSV file, data.csv.
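
As an aside, once Selenium has loaded the page, pandas can also parse the rendered table straight from the page source. This is only a sketch of an alternative, not part of the answer above: it must run before driver.close(), and it assumes an HTML parser backend such as lxml is installed and that the table keeps its id of t1.

from io import StringIO
import pandas as pd

# sketch: let pandas parse the rendered table directly from the page source
# (run before driver.close(); assumes an lxml-style parser backend is installed)
tables = pd.read_html(StringIO(driver.page_source), attrs={'id': 't1'})
df = tables[0]
df.to_csv('data.csv', index=False)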
