Python web scraping with selenium returns empty list
Question
I am trying to scrape the table from a website as a trial, and since I am not very good at scraping, I tried to use code from other websites and do some customization.
There are two main problems:
- it seems there are no results scraped
- when I try to save the result, it tells me that 'df' is not defined, even though it is defined in the code
I have tried displaying the scraped table with another piece of code, which worked, but I just can't save it into a CSV with the append() function.
Major goal: scrape the table and save it to a CSV.
Minor goal: write a loop that gives proper names to the month columns, so that the code can be shortened a bit (see the sketch after the first code block below).
Any help is appreciated!! Thanks!!
This is the code that returns an empty list and raises the error stating 'df' is not defined in the final step:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

#driver
driver = webdriver.Chrome(executable_path="path/chromedriver.exe")
url = 'https://www.hko.gov.hk/en/cis/awsDailyElement.htm?stn=WB8&ele=PREV_DIR&y=2022'
driver.get(url)

#creating empty list
r = 1
templist = []

#start scraping and parsing
while(1):
    try:
        day = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[1]").text
        Jan = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[2]").text
        Feb = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[3]").text
        Mar = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[4]").text
        Apr = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[5]").text
        May = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[6]").text
        Jun = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[7]").text
        Jul = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[8]").text
        Aug = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[9]").text
        Sep = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[10]").text
        Oct = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[11]").text
        Nov = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[12]").text
        Dec = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td[13]").text
        Table_dict = {' ': day,
                      'Jan': Jan,
                      'Feb': Feb,
                      'Mar': Mar,
                      'Apr': Apr,
                      'May': May,
                      'Jun': Jun,
                      'Jul': Jul,
                      'Aug': Aug,
                      'Sep': Sep,
                      'Oct': Oct,
                      'Nov': Nov,
                      'Dec': Dec}
        print(Feb)
        templist.append(Table_dict)
        df = pd.DataFrame(templist)
        r += 1
        print(df)
    # if there are no more table data to scrape
    except NoSuchElementException:
        break

# saving the dataframe to a csv
df.to_csv('D:/FYP/WIND_Prev_dir/WB8.csv')
driver.close()
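As a side note on the minor goal: the thirteen near-identical find_element calls inside the try block could be collapsed into a loop over the month columns. This is only a minimal sketch, assuming the same XPath layout and the driver, r, and templist variables from the code above; it would replace the body of the try block:

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
base = "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr[{}]/td[{}]"

# the first cell of the row holds the day; the next twelve hold the months
Table_dict = {' ': driver.find_element(By.XPATH, base.format(r, 1)).text}
for col, month in enumerate(months, start=2):
    Table_dict[month] = driver.find_element(By.XPATH, base.format(r, col)).text
templist.append(Table_dict)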
This is the code I used for displaying the scraped table, but I am not sure how to turn it into a CSV (a sketch of one possibility follows the block):
# Obtain the number of rows in body
rows = len(driver.find_elements(By.XPATH,
    "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr"))

# Obtain the number of columns in table
cols = len(driver.find_elements(By.XPATH,
    "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr[1]/th"))

# Print rows and columns
print(rows)
print(cols)

# Printing the table headers
print(" " + "Jan" + " " + "Feb" + " " + "Mar" + " " + "Apr" + " " + "May" + " " + "Jun" + " " + "Jul" + " " + "Aug" + " " + "Sep" + " " + "Oct" + " " + "Nov" + " " + "Dec")

# Printing the data of the table
for r in range(2, rows + 1):
    for p in range(1, cols + 1):
        # obtaining the text from each column of the table
        value = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td["+str(p)+"]").text
        if value == " ":
            print("N.A.", end=' ')
        else:
            print(value, end=' ')
        #df.append(value)
    print()
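For reference, a minimal sketch of how this display loop could collect the values into a list of rows and then save them. It assumes the same XPaths, that rows and cols have been computed as above, and that the table has the day column plus twelve month columns:

data = []
for r in range(2, rows + 1):
    row_values = []
    for p in range(1, cols + 1):
        value = driver.find_element(By.XPATH,
            "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/table/tr["+str(r)+"]/td["+str(p)+"]").text
        # substitute a placeholder for blank cells, as in the print version
        row_values.append(value if value.strip() else "N.A.")
    data.append(row_values)

# first column is the day of month, followed by the twelve months
df = pd.DataFrame(data, columns=[' ', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                                 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
df.to_csv('WB8.csv', index=False)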
Answer 1
Score: 0
Here's how you can scrape the whole table's data and save it into a CSV file.
import pandas as pd
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
import selenium.webdriver.support.expected_conditions as EC

driver = Chrome()
url = 'https://www.hko.gov.hk/en/cis/awsDailyElement.htm?stn=WB8&ele=PREV_DIR&y=2022'
driver.get(url)

table = WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'table[id="t1"] > tr')))
columns = [i.text for i in table[0].find_elements(By.TAG_NAME, 'th')]
table_dict = {col: [] for col in columns}

for row in table[1:]:
    for data in zip(columns, [i.text for i in row.find_elements(By.TAG_NAME, 'td')]):
        table_dict[data[0]].append(data[1])

driver.close()

df = pd.DataFrame(table_dict)
# saving the dataframe to a csv
df.to_csv('data.csv', index=False)
A few things to note:
- After hitting the URL, we need to wait for the table to become visible on the page; we then collect all the table rows (tr), the first of which holds the table's column headers.
- The variable columns is a list that holds the table column names (the first row's data, table[0]).
- Next, we initialize a variable table_dict and use the columns as the keys of this dict, each with an empty list as its value.
- After that, we iterate over the remaining rows of the table, zip the list of columns with the row data, and iterate over the pairs to append each value to its column.
- Finally, we create a dataframe from table_dict and save it into a CSV file, data.csv.
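As a design note, once Selenium has rendered the page, pandas can often parse the table straight from the page source, which avoids the per-cell loop entirely. This is a sketch of that alternative, not the answer's method; it assumes the same table id t1 and that lxml or html5lib is installed for pandas.read_html:

import pandas as pd
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
import selenium.webdriver.support.expected_conditions as EC

driver = Chrome()
driver.get('https://www.hko.gov.hk/en/cis/awsDailyElement.htm?stn=WB8&ele=PREV_DIR&y=2022')
# wait until the table has rendered before grabbing the page source
WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'table#t1')))
# read_html parses every <table> in the HTML; attrs={'id': 't1'} narrows it to this one
df = pd.read_html(driver.page_source, attrs={'id': 't1'})[0]
driver.close()
df.to_csv('data.csv', index=False)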