Python: Scrape href from td - can't get it to work correctly

Question

I'm very new to Python and have gone through previous questions on SO but could not solve this. Here is my code:

import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urlparse

url = "https://en.wikipedia.org/wiki/List_of_curling_clubs_in_the_United_States"
data = requests.get(url).text

soup = BeautifulSoup(data, 'lxml')
table = soup.find('table', class_='wikitable sortable')

df = pd.DataFrame(columns=['Club Name', 'City/Town', 'State', 'Type', 'Sheets', 'Memberships', 'Year Founded', 'Notes', 'URL'])

for row in table.tbody.find_all('tr'):    
    # Find all data for each column
    columns = row.find_all('td')
    
    if(columns != []):
        club_name = columns[0].text.strip()
        city = columns[1].text.strip()
        state = columns[2].text.strip()
        type_arena = columns[3].text.strip()
        sheets = columns[4].text.strip()
        memberships = columns[5].text.strip()
        year_founded = columns[6].text.strip()
        notes = columns[7].text.strip()
        club_url = columns[0].find('a').get('href')
        
        df = df.append({'Club Name': club_name,  'City/Town': city, 'State': state, 'Type': type_arena, 'Sheets': sheets, 'Memberships': memberships, 'Year Founded': year_founded, 'Notes': notes, 'URL': club_url}, ignore_index=True)

My DataFrame works except for the final column. It returns "None" even though the first column clearly contains a link. How do I resolve this?

I've successfully scraped href values from websites without tables, but am struggling to find a solution inside a table. Thanks in advance!

Answer 1

Score: 0

There is a typo in your script:

club_url = cols[0].find('a').get('href')

cols should be columns, and you should check that the element exists before calling a method on it:

club_url = columns[0].find('a').get('href') if columns[0].find('a') else None
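
For reference, here is a minimal sketch of the whole loop with that guarded lookup in place. It keeps the asker's assumption that every data row has eight td cells, and it assumes a recent pandas release (2.x, where DataFrame.append has been removed), so rows are collected in a plain list and converted to a DataFrame once at the end:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_curling_clubs_in_the_United_States"
soup = BeautifulSoup(requests.get(url).text, 'lxml')
table = soup.find('table', class_='wikitable sortable')

rows = []
for row in table.tbody.find_all('tr'):
    columns = row.find_all('td')
    if not columns:
        continue  # the header row contains only th cells, so skip it

    # Guard the lookup: some first cells contain no <a> tag at all.
    link = columns[0].find('a')
    club_url = link.get('href') if link else None

    rows.append({
        'Club Name': columns[0].text.strip(),
        'City/Town': columns[1].text.strip(),
        'State': columns[2].text.strip(),
        'Type': columns[3].text.strip(),
        'Sheets': columns[4].text.strip(),
        'Memberships': columns[5].text.strip(),
        'Year Founded': columns[6].text.strip(),
        'Notes': columns[7].text.strip(),
        'URL': club_url,
    })

df = pd.DataFrame(rows)

Looking up the a tag once and reusing the result also avoids calling find twice per row; it is the same idea as the one-liner above, just spread over two lines.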
