Python: Scrape href from td - can't get it to work correctly
Question
I'm very new to python and have gone through previous questions on SO but could not solve it. Here is my code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urlparse
url = "https://en.wikipedia.org/wiki/List_of_curling_clubs_in_the_United_States"
data = requests.get(url).text
soup = BeautifulSoup(data, 'lxml')
table = soup.find('table', class_='wikitable sortable')
df = pd.DataFrame(columns=['Club Name', 'City/Town', 'State', 'Type', 'Sheets', 'Memberships', 'Year Founded', 'Notes', 'URL'])
for row in table.tbody.find_all('tr'):
    # Find all data for each column
    columns = row.find_all('td')
    if(columns != []):
        club_name = columns[0].text.strip()
        city = columns[1].text.strip()
        state = columns[2].text.strip()
        type_arena = columns[3].text.strip()
        sheets = columns[4].text.strip()
        memberships = columns[5].text.strip()
        year_founded = columns[6].text.strip()
        notes = columns[7].text.strip()
        club_url = columns[0].find('a').get('href')
        df = df.append({'Club Name': club_name, 'City/Town': city, 'State': state, 'Type': type_arena, 'Sheets': sheets, 'Memberships': memberships, 'Year Founded': year_founded, 'Notes': notes, 'URL': club_url}, ignore_index=True)
My DF works except for the final column. It returns "None" when the first column obviously contains a link. How do I resolve this?
I've successfully scraped HREF from websites without tables, but am struggling to find a solution inside the table. Thanks in advance!
Answer 1
Score: 0
There is a typo in your script:
club_url = cols[0].find('a').get('href')
cols should be columns,
and you should check whether the element exists before applying a method:
club_url = columns[0].find('a').get('href') if columns[0].find('a') else None
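For context, here is a minimal sketch of how that guarded lookup could slot into the loop from the question. It is not part of the original answer: it also swaps the row-by-row df.append calls (deprecated and removed in recent pandas releases) for collecting plain dicts and building the DataFrame once at the end, which is an assumed alternative pattern rather than the asker's or answerer's exact code.

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_curling_clubs_in_the_United_States"
soup = BeautifulSoup(requests.get(url).text, 'lxml')
table = soup.find('table', class_='wikitable sortable')

rows = []
for row in table.tbody.find_all('tr'):
    columns = row.find_all('td')
    if not columns:
        continue  # skip the header row, which has no <td> cells
    link = columns[0].find('a')  # may be None when the first cell has no link
    rows.append({
        'Club Name': columns[0].text.strip(),
        'City/Town': columns[1].text.strip(),
        'State': columns[2].text.strip(),
        'Type': columns[3].text.strip(),
        'Sheets': columns[4].text.strip(),
        'Memberships': columns[5].text.strip(),
        'Year Founded': columns[6].text.strip(),
        'Notes': columns[7].text.strip(),
        'URL': link.get('href') if link else None,
    })

df = pd.DataFrame(rows, columns=['Club Name', 'City/Town', 'State', 'Type', 'Sheets', 'Memberships', 'Year Founded', 'Notes', 'URL'])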