How to convert data into text while web scraping?
Question
I want to scrape a website with Python and extract the data for football players into an Excel file.
The code runs and the information is extracted, but as HTML elements rather than as text. To "convert" it into text I used the .text attribute, which results in the error message
"AttributeError: 'NoneType' object has no attribute 'text'"
Could anyone please help? My goal is to have the name, club, minutes, etc. in an Excel file.
from bs4 import BeautifulSoup
import requests
import pandas as pd


def get_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    players = soup.find("table", class_="module-statistics statistics")
    data = []
    for player in players:
        item = {}
        item["Name"] = player.find("td", class_="person-name").text
        item["Verein"] = player.find("td", class_="team-name")
        item["Minuten"] = player.find("td", class_="person_stats-playing_minutes person_stats-playing_minutes-list")
        item["Ballkontakte pro Minute"] = player.find("td", class_="person_stats-balls_touched_per_minute")
        item["Summe Ballkontakte"] = player.find("td", class_="person_stats-balls_touched person_stats-balls_touched-list")
        data.append(item)
    return data


def export_data(data):
    df = pd.DataFrame(data)
    df.to_excel("Spieler.xlsx")


if __name__ == "__main__":
    data = get_data("https://sportdaten.spiegel.de/fussball/bundesliga/ma9417803/fc-augsburg_eintracht-frankfurt/spielstatistik-ballkontakte/")
    export_data(data)
    print("done")
Answer 1
Score: 1
The main issue is to select more specifically and to exclude the row that contains only <th> elements, because there you will not find any <td>:
players = soup.select('.module-statistics.statistics tr:has(td)')

data = []
for player in players:
    data.append({
        'Name': player.find("td", class_="person-name").text,
        'Verein': player.find("td", class_="team-name").text,
        'Minuten': player.find("td", class_="person_stats-playing_minutes person_stats-playing_minutes-list").text,
        'Ballkontakte pro Minute': player.find("td", class_="person_stats-balls_touched_per_minute").text,
        'Summe Ballkontakte': player.find("td", class_="person_stats-balls_touched person_stats-balls_touched-list").text
    })

return data
An alternative, and generally a best practice when scraping tables, is to use pandas.read_html():
import pandas as pd
pd.read_html('https://sportdaten.spiegel.de/fussball/bundesliga/ma9417803/fc-augsburg_eintracht-frankfurt/spielstatistik-ballkontakte/')[0]
Based on @Timeless's comment, you could add additional parameters to get a proper result:
pd.read_html('https://sportdaten.spiegel.de/fussball/bundesliga/ma9417803/fc-augsburg_eintracht-frankfurt/spielstatistik-ballkontakte/', decimal=',', thousands='.')[0]
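If the end goal is still an Excel file, the DataFrame returned by read_html can be written out directly with to_excel; a minimal sketch (the file name Spieler.xlsx is simply taken over from the question, and index 0 is assumed to be the statistics table):

```python
import pandas as pd

url = "https://sportdaten.spiegel.de/fussball/bundesliga/ma9417803/fc-augsburg_eintracht-frankfurt/spielstatistik-ballkontakte/"

# read_html returns a list of DataFrames; the first one is assumed to be the statistics table
df = pd.read_html(url, decimal=',', thousands='.')[0]
df.to_excel("Spieler.xlsx", index=False)
```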
Answer 2
Score: 0
The first `player` element was most likely empty when you were iterating over the list of elements. To handle that, here is a quick fix.
```python
from bs4 import BeautifulSoup
import requests


def get_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    players = soup.find("table", class_="module-statistics statistics")

    data = []
    for player in players:
        item = {}
        name = player.find("td", class_="person-name")
        team_name = player.find("td", class_="team-name")
        item["Name"] = name.text.strip() if name else ""
        item["Verein"] = team_name.text.strip() if team_name else ""
        # ... and so on
        data.append(item)
    return data


if __name__ == "__main__":
    data = get_data(
        "https://sportdaten.spiegel.de/fussball/bundesliga/ma9417803/fc-augsburg_eintracht-frankfurt/spielstatistik-ballkontakte/"
    )
```
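To avoid repeating the `if ... else ""` guard for every remaining column, the lookup can be wrapped in a small helper function; `safe_text` below is just an illustrative name, not part of the original code:

```python
def safe_text(parent, tag, css_class):
    """Return the stripped text of the first matching element, or "" if none is found."""
    cell = parent.find(tag, class_=css_class)
    return cell.text.strip() if cell else ""


# Used inside the loop above, e.g.:
# item["Name"] = safe_text(player, "td", "person-name")
# item["Minuten"] = safe_text(player, "td", "person_stats-playing_minutes person_stats-playing_minutes-list")
```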
Answer 3
Score: 0
You have to use find_next instead of find. While find only searches inside the current element, find_next continues with the elements that come after it in the document, so it still returns a matching <td> when the current row contains none. The code below should work fine:
from bs4 import BeautifulSoup
import requests
import pandas as pd


def get_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    players = soup.find("table", class_="module-statistics statistics")

    data = []
    for player in players:
        item = {}
        item["Name"] = player.find_next("td", class_="person-name").text
        item["Verein"] = player.find_next("td", class_="team-name").text
        item["Minuten"] = player.find_next("td", class_="person_stats-playing_minutes person_stats-playing_minutes-list").text
        item["Ballkontakte pro Minute"] = player.find_next("td", class_="person_stats-balls_touched_per_minute").text
        item["Summe Ballkontakte"] = player.find_next("td", class_="person_stats-balls_touched person_stats-balls_touched-list").text
        data.append(item)
    return data


def export_data(data):
    df = pd.DataFrame(data)
    df.to_excel("Spieler.xlsx")


if __name__ == "__main__":
    data = get_data(
        "https://sportdaten.spiegel.de/fussball/bundesliga/ma9417803/fc-augsburg_eintracht-frankfurt/spielstatistik-ballkontakte/")
    export_data(data)
    print("done")
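A short, self-contained sketch of that difference (tiny stand-in table, not the real page):

```python
from bs4 import BeautifulSoup

html = "<table><tr id='a'><th>Name</th></tr><tr id='b'><td class='person-name'>Player A</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

header_row = soup.find("tr", id="a")
print(header_row.find("td", class_="person-name"))       # None: this row has no such <td>
print(header_row.find_next("td", class_="person-name"))  # the <td> from the following row
```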
Answer 4
Score: 0
Try using .content instead of .text for BeautifulSoup. It will be soup = BeautifulSoup(response.content, "lxml").