
How to convert data into text while web scraping?

Question

I want to scrape a website and extract the data for football players into an Excel file using Python.

The code runs and information is extracted, but as HTML elements rather than text. To "convert" it into text I used the .text attribute, which results in an error message:

"AttributeError: 'NoneType' object has no attribute 'text'"

Could anyone please help? My goal is to have the name, club, minutes, etc. in an Excel file.

from bs4 import BeautifulSoup
import requests
import pandas as pd

def get_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")

    players = soup.find("table", class_="module-statistics statistics")

    data = []

    for player in players:
        item = {}

        item["Name"] = player.find("td", class_="person-name").text
        item["Verein"] = player.find("td", class_="team-name")
        item["Minuten"] = player.find("td", class_="person_stats-playing_minutes person_stats-playing_minutes-list")
        item["Ballkontakte pro Minute"] = player.find("td", class_="person_stats-balls_touched_per_minute")
        item["Summe Ballkontakte"] = player.find("td", class_="person_stats-balls_touched person_stats-balls_touched-list")

        data.append(item)

    return data

def export_data(data):
    df = pd.DataFrame(data)
    df.to_excel("Spieler.xlsx")

if __name__ == "__main__":
    data = get_data("https://sportdaten.spiegel.de/fussball/bundesliga/ma9417803/fc-augsburg_eintracht-frankfurt/spielstatistik-ballkontakte/")
    export_data(data)
    print("done")

Answer 1

Score: 1

The main issue is to select more specifically and to exclude the row with the `<th>` cells, because there you will not find any `<td>`:

from bs4 import BeautifulSoup
import requests

def get_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")

    # select only the table rows that actually contain <td> cells
    players = soup.select('.module-statistics.statistics tr:has(td)')

    data = []

    for player in players:
        data.append({
            'Name': player.find("td", class_="person-name").text,
            'Verein': player.find("td", class_="team-name").text,
            'Minuten': player.find("td", class_="person_stats-playing_minutes person_stats-playing_minutes-list").text,
            'Ballkontakte pro Minute': player.find("td", class_="person_stats-balls_touched_per_minute").text,
            'Summe Ballkontakte': player.find("td", class_="person_stats-balls_touched person_stats-balls_touched-list").text
        })

    return data

An alternative, and a best practice when scraping tables, is to use `pandas.read_html()`:

import pandas as pd
pd.read_html('https://sportdaten.spiegel.de/fussball/bundesliga/ma9417803/fc-augsburg_eintracht-frankfurt/spielstatistik-ballkontakte/')[0]

Based on @Timeless' comment, you can add additional parameters to get a proper result:

pd.read_html('https://sportdaten.spiegel.de/fussball/bundesliga/ma9417803/fc-augsburg_eintracht-frankfurt/spielstatistik-ballkontakte/', decimal=',', thousands='.')[0]
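
Since the goal is an Excel file, the resulting DataFrame can be written out the same way as in the question. A minimal sketch combining read_html() with the to_excel() call (the filename "Spieler.xlsx" is taken from the question):

import pandas as pd

# read the first table on the page, then export it to Excel as in the question
df = pd.read_html(
    "https://sportdaten.spiegel.de/fussball/bundesliga/ma9417803/fc-augsburg_eintracht-frankfurt/spielstatistik-ballkontakte/",
    decimal=",", thousands=".",
)[0]
df.to_excel("Spieler.xlsx")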

Answer 2

Score: 0

The first `player` element was most likely empty when you were iterating over the list of elements. Here is a quick fix to handle that:

from bs4 import BeautifulSoup
import requests

def get_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")

    players = soup.find("table", class_="module-statistics statistics")

    data = []

    for player in players:
        item = {}

        name = player.find("td", class_="person-name")
        team_name = player.find("td", class_="team-name")

        item["Name"] = name.text.strip() if name else ""
        item["Verein"] = team_name.text.strip() if team_name else ""

        # ... and so on

        data.append(item)

    return data

if __name__ == "__main__":
    data = get_data(
        "https://sportdaten.spiegel.de/fussball/bundesliga/ma9417803/fc-augsburg_eintracht-frankfurt/spielstatistik-ballkontakte/"
    )
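
To get from there to the Excel file asked for in the question, the returned list can be passed to the question's own export_data() helper. A short usage sketch, assuming pandas is installed:

import pandas as pd

def export_data(data):
    # same helper as in the question: build a DataFrame and write it to Excel
    df = pd.DataFrame(data)
    df.to_excel("Spieler.xlsx")

export_data(data)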


Answer 3

Score: 0

You have to use `find_next` instead of `find`. The code below should work fine:

from bs4 import BeautifulSoup
import requests
import pandas as pd

def get_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    players = soup.find("table", class_="module-statistics statistics")
    data = []
    for player in players:
        item = {}
        item["Name"] = player.find_next("td", class_="person-name").text
        item["Verein"] = player.find_next("td", class_="team-name").text
        item["Minuten"] = player.find_next("td",
                                           class_="person_stats-playing_minutes person_stats-playing_minutes-list").text
        item["Ballkontakte pro Minute"] = player.find_next("td",
                                                           class_="person_stats-balls_touched_per_minute").text

        item["Summe Ballkontakte"] = player.find_next("td",
                                                      class_="person_stats-balls_touched "
                                                             "person_stats-balls_touched-list").text
        data.append(item)
    return data

def export_data(data):
    df = pd.DataFrame(data)
    df.to_excel("Spieler.xlsx")

if __name__ == "__main__":
    data = get_data(
        "https://sportdaten.spiegel.de/fussball/bundesliga/ma9417803/fc-augsburg_eintracht-frankfurt/spielstatistik-ballkontakte/")
    export_data(data)
    print("done")

Answer 4

Score: 0

Try using `.content` instead of `.text` for BeautifulSoup. It will be `soup = BeautifulSoup(response.content, "lxml")`.
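
For context, a minimal sketch of that change, using the question's URL and parser:

import requests
from bs4 import BeautifulSoup

# feed the raw response bytes (.content) to BeautifulSoup instead of the decoded .text string
response = requests.get("https://sportdaten.spiegel.de/fussball/bundesliga/ma9417803/fc-augsburg_eintracht-frankfurt/spielstatistik-ballkontakte/")
soup = BeautifulSoup(response.content, "lxml")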

