Converting data into text is an important task when web scraping.


How to convert data into text while web scraping?

Question


I want to scrape a website and extract the data for football players into an Excel file using Python.

The code runs and the information is extracted, but not as text; it comes back as the HTML elements. To "convert" it into text I used the .text attribute, which results in the error message

"AttributeError: 'NoneType' object has no attribute 'text'"

Could anyone please help? My goal is to have the name, club, minutes, etc. in an Excel file. The full script is below.
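
For reference, a minimal standalone sketch, using made-up HTML rather than the actual page, of why this error appears: find() returns None when nothing matches, and None has no .text attribute:

```python
from bs4 import BeautifulSoup

# A header row with no <td class="person-name"> cell
soup = BeautifulSoup("<table><tr><th>Name</th></tr></table>", "lxml")

cell = soup.find("td", class_="person-name")
print(cell)  # None, because no matching <td> exists

# cell.text would raise:
# AttributeError: 'NoneType' object has no attribute 'text'
```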

```python
from bs4 import BeautifulSoup
import requests
import pandas as pd

def get_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")

    players = soup.find("table", class_="module-statistics statistics")

    data = []

    for player in players:
        item = {}

        item["Name"] = player.find("td", class_="person-name").text
        item["Verein"] = player.find("td", class_="team-name")
        item["Minuten"] = player.find("td", class_="person_stats-playing_minutes person_stats-playing_minutes-list")
        item["Ballkontakte pro Minute"] = player.find("td", class_="person_stats-balls_touched_per_minute")
        item["Summe Ballkontakte"] = player.find("td", class_="person_stats-balls_touched person_stats-balls_touched-list")

        data.append(item)

    return data

def export_data(data):
    df = pd.DataFrame(data)
    df.to_excel("Spieler.xlsx")

if __name__ == "__main__":
    data = get_data("https://sportdaten.spiegel.de/fussball/bundesliga/ma9417803/fc-augsburg_eintracht-frankfurt/spielstatistik-ballkontakte/")
    export_data(data)
    print("done")
```

Answer 1

Score: 1

The main issue is to select more specifically and to exclude the row that contains only `<th>` cells, because there you will not find any `<td>`:

```python
players = soup.select('.module-statistics.statistics tr:has(td)')

data = []

for player in players:
    data.append({
        'Name': player.find("td", class_="person-name").text,
        'Verein': player.find("td", class_="team-name").text,
        'Minuten': player.find("td", class_="person_stats-playing_minutes person_stats-playing_minutes-list").text,
        'Ballkontakte pro Minute': player.find("td", class_="person_stats-balls_touched_per_minute").text,
        'Summe Ballkontakte': player.find("td", class_="person_stats-balls_touched person_stats-balls_touched-list").text
    })

return data
```
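
As a side note, here is a small self-contained sketch, using made-up HTML rather than the live page, of what the `tr:has(td)` selector does: it keeps only rows that actually contain data cells and drops the header row:

```python
from bs4 import BeautifulSoup

html = """
<table class="module-statistics statistics">
  <tr><th>Name</th><th>Verein</th></tr>
  <tr><td class="person-name">Player A</td><td class="team-name">Club A</td></tr>
</table>
"""
soup = BeautifulSoup(html, "lxml")

# The header row contains only <th>, so it is excluded; the data row is kept
rows = soup.select('.module-statistics.statistics tr:has(td)')
print(len(rows))                                      # 1
print(rows[0].find("td", class_="person-name").text)  # Player A
```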

An alternative, and a best practice when scraping tables, is to use `pandas.read_html()`:

```python
import pandas as pd
pd.read_html('https://sportdaten.spiegel.de/fussball/bundesliga/ma9417803/fc-augsburg_eintracht-frankfurt/spielstatistik-ballkontakte/')[0]
```

Based on @Timeless' comment, you can add additional parameters to get a proper result:

```python
pd.read_html('https://sportdaten.spiegel.de/fussball/bundesliga/ma9417803/fc-augsburg_eintracht-frankfurt/spielstatistik-ballkontakte/', decimal=',', thousands='.')[0]
```
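
Since the question's goal is an Excel file, here is a minimal sketch, assuming the file name "Spieler.xlsx" from the question's own code, of writing the read_html() result straight to Excel:

```python
import pandas as pd

url = "https://sportdaten.spiegel.de/fussball/bundesliga/ma9417803/fc-augsburg_eintracht-frankfurt/spielstatistik-ballkontakte/"

# read_html() returns a list of DataFrames, one per <table> found on the page
df = pd.read_html(url, decimal=',', thousands='.')[0]

# Write the first table to an Excel file, without the DataFrame index column
df.to_excel("Spieler.xlsx", index=False)
```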

Answer 2

Score: 0

The first `player` element was most likely empty when you were iterating over the list of elements. To handle that, here is a quick fix:

```python
from bs4 import BeautifulSoup
import requests

def get_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")

    players = soup.find("table", class_="module-statistics statistics")

    data = []

    for player in players:
        item = {}

        name = player.find("td", class_="person-name")
        team_name = player.find("td", class_="team-name")

        item["Name"] = name.text.strip() if name else ""
        item["Verein"] = team_name.text.strip() if team_name else ""

        # ... and so on

        data.append(item)

    return data

if __name__ == "__main__":
    data = get_data(
        "https://sportdaten.spiegel.de/fussball/bundesliga/ma9417803/fc-augsburg_eintracht-frankfurt/spielstatistik-ballkontakte/"
    )
```
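
The remaining columns follow the same pattern; a small helper, the name safe_text is made up here, keeps that None check in one place:

```python
def safe_text(tag):
    # Return the stripped text of a BeautifulSoup tag, or "" if find() returned None
    return tag.text.strip() if tag else ""

# Usage inside the loop above, same idea for the other columns:
# item["Minuten"] = safe_text(player.find("td", class_="person_stats-playing_minutes person_stats-playing_minutes-list"))
```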


Answer 3

Score: 0

You have to use find_next instead of find. The code below should work fine:

```python
from bs4 import BeautifulSoup
import requests
import pandas as pd

def get_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    players = soup.find("table", class_="module-statistics statistics")
    data = []
    for player in players:
        item = {}
        item["Name"] = player.find_next("td", class_="person-name").text
        item["Verein"] = player.find_next("td", class_="team-name").text
        item["Minuten"] = player.find_next("td",
                                           class_="person_stats-playing_minutes person_stats-playing_minutes-list").text
        item["Ballkontakte pro Minute"] = player.find_next("td",
                                                           class_="person_stats-balls_touched_per_minute").text

        item["Summe Ballkontakte"] = player.find_next("td",
                                                      class_="person_stats-balls_touched "
                                                             "person_stats-balls_touched-list").text
        data.append(item)
    return data

def export_data(data):
    df = pd.DataFrame(data)
    df.to_excel("Spieler.xlsx")

if __name__ == "__main__":
    data = get_data(
        "https://sportdaten.spiegel.de/fussball/bundesliga/ma9417803/fc-augsburg_eintracht-frankfurt/spielstatistik-ballkontakte/")
    export_data(data)
    print("done")

Answer 4

Score: 0

Try using `.content` instead of `.text` for BeautifulSoup. It will be `soup = BeautifulSoup(response.content, "lxml")`.

