May 30, 2023 04:58:54

Webscraping code failing on similar pages

Question

Title

Code:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
from random import randint


theurl = "http://ufcstats.com/event-details/7abe471b61725980"

r=requests.get(theurl)

soup=BeautifulSoup(r.text,'html.parser')
Name=soup.find(class_='b-fight-details__table-body')
Name=Name.text.strip()
links=soup.find_all('a')

# print(links)
Fighter = []
for link in links:
    href=link['href']
    if href:
        print(href)
        if r'fighter-details' in href:
            Fighter.append(href)
            print(Fighter)

Works perfectly for old events:

http://ufcstats.com/event-details/6f812143641ceff8

But not for a new event:

http://ufcstats.com/event-details/7abe471b61725980

I get the following error:

    return self.attrs[key]
           ~~~~~~~~~~^^^^^
KeyError: 'href'

But they're the same kind of page. Why does ['href'] raise an error when it's clearly there in the 'a' tag? I tried stripping the text out of the a tag, but that doesn't work either.

Answer 1

Score: 0

Some links in the table have no href= attribute, so link['href'] raises KeyError and your script fails. One way to fix it is to use the tag's .get() method, which works like dict.get(), with a default value:

import requests
from bs4 import BeautifulSoup

theurl = "http://ufcstats.com/event-details/7abe471b61725980"
soup = BeautifulSoup(requests.get(theurl).text, 'html.parser')

Name = soup.find(class_='b-fight-details__table-body')
links = Name.find_all('a')

Fighter = []
for link in links:
    href = link.get('href', '')  # <-- get href= attribute, or empty string if the attribute doesn't exist
    if href:
        if 'fighter-details' in href:
            Fighter.append(href)

print(*Fighter, sep='\n')

Prints:

http://ufcstats.com/fighter-details/853eb0dd5c0e2149
http://ufcstats.com/fighter-details/6d35bf94f7d30241
http://ufcstats.com/fighter-details/7aa3d6964eff4877
http://ufcstats.com/fighter-details/361d49960a196976
http://ufcstats.com/fighter-details/d1941565abf50b16
http://ufcstats.com/fighter-details/7026eca45f65377b

...and so on.
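
If you want to see the failure mode without hitting the site, here is a minimal offline sketch. It is not the code above: it swaps BeautifulSoup for the standard library's html.parser so it runs with no extra packages, and the HTML snippet is made up for illustration (the class name b-flag and the link text are assumptions, not taken from the real page). The traceback in the question ends in self.attrs[key], a plain dict lookup, so the sketch keeps each tag's attributes in a dict and compares strict indexing with .get().

```python
from html.parser import HTMLParser

# Hypothetical markup: the second <a> has no href, like the links that broke the script.
SAMPLE_HTML = """
<table>
  <tr>
    <td><a href="http://ufcstats.com/fighter-details/853eb0dd5c0e2149">Fighter A</a></td>
    <td><a class="b-flag">next</a></td>
    <td><a href="http://ufcstats.com/event-details/7abe471b61725980">Event</a></td>
  </tr>
</table>
"""


class AnchorCollector(HTMLParser):
    """Collect the attributes of every <a> tag as a plain dict."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            self.anchors.append(dict(attrs))


def collect_anchors(html):
    """Parse the HTML and return one attribute dict per <a> tag."""
    parser = AnchorCollector()
    parser.feed(html)
    return parser.anchors


def strict_lookup_fails(html):
    """Return True if attrs['href'] raises KeyError for at least one anchor."""
    try:
        for attrs in collect_anchors(html):
            attrs["href"]  # same kind of lookup as link['href']
    except KeyError:
        return True
    return False


def fighter_links(html):
    """Return hrefs containing 'fighter-details', skipping anchors without href."""
    links = []
    for attrs in collect_anchors(html):
        href = attrs.get("href") or ""  # missing or valueless href becomes ""
        if "fighter-details" in href:
            links.append(href)
    return links


if __name__ == "__main__":
    print(strict_lookup_fails(SAMPLE_HTML))
    print(*fighter_links(SAMPLE_HTML), sep="\n")
```

With BeautifulSoup itself you can also filter at query time: find_all('a', href=True) only returns anchors that actually have the attribute, so the loop never sees a tag without href.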
