Webscraping code failing on similar pages.


Question

Title

Code:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
from random import randint


theurl = "http://ufcstats.com/event-details/7abe471b61725980"

r=requests.get(theurl)

soup=BeautifulSoup(r.text,'html.parser')
Name=soup.find(class_='b-fight-details__table-body')
Name=Name.text.strip()
links=soup.find_all('a')

# print(links)
Fighter = []
for link in links:
    href=link['href']
    if href:
        print(href)
        if r'fighter-details' in href:
            Fighter.append(href)
            print(Fighter)

Works perfectly for old events:

http://ufcstats.com/event-details/6f812143641ceff8

But it fails for a new event:

http://ufcstats.com/event-details/7abe471b61725980

I get the following error:

    return self.attrs[key]
           ~~~~~~~~~~^^^^^
KeyError: 'href'

But aren't they the same kind of page? Why does link['href'] raise an error when the href is clearly there in the 'a' tag? I also tried stripping the text out of the 'a' tag, but that doesn't work either.

Answer 1

Score: 0

In the table there are links without the href= attribute, so link['href'] raises KeyError and your script fails. One way to fix it is to use the tag's .get() method, which behaves like dict.get(), with a default value:

import requests
from bs4 import BeautifulSoup

theurl = "http://ufcstats.com/event-details/7abe471b61725980"
soup = BeautifulSoup(requests.get(theurl).text, 'html.parser')

Name = soup.find(class_='b-fight-details__table-body')
links = Name.find_all('a')

Fighter = []
for link in links:
    href = link.get('href', '')  # <-- get href= attribute, or an empty string if the attribute doesn't exist
    if href:
        if 'fighter-details' in href:
            Fighter.append(href)

print(*Fighter, sep='\n')

Prints:

http://ufcstats.com/fighter-details/853eb0dd5c0e2149
http://ufcstats.com/fighter-details/6d35bf94f7d30241
http://ufcstats.com/fighter-details/7aa3d6964eff4877
http://ufcstats.com/fighter-details/361d49960a196976
http://ufcstats.com/fighter-details/d1941565abf50b16
http://ufcstats.com/fighter-details/7026eca45f65377b

...and so on.
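
A related option, sketched here as an illustration rather than as part of the answer above, is to ask BeautifulSoup for only those 'a' tags that have an href in the first place by passing href=True to find_all(). The markup below is made up for the example (it is not copied from ufcstats.com, and the example.com URLs are placeholders), so the snippet runs without a network request. It also reproduces the KeyError from the question on the tag that lacks href.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only: the second <a> has no href.
html = """
<table class="b-fight-details__table-body">
  <tr><td><a href="http://example.com/fighter-details/aaa">Fighter A</a></td></tr>
  <tr><td><a class="no-link">Upcoming bout</a></td></tr>
  <tr><td><a href="http://example.com/fight-details/bbb">Bout page</a></td></tr>
  <tr><td><a href="http://example.com/fighter-details/ccc">Fighter C</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Indexing a tag like a dict raises KeyError when the attribute is absent,
# which is the failure described in the question.
try:
    for a in soup.find_all("a"):
        a["href"]
    raised = False
except KeyError:
    raised = True

# href=True keeps only the <a> tags that carry an href attribute,
# so a["href"] is safe inside the comprehension.
fighter_links = [
    a["href"]
    for a in soup.find_all("a", href=True)
    if "fighter-details" in a["href"]
]

print(raised)
print(*fighter_links, sep="\n")
```

Whether to filter with href=True or to use .get() with a default is mostly a matter of taste; the filter avoids the extra "if href:" check, while .get() leaves the loop free to handle the tags without a link in some other way.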

Posted by huangapple on May 30, 2023, 04:58:54