Webscraping code failing on similar pages
Question
Title
Code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
from random import randint

theurl = "http://ufcstats.com/event-details/7abe471b61725980"
r = requests.get(theurl)
soup = BeautifulSoup(r.text, 'html.parser')
Name = soup.find(class_='b-fight-details__table-body')
Name = Name.text.strip()
links = soup.find_all('a')
# print(links)
Fighter = []
for link in links:
    href = link['href']
    if href:
        print(href)
        if r'fighter-details' in href:
            Fighter.append(href)
print(Fighter)
Works perfectly for old events:
http://ufcstats.com/event-details/6f812143641ceff8
But not for a new event:
http://ufcstats.com/event-details/7abe471b61725980
I get the following error:
return self.attrs[key]
~~~~~~~~~~^^^^^
KeyError: 'href'
But they're the same kind of page? Why does ['href'] give me an error when it's clearly there in the 'a' tag? I tried to strip out the text from the a tag, but that doesn't seem to work either.
Answer 1
Score: 0
In the table there are links without the href= attribute, so link['href'] raises KeyError and your script fails. One way to fix it is to use .get() with a default value, which works like dict.get():
import requests
from bs4 import BeautifulSoup

theurl = "http://ufcstats.com/event-details/7abe471b61725980"
soup = BeautifulSoup(requests.get(theurl).text, 'html.parser')
Name = soup.find(class_='b-fight-details__table-body')
links = Name.find_all('a')
Fighter = []
for link in links:
    href = link.get('href', '')  # <-- get href= attribute or empty string if the attribute doesn't exist
    if href:
        if 'fighter-details' in href:
            Fighter.append(href)
print(*Fighter, sep='\n')
Prints:
http://ufcstats.com/fighter-details/853eb0dd5c0e2149
http://ufcstats.com/fighter-details/6d35bf94f7d30241
http://ufcstats.com/fighter-details/7aa3d6964eff4877
http://ufcstats.com/fighter-details/361d49960a196976
http://ufcstats.com/fighter-details/d1941565abf50b16
http://ufcstats.com/fighter-details/7026eca45f65377b
...and so on.
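Another option, if you'd rather skip the anchors without href= entirely, is to pass href=True to find_all(), which only matches tags that have that attribute. The sketch below runs on a small inline HTML snippet instead of the live page so it can be tried offline. The snippet itself is made up for illustration (the "b-flag" class and the fight-details URL are assumptions, not copied from the site).

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the event page: the second <a> has no href attribute.
html = """
<table class="b-fight-details__table-body">
  <tr>
    <td><a href="http://ufcstats.com/fighter-details/853eb0dd5c0e2149">Fighter A</a></td>
    <td><a class="b-flag">next</a></td>
    <td><a href="http://ufcstats.com/fight-details/abc123">View fight</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute,
# so indexing with link["href"] cannot raise KeyError here.
fighters = [
    link["href"]
    for link in soup.find_all("a", href=True)
    if "fighter-details" in link["href"]
]
print(*fighters, sep="\n")
```

On the real page you would build soup from requests.get(theurl).text as above and keep the rest unchanged.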