Problem with getting the xlsx tables from a website with BeautifulSoup
Question
I'm new to Python and trying to find the best way to get the .xlsx tables for the years 2010 to 2022 from the https://www.sba.gov/document/report-sba-disaster-loan-data website. Then I want to combine those tables into a single dataframe with a "year" column added indicating the fiscal year of the data. Thank you in advance!
I've tried to get all the a href links, but it gave me the TypeError below:
import requests
from bs4 import BeautifulSoup
web_url = "https://www.sba.gov/document/report-sba-disaster-loan-data"
html = requests.get(web_url).content
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find('div', {'class': 'jHSEzIJePQkATFwBbUD8j'})
for table in tables:
    link = cols[1].find('a').get('href')
    print(link)
TypeError: 'NoneType' object is not iterable
Answer 1
Score: 2
The file URLs are loaded from an external address via JavaScript. To get the .xlsx URLs you can use this example:
import re
import requests

url = 'https://www.sba.gov/document/report-sba-disaster-loan-data'
api_url = 'https://www.sba.gov/api/content/{node_id}.json'

html_doc = requests.get(url).text
# the page embeds a node ID in its HTML; the JSON content API is queried with it
node_id = re.search(r'nodeId = "(\d+)"', html_doc).group(1)
data = requests.get(api_url.format(node_id=node_id)).json()

for f in data['files']:
    print(f['effectiveDate'], 'https://www.sba.gov' + f['fileUrl'])
Prints:
2022-02-11 https://www.sba.gov/sites/default/files/2022-07/SBA_Disaster_Loan_Data_FY21.xlsx
2021-03-15 https://www.sba.gov/sites/default/files/2022-07/SBA_Disaster_Loan_Data_FY20.xlsx
2020-04-10 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY19.xlsx
2019-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY18.xlsx
2018-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY17_Update_033118.xlsx
2017-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY16.xlsx
2016-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY15.xlsx
2015-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY14.xlsx
2014-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY13.xlsx
2014-09-23 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_Superstorm_Sandy.xlsx
2013-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY12.xlsx
2012-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY11.xlsx
2011-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY10.xlsx
2010-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY09.xlsx
2009-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY08.xlsx
2008-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY07.xlsx
2007-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY06.xlsx
2006-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY05.xlsx
2005-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY04.xlsx
2004-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY03.xls
2003-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY02.xls
2002-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY01.xls
2001-10-01 https://www.sba.gov/sites/default/files/2021-05/SBA_Disaster_Loan_Data_FY00.xlsx
To get a pandas dataframe you can store the dates and URLs in a list and read each file with e.g. the pandas.read_excel function.
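For example, a minimal sketch of that last step, limited to fiscal years 2010 to 2022, might look like the code below. It assumes the fiscal year can be parsed from the "FY.." part of each file name (files without that marker, such as the Superstorm Sandy workbook, are skipped), that pandas and openpyxl are installed, and that the first sheet of each workbook can be read as-is; the real files may need sheet_name or skiprows adjustments.

import re
import requests
import pandas as pd

url = 'https://www.sba.gov/document/report-sba-disaster-loan-data'
api_url = 'https://www.sba.gov/api/content/{node_id}.json'

html_doc = requests.get(url).text
node_id = re.search(r'nodeId = "(\d+)"', html_doc).group(1)
data = requests.get(api_url.format(node_id=node_id)).json()

frames = []
for f in data['files']:
    file_url = 'https://www.sba.gov' + f['fileUrl']
    # assumption: the fiscal year is encoded as "FY21", "FY09", etc. in the file name
    m = re.search(r'FY(\d{2})', file_url)
    if not m:
        continue  # e.g. the Superstorm Sandy workbook has no FY marker
    year = 2000 + int(m.group(1))
    if not 2010 <= year <= 2022:
        continue
    # pandas.read_excel can read directly from a URL; first sheet by default
    df = pd.read_excel(file_url)
    df['year'] = year
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
print(combined.shape)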