Problem with getting the xlsx tables from a website with BeautifulSoup

Question

I'm new to Python and trying to find the best way to get the xlsx tables for the years 2010 to 2022 from the https://www.sba.gov/document/report-sba-disaster-loan-data website. I then want to combine those tables into a single dataframe, with an added "year" column indicating the fiscal year of the data. Thank you in advance!

I tried to get all the a href links, but the code below raised a TypeError:

import requests
from bs4 import BeautifulSoup

web_url = "https://www.sba.gov/document/report-sba-disaster-loan-data"
html = requests.get(web_url).content
soup = BeautifulSoup(html, 'html.parser')

tables = soup.find('div', {'class': 'jHSEzIJePQkATFwBbUD8j'})
for table in tables:
    link = cols[1].find('a').get('href')
    print(link)

TypeError: 'NoneType' object is not iterable

Answer 1

Score: 2

The file URLs are not in the page's static HTML; they are loaded from an external address via JavaScript, which is why soup.find returns None. To get the .xlsx URLs you can use this example:

import re
import requests

url = 'https://www.sba.gov/document/report-sba-disaster-loan-data'
api_url = 'https://www.sba.gov/api/content/{node_id}.json'

# The page's HTML contains the node ID; extract it with a regular expression.
html_doc = requests.get(url).text
node_id = re.search(r'nodeId = "(\d+)"', html_doc).group(1)

# Request the JSON description of that node, which lists the attached files.
data = requests.get(api_url.format(node_id=node_id)).json()

for f in data['files']:
    print(f['effectiveDate'], 'https://www.sba.gov' + f['fileUrl'])

Prints:

2022-02-11 https://www.sba.gov/sites/default/files/2022-07/SBA_Disaster_Loan_Data_FY21.xlsx
2021-03-15 https://www.sba.gov/sites/default/files/2022-07/SBA_Disaster_Loan_Data_FY20.xlsx
2020-04-10 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY19.xlsx
2019-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY18.xlsx
2018-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY17_Update_033118.xlsx
2017-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY16.xlsx
2016-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY15.xlsx
2015-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY14.xlsx
2014-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY13.xlsx
2014-09-23 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_Superstorm_Sandy.xlsx
2013-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY12.xlsx
2012-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY11.xlsx
2011-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY10.xlsx
2010-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY09.xlsx
2009-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY08.xlsx
2008-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY07.xlsx
2007-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY06.xlsx
2006-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY05.xlsx
2005-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY04.xlsx
2004-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY03.xls
2003-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY02.xls
2002-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY01.xls
2001-10-01 https://www.sba.gov/sites/default/files/2021-05/SBA_Disaster_Loan_Data_FY00.xlsx
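
Since the question asks only for 2010 to 2022, one way to narrow the list is to read the fiscal year from the FYxx part of each file name. The following is a minimal sketch, not part of the original answer: the helper names fiscal_year and select_years are made up for illustration, and it assumes every two-digit year belongs to the 2000s, which holds for the listing above.

```python
import re

# Matches the two-digit fiscal year in names such as "..._FY17_Update_033118.xlsx".
FY_PATTERN = re.compile(r"_FY(\d{2})")


def fiscal_year(url):
    """Return the four-digit fiscal year encoded in a file URL, or None."""
    match = FY_PATTERN.search(url)
    if match is None:
        return None  # e.g. the Superstorm Sandy file carries no FY tag
    # Assumption: every file in the listing belongs to the 2000s.
    return 2000 + int(match.group(1))


def select_years(urls, first=2010, last=2022):
    """Map fiscal year -> URL for files whose year lies in [first, last]."""
    selected = {}
    for url in urls:
        year = fiscal_year(url)
        if year is not None and first <= year <= last:
            selected[year] = url
    return selected
```

Calling select_years on the URLs printed above would keep the FY10 through FY21 files and skip the Superstorm Sandy file, which has no fiscal-year tag.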

To build a pandas dataframe, you can store the dates and URLs in a list and read each file with, for example, the pandas.read_excel function.
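
A rough sketch of that last step might look like the code below. It is an illustration rather than the answerer's code: the function names combine_frames and load_frames are hypothetical, it assumes you already have a dict mapping each fiscal year to its file URL, and it assumes the default first sheet of each workbook holds the data (pass sheet_name, skiprows, or other pandas.read_excel options if it does not). Reading .xlsx files requires the openpyxl package.

```python
import pandas as pd


def combine_frames(frames_by_year):
    """Stack per-year DataFrames and tag each row with its fiscal year."""
    tagged = [
        frame.assign(year=year)  # assign() returns a copy with the new column
        for year, frame in sorted(frames_by_year.items())
    ]
    return pd.concat(tagged, ignore_index=True)


def load_frames(urls_by_year, **read_excel_kwargs):
    """Read each workbook straight from its URL with pandas.read_excel."""
    return {
        year: pd.read_excel(url, **read_excel_kwargs)
        for year, url in urls_by_year.items()
    }
```

With a hypothetical dict urls_by_year such as {2010: '...FY10.xlsx', 2011: '...FY11.xlsx'}, calling combine_frames(load_frames(urls_by_year)) would return one dataframe whose "year" column records the fiscal year of every row. Column names may differ between workbooks, in which case pandas.concat fills the missing values with NaN.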

Posted by huangapple on March 12, 2023, 07:36:41
Source: https://go.coder-hub.com/75710240.html