英文:
Web Scraping Yahoo Finance Python
问题
我正在尝试收集Yahoo财务数据,以便从利润表、资产负债表和现金流量报表中获取给定股票代码的DataFrame。(以下是提供的URL)
我使用了来自https://stackoverflow.com/questions/70090315/balance-sheet-from-using-yfinance-does-not-have-total-debt-like-on-yahoo-finan 的此函数,但它只对股票代码“AAPL”有效,对其他股票代码无效。
我想要一个更强大的网络爬取工具,可以适用于任何股票代码,并能在不做太多修改的情况下获取这三份报告。
我计划为每个报告编写单独的函数。
import pandas as pd
import requests
from datetime import datetime
from bs4 import BeautifulSoup
def retrieve_balance_sheet(ticker):
ticker = ticker.upper()
url = f"https://finance.yahoo.com/quote/{ticker}/balance-sheet?p={ticker}"
header = {'Connection': 'keep-alive',
'Expires': '-1',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) \
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
}
r = requests.get(url, headers=header)
html = r.text
soup = BeautifulSoup(html, "html.parser")
div = soup.find_all('div', attrs={'class': 'D(tbhg)'})
if len(div) < 1:
print("Fail to retrieve table column header")
exit(0)
col = []
for h in div[0].find_all('span'):
text = h.get_text()
if text != "Breakdown":
col.append( datetime.strptime(text, "%m/%d/%Y") )
df = pd.DataFrame(columns=col)
for div in soup.find_all('div', attrs={'data-test': 'fin-row'}):
i = 0
idx = ""
val = []
for h in div.find_all('span') :
if i == 0:
idx = h.get_text()
else:
num = int(h.get_text().replace(",", "")) * 1000
val.append( num )
i += 1
row = pd.DataFrame([val], columns=col, index=[idx] )
df = pd.concat([df, row])
return df
- https://finance.yahoo.com/quote/{ticker}/financials?p={ticker}
- https://finance.yahoo.com/quote/{ticker}/balance-sheet?p={ticker}
- https://finance.yahoo.com/quote/{ticker}/cash-flow?p={ticker}
英文:
I am trying to gather yahoo finance data for a given ticker symbol in a Dataframe from the income statement, balance sheet, and cash flow reports.(URL's provided below)
I used this function from https://stackoverflow.com/questions/70090315/balance-sheet-from-using-yfinance-does-not-have-total-debt-like-on-yahoo-finan but it only worked for ticker "AAPL" and nothing else.
2022-09-30 2021-09-30 2020-09-30 2019-09-30
Total Assets 352755000000 351002000000 323888000000 338516000000
Total Liabilities Net Minority Interest 302083000000 287912000000 258549000000 248028000000
Total Equity Gross Minority Interest 50672000000 63090000000 65339000000 90488000000
Total Capitalization 149631000000 172196000000 164006000000 182295000000
Common Stock Equity 50672000000 63090000000 65339000000 90488000000
Net Tangible Assets 50672000000 63090000000 65339000000 90488000000
Working Capital -18577000000 9355000000 38321000000 57101000000
Invested Capital 170741000000 187809000000 177775000000 198535000000
Tangible Book Value 50672000000 63090000000 65339000000 90488000000
Total Debt 120069000000 124719000000 112436000000 108047000000
Net Debt 96423000000 89779000000 74420000000 59203000000
Share Issued 15943425000 16426786000 16976763000 17772944000
Ordinary Shares Number 15943425000 16426786000 16976763000 17772944000
I would like a more robust web scraper that will work for any ticker and be able to get all 3 of those reports without much modification.
I plan to have a separate functions for each one
import pandas as pd
import requests
from datetime import datetime
from bs4 import BeautifulSoup
def retrieve_balance_sheet(ticker):
ticker = ticker.upper()
url = f"https://finance.yahoo.com/quote/{ticker}/balance-sheet?p={ticker}"
header = {'Connection': 'keep-alive',
'Expires': '-1',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) \
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
}
r = requests.get(url, headers=header)
html = r.text
soup = BeautifulSoup(html, "html.parser")
div = soup.find_all('div', attrs={'class': 'D(tbhg)'})
if len(div) < 1:
print("Fail to retrieve table column header")
exit(0)
col = []
for h in div[0].find_all('span'):
text = h.get_text()
if text != "Breakdown":
col.append( datetime.strptime(text, "%m/%d/%Y") )
df = pd.DataFrame(columns=col)
for div in soup.find_all('div', attrs={'data-test': 'fin-row'}):
i = 0
idx = ""
val = []
for h in div.find_all('span') :
if i == 0:
idx = h.get_text()
else:
num = int(h.get_text().replace(",", "")) * 1000
val.append( num )
i += 1
row = pd.DataFrame([val], columns=col, index=[idx] )
df = pd.concat([df, row])
return df
答案1
得分: 1
以下是翻译好的代码部分:
import requests
import pandas as pd
from bs4 import BeautifulSoup
urls = [
'https://finance.yahoo.com/quote/{ticker}/financials?p={ticker}',
'https://finance.yahoo.com/quote/{ticker}/balance-sheet?p={ticker}',
'https://finance.yahoo.com/quote/{ticker}/cash-flow?p={ticker}'
]
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0'}
def get_soup(url):
r = requests.get(url, headers=headers)
return BeautifulSoup(r.content, 'html.parser')
ticker = 'AMZN'
for url in urls:
soup = get_soup(url.format(ticker=ticker))
table = soup.select_one('.BdT')
all_data = []
for row in table.select('.D\(tbr\)'):
data = [cell.text for cell in row.select('.Ta\(c\), .Ta\(start\)')]
all_data.append(data)
df = pd.DataFrame(all_data[1:], columns=all_data[0])
print(df)
print()
如果您需要其他帮助,请随时提出。
英文:
To get the tables from the 3 URLs you can try:
import requests
import pandas as pd
from bs4 import BeautifulSoup
urls = [
'https://finance.yahoo.com/quote/{ticker}/financials?p={ticker}',
'https://finance.yahoo.com/quote/{ticker}/balance-sheet?p={ticker}',
'https://finance.yahoo.com/quote/{ticker}/cash-flow?p={ticker}'
]
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0'}
def get_soup(url):
r = requests.get(url, headers=headers)
return BeautifulSoup(r.content, 'html.parser')
ticker = 'AMZN'
for url in urls:
soup = get_soup(url.format(ticker=ticker))
table = soup.select_one('.BdT')
all_data = []
for row in table.select('.D\(tbr\)'):
data = [cell.text for cell in row.select('.Ta\(c\), .Ta\(start\)')]
all_data.append(data)
df = pd.DataFrame(all_data[1:], columns=all_data[0])
print(df)
print()
Prints:
Breakdown ttm 12/31/2022 12/31/2021 12/31/2020 12/31/2019
0 Total Revenue 513,983,000 513,983,000 469,822,000 386,064,000 280,522,000
1 Cost of Revenue 446,343,000 446,343,000 403,507,000 334,564,000 241,699,000
2 Gross Profit 67,640,000 67,640,000 66,315,000 51,500,000 38,823,000
3 Operating Expense 55,392,000 55,392,000 41,436,000 28,601,000 24,282,000
4 Operating Income 12,248,000 12,248,000 24,879,000 22,899,000 14,541,000
5 Net Non Operating Interest Income Expense -1,378,000 -1,378,000 -1,361,000 -1,092,000 -768,000
6 Other Income Expense -16,806,000 -16,806,000 14,633,000 2,371,000 203,000
7 Pretax Income -5,936,000 -5,936,000 38,151,000 24,178,000 13,976,000
8 Tax Provision -3,217,000 -3,217,000 4,791,000 2,863,000 2,374,000
9 Earnings from Equity Interest Net of Tax -3,000 -3,000 4,000 16,000 -14,000
10 Net Income Common Stockholders -2,722,000 -2,722,000 33,364,000 21,331,000 11,588,000
11 Diluted NI Available to Com Stockholders -2,722,000 -2,722,000 33,364,000 21,331,000 11,588,000
12 Basic EPS - -0.27 3.30 2.13 1.17
13 Diluted EPS - -0.27 3.24 2.09 1.15
14 Basic Average Shares - 10,189,000 10,120,000 10,000,000 9,880,000
15 Diluted Average Shares - 10,189,000 10,300,000 10,200,000 10,080,000
16 Total Operating Income as Reported 12,248,000 12,248,000 24,879,000 22,899,000 14,541,000
17 Total Expenses 501,735,000 501,735,000 444,943,000 363,165,000 265,981,000
18 Net Income from Continuing & Discontinued Operation -2,722,000 -2,722,000 33,364,000 21,331,000 11,588,000
19 Normalized Income 7,037,600 -2,722,000 33,364,000 21,331,000 11,588,000
20 Interest Income 989,000 989,000 448,000 555,000 832,000
21 Interest Expense 2,367,000 2,367,000 1,809,000 1,647,000 1,600,000
22 Net Interest Income -1,378,000 -1,378,000 -1,361,000 -1,092,000 -768,000
23 EBIT -3,569,000 -3,569,000 39,960,000 25,825,000 15,576,000
24 EBITDA 38,352,000 - - - -
25 Reconciled Cost of Revenue 446,343,000 446,343,000 403,507,000 334,564,000 241,699,000
26 Reconciled Depreciation 41,921,000 41,921,000 34,296,000 25,251,000 21,789,000
27 Net Income from Continuing Operation Net Minority Interest -2,722,000 -2,722,000 33,364,000 21,331,000 11,588,000
28 Total Unusual Items Excluding Goodwill -16,266,000 -16,266,000 14,652,000 - 203,000
29 Total Unusual Items -16,266,000 -16,266,000 14,652,000 - 203,000
30 Normalized EBITDA 54,618,000 38,352,000 74,256,000 51,076,000 37,365,000
31 Tax Rate for Calcs 0 0 0 0 0
32 Tax Effect of Unusual Items -6,506,400 0 0 0 0
Breakdown 12/31/2022 12/31/2021 12/31/2020 12/31/2019
0 Total Assets 462,675,000 420,549,000 321,195,000 225,248,000
1 Total Liabilities Net Minority Interest 316,632,000 282,304,000 227,791,000 163,188,000
2 Total Equity Gross Minority Interest 146,043,000 138,245,000 93,404,000 62,060,000
3 Total Capitalization 213,193,000 186,989,000 125,220,000 85,474,000
4 Common Stock Equity 146,043,000 138,245,000 93,404,000 62,060,000
5 Capital Lease Obligations 72,968,000 67,651,000 52,573,000 39,791,000
6 Net Tangible Assets 125,755,000 122,874,000 78,387,000 47,306,000
7 Working Capital -8,602,000 19,314,000 6,348,000 8,522,000
8 Invested Capital 213,193,000 186,989,000 125,220,000 85,474,000
9 Tangible Book Value 125,755,000 122,874,000 78,387,000 47,306,000
10 Total Debt 140,118,000 116,395,000 84,389,000 63,205,000
11 Net Debt 13,262,000 12,524,000 - -
12 Share Issued 10,757,000 10,640,000 10,540,000 10,420,000
13 Ordinary Shares Number 10,242,000 10,180,000 10,060,000 9,960,000
14 Treasury Shares Number 515,000 460,000 480,000 460,000
Breakdown ttm 12/31/2022 12/31/2021 12/31/2020 12/31/2019
0 Operating Cash Flow 46,752,000 46,752,000 46,327,000 66,064,000 38,514,000
1 Investing Cash Flow -37,601,000 -37,601,000 -58,154,000 -59,611,000 -24,281,000
2 Financing Cash Flow 9,718,000 9,718,000 6,291,000 -1,104,000 -10,066,000
3 End Cash Position 54,253,000 54,253,000 36,477,000 42,377,000 36,410,000
4 Income Tax Paid Supplemental Data 6,035,000 6,035,000 3,688,000 1,713,000 881,000
5 Interest Paid Supplemental Data 2,142,000 2,142,000 1,772,000 1,630,000 1,561,000
6 Capital Expenditure -63,645,000 -63,645,000 -61,053,000 -40,140,000 -16,861,000
7 Issuance of Debt 62,719,000 62,719,000 26,959,000 17,321,000 2,273,000
8 Repayment of Debt -47,001,000 -46,753,000 -20,668,000 -18,425,000 -12,339,000
9 Repurchase of Capital Stock -6,000,000 -6,000,000 - - -
10 Free Cash Flow -16,893,000 -16,893,000 -14,726,000 25,924,000 21,653,000
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论