Web Scraping Yahoo Finance Python


Question


I am trying to gather Yahoo Finance data for a given ticker symbol into a DataFrame from the income statement, balance sheet, and cash flow reports (URLs provided below).

I used this function from https://stackoverflow.com/questions/70090315/balance-sheet-from-using-yfinance-does-not-have-total-debt-like-on-yahoo-finan, but it only worked for the ticker "AAPL" and nothing else.

                                               2022-09-30    2021-09-30    2020-09-30    2019-09-30
    Total Assets                             352755000000  351002000000  323888000000  338516000000
    Total Liabilities Net Minority Interest  302083000000  287912000000  258549000000  248028000000
    Total Equity Gross Minority Interest      50672000000   63090000000   65339000000   90488000000
    Total Capitalization                     149631000000  172196000000  164006000000  182295000000
    Common Stock Equity                       50672000000   63090000000   65339000000   90488000000
    Net Tangible Assets                       50672000000   63090000000   65339000000   90488000000
    Working Capital                          -18577000000    9355000000   38321000000   57101000000
    Invested Capital                         170741000000  187809000000  177775000000  198535000000
    Tangible Book Value                       50672000000   63090000000   65339000000   90488000000
    Total Debt                               120069000000  124719000000  112436000000  108047000000
    Net Debt                                  96423000000   89779000000   74420000000   59203000000
    Share Issued                              15943425000   16426786000   16976763000   17772944000
    Ordinary Shares Number                    15943425000   16426786000   16976763000   17772944000

I would like a more robust web scraper that works for any ticker and can get all three of those reports without much modification.

I plan to have a separate function for each report.

    import pandas as pd
    import requests
    from datetime import datetime
    from bs4 import BeautifulSoup

    def retrieve_balance_sheet(ticker):
        ticker = ticker.upper()
        url = f"https://finance.yahoo.com/quote/{ticker}/balance-sheet?p={ticker}"
        header = {'Connection': 'keep-alive',
                  'Expires': '-1',
                  'Upgrade-Insecure-Requests': '1',
                  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                                'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
                  }
        r = requests.get(url, headers=header)
        html = r.text
        soup = BeautifulSoup(html, "html.parser")

        # The header row holds "Breakdown" followed by one date per column
        div = soup.find_all('div', attrs={'class': 'D(tbhg)'})
        if len(div) < 1:
            print("Fail to retrieve table column header")
            exit(0)

        col = []
        for h in div[0].find_all('span'):
            text = h.get_text()
            if text != "Breakdown":
                col.append( datetime.strptime(text, "%m/%d/%Y") )

        # Each data row: the first span is the label, the rest are values
        df = pd.DataFrame(columns=col)
        for div in soup.find_all('div', attrs={'data-test': 'fin-row'}):
            i = 0
            idx = ""
            val = []
            for h in div.find_all('span') :
                if i == 0:
                    idx = h.get_text()
                else:
                    # Values are treated as thousands, hence the factor of 1000
                    num = int(h.get_text().replace(",", "")) * 1000
                    val.append( num )
                i += 1
            row = pd.DataFrame([val], columns=col, index=[idx] )
            df = pd.concat([df, row])
        return df

Answer 1

Score: 1


To get the tables from the 3 URLs you can try:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    urls = [
        'https://finance.yahoo.com/quote/{ticker}/financials?p={ticker}',
        'https://finance.yahoo.com/quote/{ticker}/balance-sheet?p={ticker}',
        'https://finance.yahoo.com/quote/{ticker}/cash-flow?p={ticker}'
    ]

    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0'}

    def get_soup(url):
        r = requests.get(url, headers=headers)
        return BeautifulSoup(r.content, 'html.parser')

    ticker = 'AMZN'

    for url in urls:
        soup = get_soup(url.format(ticker=ticker))
        table = soup.select_one('.BdT')

        # Raw strings keep the escaped parentheses in the CSS class names intact
        all_data = []
        for row in table.select(r'.D\(tbr\)'):
            data = [cell.text for cell in row.select(r'.Ta\(c\), .Ta\(start\)')]
            all_data.append(data)

        # The first row is the header; the remaining rows are the line items
        df = pd.DataFrame(all_data[1:], columns=all_data[0])
        print(df)
        print()

Prints:

        Breakdown                                                           ttm   12/31/2022   12/31/2021   12/31/2020   12/31/2019
     0  Total Revenue                                               513,983,000  513,983,000  469,822,000  386,064,000  280,522,000
     1  Cost of Revenue                                             446,343,000  446,343,000  403,507,000  334,564,000  241,699,000
     2  Gross Profit                                                 67,640,000   67,640,000   66,315,000   51,500,000   38,823,000
     3  Operating Expense                                            55,392,000   55,392,000   41,436,000   28,601,000   24,282,000
     4  Operating Income                                             12,248,000   12,248,000   24,879,000   22,899,000   14,541,000
     5  Net Non Operating Interest Income Expense                    -1,378,000   -1,378,000   -1,361,000   -1,092,000     -768,000
     6  Other Income Expense                                        -16,806,000  -16,806,000   14,633,000    2,371,000      203,000
     7  Pretax Income                                                -5,936,000   -5,936,000   38,151,000   24,178,000   13,976,000
     8  Tax Provision                                                -3,217,000   -3,217,000    4,791,000    2,863,000    2,374,000
     9  Earnings from Equity Interest Net of Tax                         -3,000       -3,000        4,000       16,000      -14,000
    10  Net Income Common Stockholders                               -2,722,000   -2,722,000   33,364,000   21,331,000   11,588,000
    11  Diluted NI Available to Com Stockholders                     -2,722,000   -2,722,000   33,364,000   21,331,000   11,588,000
    12  Basic EPS                                                             -        -0.27         3.30         2.13         1.17
    13  Diluted EPS                                                           -        -0.27         3.24         2.09         1.15
    14  Basic Average Shares                                                  -   10,189,000   10,120,000   10,000,000    9,880,000
    15  Diluted Average Shares                                                -   10,189,000   10,300,000   10,200,000   10,080,000
    16  Total Operating Income as Reported                           12,248,000   12,248,000   24,879,000   22,899,000   14,541,000
    17  Total Expenses                                              501,735,000  501,735,000  444,943,000  363,165,000  265,981,000
    18  Net Income from Continuing & Discontinued Operation          -2,722,000   -2,722,000   33,364,000   21,331,000   11,588,000
    19  Normalized Income                                             7,037,600   -2,722,000   33,364,000   21,331,000   11,588,000
    20  Interest Income                                                 989,000      989,000      448,000      555,000      832,000
    21  Interest Expense                                              2,367,000    2,367,000    1,809,000    1,647,000    1,600,000
    22  Net Interest Income                                          -1,378,000   -1,378,000   -1,361,000   -1,092,000     -768,000
    23  EBIT                                                         -3,569,000   -3,569,000   39,960,000   25,825,000   15,576,000
    24  EBITDA                                                       38,352,000            -            -            -            -
    25  Reconciled Cost of Revenue                                  446,343,000  446,343,000  403,507,000  334,564,000  241,699,000
    26  Reconciled Depreciation                                      41,921,000   41,921,000   34,296,000   25,251,000   21,789,000
    27  Net Income from Continuing Operation Net Minority Interest   -2,722,000   -2,722,000   33,364,000   21,331,000   11,588,000
    28  Total Unusual Items Excluding Goodwill                      -16,266,000  -16,266,000   14,652,000            -      203,000
    29  Total Unusual Items                                         -16,266,000  -16,266,000   14,652,000            -      203,000
    30  Normalized EBITDA                                            54,618,000   38,352,000   74,256,000   51,076,000   37,365,000
    31  Tax Rate for Calcs                                                    0            0            0            0            0
    32  Tax Effect of Unusual Items                                  -6,506,400            0            0            0            0

        Breakdown                                 12/31/2022   12/31/2021   12/31/2020   12/31/2019
     0  Total Assets                             462,675,000  420,549,000  321,195,000  225,248,000
     1  Total Liabilities Net Minority Interest  316,632,000  282,304,000  227,791,000  163,188,000
     2  Total Equity Gross Minority Interest     146,043,000  138,245,000   93,404,000   62,060,000
     3  Total Capitalization                     213,193,000  186,989,000  125,220,000   85,474,000
     4  Common Stock Equity                      146,043,000  138,245,000   93,404,000   62,060,000
     5  Capital Lease Obligations                 72,968,000   67,651,000   52,573,000   39,791,000
     6  Net Tangible Assets                      125,755,000  122,874,000   78,387,000   47,306,000
     7  Working Capital                           -8,602,000   19,314,000    6,348,000    8,522,000
     8  Invested Capital                         213,193,000  186,989,000  125,220,000   85,474,000
     9  Tangible Book Value                      125,755,000  122,874,000   78,387,000   47,306,000
    10  Total Debt                               140,118,000  116,395,000   84,389,000   63,205,000
    11  Net Debt                                  13,262,000   12,524,000            -            -
    12  Share Issued                              10,757,000   10,640,000   10,540,000   10,420,000
    13  Ordinary Shares Number                    10,242,000   10,180,000   10,060,000    9,960,000
    14  Treasury Shares Number                       515,000      460,000      480,000      460,000

        Breakdown                                  ttm   12/31/2022   12/31/2021   12/31/2020   12/31/2019
     0  Operating Cash Flow                 46,752,000   46,752,000   46,327,000   66,064,000   38,514,000
     1  Investing Cash Flow                -37,601,000  -37,601,000  -58,154,000  -59,611,000  -24,281,000
     2  Financing Cash Flow                  9,718,000    9,718,000    6,291,000   -1,104,000  -10,066,000
     3  End Cash Position                   54,253,000   54,253,000   36,477,000   42,377,000   36,410,000
     4  Income Tax Paid Supplemental Data    6,035,000    6,035,000    3,688,000    1,713,000      881,000
     5  Interest Paid Supplemental Data      2,142,000    2,142,000    1,772,000    1,630,000    1,561,000
     6  Capital Expenditure                -63,645,000  -63,645,000  -61,053,000  -40,140,000  -16,861,000
     7  Issuance of Debt                    62,719,000   62,719,000   26,959,000   17,321,000    2,273,000
     8  Repayment of Debt                  -47,001,000  -46,753,000  -20,668,000  -18,425,000  -12,339,000
     9  Repurchase of Capital Stock         -6,000,000   -6,000,000            -            -            -
    10  Free Cash Flow                     -16,893,000  -16,893,000  -14,726,000   25,924,000   21,653,000
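
Note that the cells printed above are still strings: they contain thousands separators, and some are just "-". If you need numbers, a small converter could look like the sketch below. It assumes that "-" (or an empty cell) means a missing value and that commas are thousands separators; `parse_cell` and its `scale` parameter are hypothetical names for illustration, not part of the code above.

```python
from typing import Optional


def parse_cell(text: str, scale: int = 1) -> Optional[float]:
    """Convert one scraped table cell to a float, or None for a blank cell.

    Assumptions (not confirmed by Yahoo): "-" or an empty string marks a
    missing value, and commas are thousands separators.
    """
    cleaned = text.strip().replace(",", "")
    if cleaned in ("", "-"):
        return None
    return float(cleaned) * scale


# One row taken from the balance sheet output above
row = ["Net Debt", "13,262,000", "12,524,000", "-", "-"]
label, values = row[0], [parse_cell(cell) for cell in row[1:]]
print(label, values)  # Net Debt [13262000.0, 12524000.0, None, None]
```

With the `df` built in the loop, something like `df.set_index('Breakdown').apply(lambda col: col.map(parse_cell))` should convert every value column. Be careful with `scale`: the function in the question multiplies everything by 1000, but rows such as Basic EPS are per-share figures and probably should not be scaled.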

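If you want one call per report rather than a loop, the same selectors can be wrapped in a pair of functions. This is only a sketch under the same assumptions as the code in this answer: the CSS classes (`.BdT`, `.D(tbr)`, `.Ta(c)`, `.Ta(start)`) are the ones used there and may stop matching whenever Yahoo changes its markup. The names `REPORT_PAGES`, `parse_report`, and `get_report` are made up for this example.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Page names taken from the three URLs used in this answer
REPORT_PAGES = {
    "income_statement": "financials",
    "balance_sheet": "balance-sheet",
    "cash_flow": "cash-flow",
}

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) "
                  "Gecko/20100101 Firefox/114.0"
}


def parse_report(html: str) -> pd.DataFrame:
    """Turn the HTML of one statement page into a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one(".BdT")
    if table is None:
        # Raise instead of exiting so the caller can decide what to do
        raise ValueError("statement table not found; the page layout may have changed")
    rows = [
        [cell.get_text(strip=True) for cell in row.select(r".Ta\(c\), .Ta\(start\)")]
        for row in table.select(r".D\(tbr\)")
    ]
    if not rows:
        raise ValueError("statement table has no rows")
    # First row is the header ("Breakdown", then the period columns)
    return pd.DataFrame(rows[1:], columns=rows[0]).set_index(rows[0][0])


def get_report(ticker: str, report: str) -> pd.DataFrame:
    """Download and parse one report; `report` is a key of REPORT_PAGES."""
    ticker = ticker.upper()
    url = f"https://finance.yahoo.com/quote/{ticker}/{REPORT_PAGES[report]}?p={ticker}"
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return parse_report(response.text)
```

Calling `get_report("AMZN", "balance_sheet")` would then return the balance sheet indexed by the "Breakdown" column, and the other two keys give the other reports. Keeping the parsing in `parse_report` separate from the download means you can test it against a saved HTML file without hitting the site.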
Posted by huangapple on June 13, 2023 at 03:15:31