Web Scraping Yahoo Finance Python

huangapple go评论71阅读模式
英文:

Web Scraping Yahoo Finance Python

问题

我正在尝试收集Yahoo财务数据,以便从利润表、资产负债表和现金流量报表中获取给定股票代码的DataFrame。(以下是提供的URL)

我使用了来自https://stackoverflow.com/questions/70090315/balance-sheet-from-using-yfinance-does-not-have-total-debt-like-on-yahoo-finan 的此函数,但它只对股票代码“AAPL”有效,对其他股票代码无效。

我想要一个更强大的网络爬取工具,可以适用于任何股票代码,并能在不做太多修改的情况下获取这三份报告。

我计划为每个报告编写单独的函数。

import pandas as pd
import requests
from datetime import datetime
from bs4 import BeautifulSoup

def retrieve_balance_sheet(ticker):
    ticker = ticker.upper()

    url = f"https://finance.yahoo.com/quote/{ticker}/balance-sheet?p={ticker}"
    header = {'Connection': 'keep-alive',
                'Expires': '-1',
                'Upgrade-Insecure-Requests': '1',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) \
                AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
                }
        
    r = requests.get(url, headers=header)
    html = r.text
    soup = BeautifulSoup(html, "html.parser")

    div = soup.find_all('div', attrs={'class': 'D(tbhg)'})
    if len(div) < 1:
        print("Fail to retrieve table column header")
        exit(0)

    col = []
    for h in div[0].find_all('span'):
        text = h.get_text()
        if text != "Breakdown":
            col.append( datetime.strptime(text, "%m/%d/%Y") )
    
    df = pd.DataFrame(columns=col)
    for div in soup.find_all('div', attrs={'data-test': 'fin-row'}):
        i = 0
        idx = ""
        val = []
        for h in div.find_all('span') :
            if i == 0:
                idx = h.get_text()
            else:
                num = int(h.get_text().replace(",", "")) * 1000
                val.append( num )
            i += 1
        row = pd.DataFrame([val], columns=col, index=[idx] )
        df = pd.concat([df, row])

    return df
英文:

I am trying to gather yahoo finance data for a given ticker symbol in a Dataframe from the income statement, balance sheet, and cash flow reports.(URL's provided below)

I used this function from https://stackoverflow.com/questions/70090315/balance-sheet-from-using-yfinance-does-not-have-total-debt-like-on-yahoo-finan but it only worked for ticker "AAPL" and nothing else.

                                          2022-09-30    2021-09-30    2020-09-30    2019-09-30
Total Assets                             352755000000  351002000000  323888000000  338516000000
Total Liabilities Net Minority Interest  302083000000  287912000000  258549000000  248028000000
Total Equity Gross Minority Interest      50672000000   63090000000   65339000000   90488000000
Total Capitalization                     149631000000  172196000000  164006000000  182295000000
Common Stock Equity                       50672000000   63090000000   65339000000   90488000000
Net Tangible Assets                       50672000000   63090000000   65339000000   90488000000
Working Capital                          -18577000000    9355000000   38321000000   57101000000
Invested Capital                         170741000000  187809000000  177775000000  198535000000
Tangible Book Value                       50672000000   63090000000   65339000000   90488000000
Total Debt                               120069000000  124719000000  112436000000  108047000000
Net Debt                                  96423000000   89779000000   74420000000   59203000000
Share Issued                              15943425000   16426786000   16976763000   17772944000
Ordinary Shares Number                    15943425000   16426786000   16976763000   17772944000

I would like a more robust web scraper that will work for any ticker and be able to get all 3 of those reports without much modification.

I plan to have a separate functions for each one

import pandas as pd
import requests
from datetime import datetime
from bs4 import BeautifulSoup

def retrieve_balance_sheet(ticker):
    ticker = ticker.upper()

    url = f&quot;https://finance.yahoo.com/quote/{ticker}/balance-sheet?p={ticker}&quot;
    header = {&#39;Connection&#39;: &#39;keep-alive&#39;,
                &#39;Expires&#39;: &#39;-1&#39;,
                &#39;Upgrade-Insecure-Requests&#39;: &#39;1&#39;,
                &#39;User-Agent&#39;: &#39;Mozilla/5.0 (Windows NT 10.0; WOW64) \
                AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36&#39;
                }
        
    r = requests.get(url, headers=header)
    html = r.text
    soup = BeautifulSoup(html, &quot;html.parser&quot;)

    div = soup.find_all(&#39;div&#39;, attrs={&#39;class&#39;: &#39;D(tbhg)&#39;})
    if len(div) &lt; 1:
        print(&quot;Fail to retrieve table column header&quot;)
        exit(0)

    col = []
    for h in div[0].find_all(&#39;span&#39;):
        text = h.get_text()
        if text != &quot;Breakdown&quot;:
            col.append( datetime.strptime(text, &quot;%m/%d/%Y&quot;) )
    
    df = pd.DataFrame(columns=col)
    for div in soup.find_all(&#39;div&#39;, attrs={&#39;data-test&#39;: &#39;fin-row&#39;}):
        i = 0
        idx = &quot;&quot;
        val = []
        for h in div.find_all(&#39;span&#39;) :
            if i == 0:
                idx = h.get_text()
            else:
                num = int(h.get_text().replace(&quot;,&quot;, &quot;&quot;)) * 1000
                val.append( num )
            i += 1
        row = pd.DataFrame([val], columns=col, index=[idx] )
        df = pd.concat([df, row])

    return df

答案1

得分: 1

以下是翻译好的代码部分:

import requests
import pandas as pd
from bs4 import BeautifulSoup

urls = [
    'https://finance.yahoo.com/quote/{ticker}/financials?p={ticker}',
    'https://finance.yahoo.com/quote/{ticker}/balance-sheet?p={ticker}',
    'https://finance.yahoo.com/quote/{ticker}/cash-flow?p={ticker}'
]

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0'}

def get_soup(url):
    r = requests.get(url, headers=headers)
    return BeautifulSoup(r.content, 'html.parser')

ticker = 'AMZN'
for url in urls:
    soup = get_soup(url.format(ticker=ticker))

    table = soup.select_one('.BdT')
    all_data = []
    for row in table.select('.D\(tbr\)'):
        data = [cell.text for cell in row.select('.Ta\(c\), .Ta\(start\)')]
        all_data.append(data)

    df = pd.DataFrame(all_data[1:], columns=all_data[0])
    print(df)
    print()

如果您需要其他帮助,请随时提出。

英文:

To get the tables from the 3 URLs you can try:

import requests
import pandas as pd
from bs4 import BeautifulSoup

urls = [
    &#39;https://finance.yahoo.com/quote/{ticker}/financials?p={ticker}&#39;,
    &#39;https://finance.yahoo.com/quote/{ticker}/balance-sheet?p={ticker}&#39;,
    &#39;https://finance.yahoo.com/quote/{ticker}/cash-flow?p={ticker}&#39;
]

headers = {&#39;User-Agent&#39;: &#39;Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0&#39;}

def get_soup(url):
    r = requests.get(url, headers=headers)
    return BeautifulSoup(r.content, &#39;html.parser&#39;)

ticker = &#39;AMZN&#39;
for url in urls:
    soup = get_soup(url.format(ticker=ticker))

    table = soup.select_one(&#39;.BdT&#39;)
    all_data = []
    for row in table.select(&#39;.D\(tbr\)&#39;):
        data = [cell.text for cell in row.select(&#39;.Ta\(c\), .Ta\(start\)&#39;)]
        all_data.append(data)

    df = pd.DataFrame(all_data[1:], columns=all_data[0])
    print(df)
    print()

Prints:

                                                     Breakdown          ttm   12/31/2022   12/31/2021   12/31/2020   12/31/2019
0                                                Total Revenue  513,983,000  513,983,000  469,822,000  386,064,000  280,522,000
1                                              Cost of Revenue  446,343,000  446,343,000  403,507,000  334,564,000  241,699,000
2                                                 Gross Profit   67,640,000   67,640,000   66,315,000   51,500,000   38,823,000
3                                            Operating Expense   55,392,000   55,392,000   41,436,000   28,601,000   24,282,000
4                                             Operating Income   12,248,000   12,248,000   24,879,000   22,899,000   14,541,000
5                    Net Non Operating Interest Income Expense   -1,378,000   -1,378,000   -1,361,000   -1,092,000     -768,000
6                                         Other Income Expense  -16,806,000  -16,806,000   14,633,000    2,371,000      203,000
7                                                Pretax Income   -5,936,000   -5,936,000   38,151,000   24,178,000   13,976,000
8                                                Tax Provision   -3,217,000   -3,217,000    4,791,000    2,863,000    2,374,000
9                     Earnings from Equity Interest Net of Tax       -3,000       -3,000        4,000       16,000      -14,000
10                              Net Income Common Stockholders   -2,722,000   -2,722,000   33,364,000   21,331,000   11,588,000
11                    Diluted NI Available to Com Stockholders   -2,722,000   -2,722,000   33,364,000   21,331,000   11,588,000
12                                                   Basic EPS            -        -0.27         3.30         2.13         1.17
13                                                 Diluted EPS            -        -0.27         3.24         2.09         1.15
14                                        Basic Average Shares            -   10,189,000   10,120,000   10,000,000    9,880,000
15                                      Diluted Average Shares            -   10,189,000   10,300,000   10,200,000   10,080,000
16                          Total Operating Income as Reported   12,248,000   12,248,000   24,879,000   22,899,000   14,541,000
17                                              Total Expenses  501,735,000  501,735,000  444,943,000  363,165,000  265,981,000
18         Net Income from Continuing &amp; Discontinued Operation   -2,722,000   -2,722,000   33,364,000   21,331,000   11,588,000
19                                           Normalized Income    7,037,600   -2,722,000   33,364,000   21,331,000   11,588,000
20                                             Interest Income      989,000      989,000      448,000      555,000      832,000
21                                            Interest Expense    2,367,000    2,367,000    1,809,000    1,647,000    1,600,000
22                                         Net Interest Income   -1,378,000   -1,378,000   -1,361,000   -1,092,000     -768,000
23                                                        EBIT   -3,569,000   -3,569,000   39,960,000   25,825,000   15,576,000
24                                                      EBITDA   38,352,000            -            -            -            -
25                                  Reconciled Cost of Revenue  446,343,000  446,343,000  403,507,000  334,564,000  241,699,000
26                                     Reconciled Depreciation   41,921,000   41,921,000   34,296,000   25,251,000   21,789,000
27  Net Income from Continuing Operation Net Minority Interest   -2,722,000   -2,722,000   33,364,000   21,331,000   11,588,000
28                      Total Unusual Items Excluding Goodwill  -16,266,000  -16,266,000   14,652,000            -      203,000
29                                         Total Unusual Items  -16,266,000  -16,266,000   14,652,000            -      203,000
30                                           Normalized EBITDA   54,618,000   38,352,000   74,256,000   51,076,000   37,365,000
31                                          Tax Rate for Calcs            0            0            0            0            0
32                                 Tax Effect of Unusual Items   -6,506,400            0            0            0            0

                                  Breakdown   12/31/2022   12/31/2021   12/31/2020   12/31/2019
0                              Total Assets  462,675,000  420,549,000  321,195,000  225,248,000
1   Total Liabilities Net Minority Interest  316,632,000  282,304,000  227,791,000  163,188,000
2      Total Equity Gross Minority Interest  146,043,000  138,245,000   93,404,000   62,060,000
3                      Total Capitalization  213,193,000  186,989,000  125,220,000   85,474,000
4                       Common Stock Equity  146,043,000  138,245,000   93,404,000   62,060,000
5                 Capital Lease Obligations   72,968,000   67,651,000   52,573,000   39,791,000
6                       Net Tangible Assets  125,755,000  122,874,000   78,387,000   47,306,000
7                           Working Capital   -8,602,000   19,314,000    6,348,000    8,522,000
8                          Invested Capital  213,193,000  186,989,000  125,220,000   85,474,000
9                       Tangible Book Value  125,755,000  122,874,000   78,387,000   47,306,000
10                               Total Debt  140,118,000  116,395,000   84,389,000   63,205,000
11                                 Net Debt   13,262,000   12,524,000            -            -
12                             Share Issued   10,757,000   10,640,000   10,540,000   10,420,000
13                   Ordinary Shares Number   10,242,000   10,180,000   10,060,000    9,960,000
14                   Treasury Shares Number      515,000      460,000      480,000      460,000

                            Breakdown          ttm   12/31/2022   12/31/2021   12/31/2020   12/31/2019
0                 Operating Cash Flow   46,752,000   46,752,000   46,327,000   66,064,000   38,514,000
1                 Investing Cash Flow  -37,601,000  -37,601,000  -58,154,000  -59,611,000  -24,281,000
2                 Financing Cash Flow    9,718,000    9,718,000    6,291,000   -1,104,000  -10,066,000
3                   End Cash Position   54,253,000   54,253,000   36,477,000   42,377,000   36,410,000
4   Income Tax Paid Supplemental Data    6,035,000    6,035,000    3,688,000    1,713,000      881,000
5     Interest Paid Supplemental Data    2,142,000    2,142,000    1,772,000    1,630,000    1,561,000
6                 Capital Expenditure  -63,645,000  -63,645,000  -61,053,000  -40,140,000  -16,861,000
7                    Issuance of Debt   62,719,000   62,719,000   26,959,000   17,321,000    2,273,000
8                   Repayment of Debt  -47,001,000  -46,753,000  -20,668,000  -18,425,000  -12,339,000
9         Repurchase of Capital Stock   -6,000,000   -6,000,000            -            -            -
10                     Free Cash Flow  -16,893,000  -16,893,000  -14,726,000   25,924,000   21,653,000

huangapple
  • 本文由 发表于 2023年6月13日 03:15:31
  • 转载请务必保留本文链接:https://go.coder-hub.com/76459687.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定