2023年6月13日 03:15:31go评论102阅读模式

英文:

Web Scraping Yahoo Finance Python

问题

我正在尝试收集Yahoo财务数据，以便从利润表、资产负债表和现金流量报表中获取给定股票代码的DataFrame。（以下是提供的URL）

我使用了来自https://stackoverflow.com/questions/70090315/balance-sheet-from-using-yfinance-does-not-have-total-debt-like-on-yahoo-finan 的此函数，但它只对股票代码“AAPL”有效，对其他股票代码无效。

我想要一个更强大的网络爬取工具，可以适用于任何股票代码，并能在不做太多修改的情况下获取这三份报告。

我计划为每个报告编写单独的函数。

import pandas as pd
import requests
from datetime import datetime
from bs4 import BeautifulSoup
def retrieve_balance_sheet(ticker):
    ticker = ticker.upper()
    url = f"https://finance.yahoo.com/quote/{ticker}/balance-sheet?p={ticker}"
    header = {'Connection': 'keep-alive',
                'Expires': '-1',
                'Upgrade-Insecure-Requests': '1',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) \
                AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
                }
        
    r = requests.get(url, headers=header)
    html = r.text
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find_all('div', attrs={'class': 'D(tbhg)'})
    if len(div) < 1:
        print("Fail to retrieve table column header")
        exit(0)
    col = []
    for h in div[0].find_all('span'):
        text = h.get_text()
        if text != "Breakdown":
            col.append( datetime.strptime(text, "%m/%d/%Y") )
    
    df = pd.DataFrame(columns=col)
    for div in soup.find_all('div', attrs={'data-test': 'fin-row'}):
        i = 0
        idx = ""
        val = []
        for h in div.find_all('span') :
            if i == 0:
                idx = h.get_text()
            else:
                num = int(h.get_text().replace(",", "")) * 1000
                val.append( num )
            i += 1
        row = pd.DataFrame([val], columns=col, index=[idx] )
        df = pd.concat([df, row])
    return df

英文:

I am trying to gather yahoo finance data for a given ticker symbol in a Dataframe from the income statement, balance sheet, and cash flow reports.(URL's provided below)

I used this function from https://stackoverflow.com/questions/70090315/balance-sheet-from-using-yfinance-does-not-have-total-debt-like-on-yahoo-finan but it only worked for ticker "AAPL" and nothing else.

                                          2022-09-30    2021-09-30    2020-09-30    2019-09-30
Total Assets                             352755000000  351002000000  323888000000  338516000000
Total Liabilities Net Minority Interest  302083000000  287912000000  258549000000  248028000000
Total Equity Gross Minority Interest      50672000000   63090000000   65339000000   90488000000
Total Capitalization                     149631000000  172196000000  164006000000  182295000000
Common Stock Equity                       50672000000   63090000000   65339000000   90488000000
Net Tangible Assets                       50672000000   63090000000   65339000000   90488000000
Working Capital                          -18577000000    9355000000   38321000000   57101000000
Invested Capital                         170741000000  187809000000  177775000000  198535000000
Tangible Book Value                       50672000000   63090000000   65339000000   90488000000
Total Debt                               120069000000  124719000000  112436000000  108047000000
Net Debt                                  96423000000   89779000000   74420000000   59203000000
Share Issued                              15943425000   16426786000   16976763000   17772944000
Ordinary Shares Number                    15943425000   16426786000   16976763000   17772944000

I would like a more robust web scraper that will work for any ticker and be able to get all 3 of those reports without much modification.

I plan to have a separate functions for each one

import pandas as pd
import requests
from datetime import datetime
from bs4 import BeautifulSoup
def retrieve_balance_sheet(ticker):
    ticker = ticker.upper()
    url = f&quot;https://finance.yahoo.com/quote/{ticker}/balance-sheet?p={ticker}&quot;
    header = {&#39;Connection&#39;: &#39;keep-alive&#39;,
                &#39;Expires&#39;: &#39;-1&#39;,
                &#39;Upgrade-Insecure-Requests&#39;: &#39;1&#39;,
                &#39;User-Agent&#39;: &#39;Mozilla/5.0 (Windows NT 10.0; WOW64) \
                AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36&#39;
                }
        
    r = requests.get(url, headers=header)
    html = r.text
    soup = BeautifulSoup(html, &quot;html.parser&quot;)
    div = soup.find_all(&#39;div&#39;, attrs={&#39;class&#39;: &#39;D(tbhg)&#39;})
    if len(div) &lt; 1:
        print(&quot;Fail to retrieve table column header&quot;)
        exit(0)
    col = []
    for h in div[0].find_all(&#39;span&#39;):
        text = h.get_text()
        if text != &quot;Breakdown&quot;:
            col.append( datetime.strptime(text, &quot;%m/%d/%Y&quot;) )
    
    df = pd.DataFrame(columns=col)
    for div in soup.find_all(&#39;div&#39;, attrs={&#39;data-test&#39;: &#39;fin-row&#39;}):
        i = 0
        idx = &quot;&quot;
        val = []
        for h in div.find_all(&#39;span&#39;) :
            if i == 0:
                idx = h.get_text()
            else:
                num = int(h.get_text().replace(&quot;,&quot;, &quot;&quot;)) * 1000
                val.append( num )
            i += 1
        row = pd.DataFrame([val], columns=col, index=[idx] )
        df = pd.concat([df, row])
    return df

答案1

得分: 1

以下是翻译好的代码部分：

import requests
import pandas as pd
from bs4 import BeautifulSoup
urls = [
    'https://finance.yahoo.com/quote/{ticker}/financials?p={ticker}',
    'https://finance.yahoo.com/quote/{ticker}/balance-sheet?p={ticker}',
    'https://finance.yahoo.com/quote/{ticker}/cash-flow?p={ticker}'
]
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0'}
def get_soup(url):
    r = requests.get(url, headers=headers)
    return BeautifulSoup(r.content, 'html.parser')
ticker = 'AMZN'
for url in urls:
    soup = get_soup(url.format(ticker=ticker))
    table = soup.select_one('.BdT')
    all_data = []
    for row in table.select('.D\(tbr\)'):
        data = [cell.text for cell in row.select('.Ta\(c\), .Ta\(start\)')]
        all_data.append(data)
    df = pd.DataFrame(all_data[1:], columns=all_data[0])
    print(df)
    print()

如果您需要其他帮助，请随时提出。

英文:

To get the tables from the 3 URLs you can try:

import requests
import pandas as pd
from bs4 import BeautifulSoup
urls = [
    &#39;https://finance.yahoo.com/quote/{ticker}/financials?p={ticker}&#39;,
    &#39;https://finance.yahoo.com/quote/{ticker}/balance-sheet?p={ticker}&#39;,
    &#39;https://finance.yahoo.com/quote/{ticker}/cash-flow?p={ticker}&#39;
]
headers = {&#39;User-Agent&#39;: &#39;Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0&#39;}
def get_soup(url):
    r = requests.get(url, headers=headers)
    return BeautifulSoup(r.content, &#39;html.parser&#39;)
ticker = &#39;AMZN&#39;
for url in urls:
    soup = get_soup(url.format(ticker=ticker))
    table = soup.select_one(&#39;.BdT&#39;)
    all_data = []
    for row in table.select(&#39;.D\(tbr\)&#39;):
        data = [cell.text for cell in row.select(&#39;.Ta\(c\), .Ta\(start\)&#39;)]
        all_data.append(data)
    df = pd.DataFrame(all_data[1:], columns=all_data[0])
    print(df)
    print()

Prints:

                                                     Breakdown          ttm   12/31/2022   12/31/2021   12/31/2020   12/31/2019
0                                                Total Revenue  513,983,000  513,983,000  469,822,000  386,064,000  280,522,000
1                                              Cost of Revenue  446,343,000  446,343,000  403,507,000  334,564,000  241,699,000
2                                                 Gross Profit   67,640,000   67,640,000   66,315,000   51,500,000   38,823,000
3                                            Operating Expense   55,392,000   55,392,000   41,436,000   28,601,000   24,282,000
4                                             Operating Income   12,248,000   12,248,000   24,879,000   22,899,000   14,541,000
5                    Net Non Operating Interest Income Expense   -1,378,000   -1,378,000   -1,361,000   -1,092,000     -768,000
6                                         Other Income Expense  -16,806,000  -16,806,000   14,633,000    2,371,000      203,000
7                                                Pretax Income   -5,936,000   -5,936,000   38,151,000   24,178,000   13,976,000
8                                                Tax Provision   -3,217,000   -3,217,000    4,791,000    2,863,000    2,374,000
9                     Earnings from Equity Interest Net of Tax       -3,000       -3,000        4,000       16,000      -14,000
10                              Net Income Common Stockholders   -2,722,000   -2,722,000   33,364,000   21,331,000   11,588,000
11                    Diluted NI Available to Com Stockholders   -2,722,000   -2,722,000   33,364,000   21,331,000   11,588,000
12                                                   Basic EPS            -        -0.27         3.30         2.13         1.17
13                                                 Diluted EPS            -        -0.27         3.24         2.09         1.15
14                                        Basic Average Shares            -   10,189,000   10,120,000   10,000,000    9,880,000
15                                      Diluted Average Shares            -   10,189,000   10,300,000   10,200,000   10,080,000
16                          Total Operating Income as Reported   12,248,000   12,248,000   24,879,000   22,899,000   14,541,000
17                                              Total Expenses  501,735,000  501,735,000  444,943,000  363,165,000  265,981,000
18         Net Income from Continuing &amp; Discontinued Operation   -2,722,000   -2,722,000   33,364,000   21,331,000   11,588,000
19                                           Normalized Income    7,037,600   -2,722,000   33,364,000   21,331,000   11,588,000
20                                             Interest Income      989,000      989,000      448,000      555,000      832,000
21                                            Interest Expense    2,367,000    2,367,000    1,809,000    1,647,000    1,600,000
22                                         Net Interest Income   -1,378,000   -1,378,000   -1,361,000   -1,092,000     -768,000
23                                                        EBIT   -3,569,000   -3,569,000   39,960,000   25,825,000   15,576,000
24                                                      EBITDA   38,352,000            -            -            -            -
25                                  Reconciled Cost of Revenue  446,343,000  446,343,000  403,507,000  334,564,000  241,699,000
26                                     Reconciled Depreciation   41,921,000   41,921,000   34,296,000   25,251,000   21,789,000
27  Net Income from Continuing Operation Net Minority Interest   -2,722,000   -2,722,000   33,364,000   21,331,000   11,588,000
28                      Total Unusual Items Excluding Goodwill  -16,266,000  -16,266,000   14,652,000            -      203,000
29                                         Total Unusual Items  -16,266,000  -16,266,000   14,652,000            -      203,000
30                                           Normalized EBITDA   54,618,000   38,352,000   74,256,000   51,076,000   37,365,000
31                                          Tax Rate for Calcs            0            0            0            0            0
32                                 Tax Effect of Unusual Items   -6,506,400            0            0            0            0
                                  Breakdown   12/31/2022   12/31/2021   12/31/2020   12/31/2019
0                              Total Assets  462,675,000  420,549,000  321,195,000  225,248,000
1   Total Liabilities Net Minority Interest  316,632,000  282,304,000  227,791,000  163,188,000
2      Total Equity Gross Minority Interest  146,043,000  138,245,000   93,404,000   62,060,000
3                      Total Capitalization  213,193,000  186,989,000  125,220,000   85,474,000
4                       Common Stock Equity  146,043,000  138,245,000   93,404,000   62,060,000
5                 Capital Lease Obligations   72,968,000   67,651,000   52,573,000   39,791,000
6                       Net Tangible Assets  125,755,000  122,874,000   78,387,000   47,306,000
7                           Working Capital   -8,602,000   19,314,000    6,348,000    8,522,000
8                          Invested Capital  213,193,000  186,989,000  125,220,000   85,474,000
9                       Tangible Book Value  125,755,000  122,874,000   78,387,000   47,306,000
10                               Total Debt  140,118,000  116,395,000   84,389,000   63,205,000
11                                 Net Debt   13,262,000   12,524,000            -            -
12                             Share Issued   10,757,000   10,640,000   10,540,000   10,420,000
13                   Ordinary Shares Number   10,242,000   10,180,000   10,060,000    9,960,000
14                   Treasury Shares Number      515,000      460,000      480,000      460,000
                            Breakdown          ttm   12/31/2022   12/31/2021   12/31/2020   12/31/2019
0                 Operating Cash Flow   46,752,000   46,752,000   46,327,000   66,064,000   38,514,000
1                 Investing Cash Flow  -37,601,000  -37,601,000  -58,154,000  -59,611,000  -24,281,000
2                 Financing Cash Flow    9,718,000    9,718,000    6,291,000   -1,104,000  -10,066,000
3                   End Cash Position   54,253,000   54,253,000   36,477,000   42,377,000   36,410,000
4   Income Tax Paid Supplemental Data    6,035,000    6,035,000    3,688,000    1,713,000      881,000
5     Interest Paid Supplemental Data    2,142,000    2,142,000    1,772,000    1,630,000    1,561,000
6                 Capital Expenditure  -63,645,000  -63,645,000  -61,053,000  -40,140,000  -16,861,000
7                    Issuance of Debt   62,719,000   62,719,000   26,959,000   17,321,000    2,273,000
8                   Repayment of Debt  -47,001,000  -46,753,000  -20,668,000  -18,425,000  -12,339,000
9         Repurchase of Capital Stock   -6,000,000   -6,000,000            -            -            -
10                     Free Cash Flow  -16,893,000  -16,893,000  -14,726,000   25,924,000   21,653,000

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Web Scraping Yahoo Finance Python

问题

答案1

以不同的帧率呈现图像，而不是游戏窗口。

打印小于给定最大值的某个数字的倍数

检查Selenium/Python中具有ID的span元素中的文本。

VGG16迁移学习 – 未知指标函数：f1_score错误

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。