Web scrape to obtain table data from the GuruFocus site

Question

I want to scrape specific data from the GuruFocus website:
https://www.gurufocus.com/stock/AAHTF/summary?search=AAPICO

Currently I am fetching only the numeric values. For example, the Financial Strength value is "4" out of 10. Now I want to fetch the sub-component data as well.

Code that fetches only the numeric values:

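import json
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# `names` is the list of company names to look up (defined elsewhere).
for name in names: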
    start_time = time.time()

    # getting the symbol
    URL = f'https://www.gurufocus.com/search?s={name}'
    driver = webdriver.Chrome(ChromeDriverManager().install())
    driver.get(URL)
    html_source = driver.page_source
    driver.close()

    soup = BeautifulSoup(html_source, 'html.parser')

    headers = soup.find_all("span")
    
    # saving only the first link
    for i, head in enumerate(headers):
        try:
            h = head.find("a").get("href")
            link = "https://www.gurufocus.com" + h
            break
        except:
            pass

    try:
        # loading the link page
        driver = webdriver.Chrome(ChromeDriverManager().install())
        driver.get(link)
        html_source = driver.page_source
        driver.close()

        soup = BeautifulSoup(html_source, 'html.parser')

        headers = soup.find_all("span", class_="t-default bold")
        ratings = []
        for head in headers:
            ratings.append(int(head.get_text()))
        if len(ratings) == 0:
            continue
        ratings_dict = {"Financial Strength": ratings[0],
                        "Growth Rank"       : ratings[1],
                        "Momentum Rank"     : ratings[2],
                        "Profitability Rank": ratings[3],
                        "GF Value Rank"     : ratings[4],
                       }
        print(ratings_dict)
        #     ratings_dict = json.loads(ratings_dict)
        with open(f"output/gurufocus/{name}.json", 'w') as f:
            json.dump(str(ratings_dict), f)
        end_time = time.time()
        print("time taken for %s is: %.2f" %(name, (end_time-start_time)))
    except:
        print("no data found")

Output:

"{'Financial Strength': 6, 'Growth Rank': 4, 'Momentum Rank': 4, 'Profitability Rank': 7, 'GF Value Rank': 5}"
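The outer quotes in this output come from passing `str(ratings_dict)` to `json.dump`, which serializes a single JSON string rather than a JSON object. A minimal sketch of the difference, using `json.dumps` (the in-memory counterpart of `json.dump`) and a shortened, illustrative dictionary:

```python
import json

# Shortened stand-in for the ratings_dict built in the loop above.
ratings_dict = {"Financial Strength": 6, "Growth Rank": 4}

# Serializing str(dict) yields one JSON string, so the result has outer quotes.
as_string = json.dumps(str(ratings_dict))

# Serializing the dict itself yields a JSON object that loads back as a dict.
as_object = json.dumps(ratings_dict)

print(as_string)
print(as_object)
```

Passing the dictionary directly would make the saved file loadable as a dictionary again with `json.load`.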

Expected:
I want to fetch the full table data (the sub-component rows shown in the screenshot), along with the rank, into a DataFrame.

How should I change my code to obtain this additional data?

Answer 1

Score: 4

You can use Pandas to write a clean solution for this problem:

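import json
from collections import ChainMap
from io import StringIO

import pandas as pd
import requests

URL = 'https://www.gurufocus.com/stock/AAHTF/summary?search=AAPICO'

# Parse every <table> on the page into a DataFrame, using the first row as the header.
# Wrapping the HTML in StringIO avoids the deprecation warning that newer pandas
# versions raise for literal HTML strings.
tables = pd.read_html(StringIO(requests.get(URL).text), header=0)

# For each table, build one {Name: Current} dict per row. The round trip through
# to_json turns missing values into None, which json.dumps prints as null.
sub_table_values = [
    [{record["Name"]: record["Current"]} for record in json.loads(table.to_json(orient="records"))]
    for table in tables
]

# Merge the single-entry dicts of each table into one dict per table.
sub_formatted = [dict(ChainMap(*rows)) for rows in sub_table_values]
print(json.dumps(sub_formatted, indent=4))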

Description:

  • First, I fetch all the tables on the page and convert them to DataFrames (using pandas).
  • Then, I convert the DataFrames to JSON and extract only the desired fields (Name and Current).
  • Finally, I merge the extracted values into one dictionary per table.
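As a hedged alternative to the JSON round trip in the second step, the Name-to-Current mapping can also be built directly from a DataFrame. The sketch below uses a hand-built table as a stand-in for one element of `tables`, with values copied from the sample output in this answer. The helper name `name_to_current` is hypothetical.

```python
import pandas as pd

# Stand-in for one table returned by pd.read_html (assumed columns: Name, Current).
table = pd.DataFrame(
    {
        "Name": ["Cash-To-Debt", "Equity-to-Asset", "WACC vs ROIC"],
        "Current": ["0.1", "0.37", None],
    }
)


def name_to_current(df: pd.DataFrame) -> dict:
    """Map each row's Name to its Current value, using None for missing cells."""
    return {
        name: (None if pd.isna(value) else value)
        for name, value in zip(df["Name"], df["Current"])
    }


result = name_to_current(table)
print(result)
```

This keeps the rows in their original order and skips the serialization step. Whether that matters depends on how the result is consumed.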

Running the pandas script on the GuruFocus page returns:

[
    {
        "WACC vs ROIC": null,
        "Beneish M-Score": "-2.28",
        "Altman Z-Score": "1.98",
        "Piotroski F-Score": "7/9",
        "Interest Coverage": "4.68",
        "Debt-to-EBITDA": "2.55",
        "Debt-to-Equity": "0.85",
        "Equity-to-Asset": "0.37",
        "Cash-To-Debt": "0.1"
    },
    {
        "Future 3-5Y Total Revenue Growth Rate": 13.71,
        "3-Year Book Growth Rate": 2.8,
        "3-Year FCF Growth Rate": 49.9,
        "3-Year EPS without NRI Growth Rate": -5.2,
        "3-Year EBITDA Growth Rate": 9.0,
        "3-Year Revenue Growth Rate": 9.6
    }...
]

However, this solution works only because the page is structured with HTML tables. For complex or irregular websites, I prefer Scrapy, which is what we use at my job.
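Since the question asks for a single DataFrame that includes the rank, one possible sketch is to label each table with its rank name and stack them with `pd.concat`. The tables below are hand-built stand-ins for what `pd.read_html` returns. The assumption that the page's tables appear in the same order as the rank names in the question's `ratings_dict` is not verified here, and `combine_tables` is a hypothetical helper.

```python
import pandas as pd

# Rank labels in the order used by ratings_dict in the question.
# ASSUMPTION: the tables on the page appear in this same order.
RANK_NAMES = [
    "Financial Strength",
    "Growth Rank",
    "Momentum Rank",
    "Profitability Rank",
    "GF Value Rank",
]

# Stand-ins for the first two DataFrames returned by pd.read_html.
tables = [
    pd.DataFrame({"Name": ["Cash-To-Debt", "Equity-to-Asset"], "Current": ["0.1", "0.37"]}),
    pd.DataFrame({"Name": ["3-Year Revenue Growth Rate"], "Current": [9.6]}),
]


def combine_tables(tables, rank_names):
    """Stack per-rank tables into one long DataFrame with a 'Rank' column."""
    labelled = [
        table[["Name", "Current"]].assign(Rank=rank)
        for table, rank in zip(tables, rank_names)  # zip stops at the shorter list
    ]
    combined = pd.concat(labelled, ignore_index=True)
    return combined[["Rank", "Name", "Current"]]


combined = combine_tables(tables, RANK_NAMES)
print(combined)
```

Each row then carries its rank label, so the result can be filtered or grouped by `Rank`.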

huangapple
  • Posted on February 6, 2023, 19:30:28
  • Source: https://go.coder-hub.com/75360747.html