Web scrape to obtain table data from GuruFocus site

Question:
I want to scrape specific data from the GuruFocus website:
https://www.gurufocus.com/stock/AAHTF/summary?search=AAPICO
Currently I am fetching only the headline number values. For example, the Financial Strength value is "4" out of 10. Now I want to fetch the sub-component data as well.
Code that fetches only the number values:
import time
import json

from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# "names" is a list of search terms defined earlier
for name in names:
    start_time = time.time()
    # getting the symbol
    URL = f'https://www.gurufocus.com/search?s={name}'
    driver = webdriver.Chrome(ChromeDriverManager().install())
    driver.get(URL)
    html_source = driver.page_source
    driver.close()
    soup = BeautifulSoup(html_source, 'html.parser')
    headers = soup.find_all("span")
    # saving only the first link
    for head in headers:
        try:
            h = head.find("a").get("href")
            link = "https://www.gurufocus.com" + h
            break
        except AttributeError:
            pass
    try:
        # loading the link page
        driver = webdriver.Chrome(ChromeDriverManager().install())
        driver.get(link)
        html_source = driver.page_source
        driver.close()
        soup = BeautifulSoup(html_source, 'html.parser')
        headers = soup.find_all("span", class_="t-default bold")
        ratings = [int(head.get_text()) for head in headers]
        if len(ratings) == 0:
            continue
        ratings_dict = {
            "Financial Strength": ratings[0],
            "Growth Rank": ratings[1],
            "Momentum Rank": ratings[2],
            "Profitability Rank": ratings[3],
            "GF Value Rank": ratings[4],
        }
        print(ratings_dict)
        # dump the dict itself so the file contains real JSON,
        # not a quoted string
        with open(f"output/gurufocus/{name}.json", 'w') as f:
            json.dump(ratings_dict, f)
        end_time = time.time()
        print("time taken for %s is: %.2f" % (name, end_time - start_time))
    except Exception:
        print("no data found")
Output:
{'Financial Strength': 6, 'Growth Rank': 4, 'Momentum Rank': 4, 'Profitability Rank': 7, 'GF Value Rank': 5}
Expectation:
I want to fetch the full table data (shown in the image below), along with the ranks, into a DataFrame.
How do I need to change my code to obtain the other specific data?
Answer 1
Score: 4
You can use Pandas to write a clean solution for this problem:
import pandas as pd
import requests
import json
from collections import ChainMap

# read_html parses every <table> on the page into its own DataFrame
tables = pd.read_html(
    requests.get(
        'https://www.gurufocus.com/stock/AAHTF/summary?search=AAPICO'
    ).text,
    header=0
)

# for each table, keep only the "Name" and "Current" columns
sub_table_values = [
    [{record["Name"]: record["Current"]} for record in json.loads(e)]
    for e in [t.to_json(orient="records") for t in tables]
]
# merge each table's one-pair dicts into a single dict per table
sub_formatted = [dict(ChainMap(*a)) for a in sub_table_values]
print(json.dumps(sub_formatted, indent=4))
Description:
- First, I obtain all the tables and convert them to DataFrames (using pandas).
- Then, I convert the DataFrames to JSON and extract only the desired fields (Name and Current).
- Finally, I format the data.
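The Name/Current extraction and the ChainMap merge from the steps above can be sketched on a toy record set (hypothetical values, no network access needed):

```python
from collections import ChainMap

# toy stand-in for one parsed table's records, assuming the page's
# tables expose "Name" and "Current" columns as on GuruFocus
records = [
    {"Name": "Cash-To-Debt", "Current": "0.1"},
    {"Name": "Debt-to-Equity", "Current": "0.85"},
]

# one single-pair dict per row, then merged into one dict per table
pairs = [{r["Name"]: r["Current"]} for r in records]
merged = dict(ChainMap(*pairs))
# merged == {'Cash-To-Debt': '0.1', 'Debt-to-Equity': '0.85'}
```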
It would return:
[
{
"WACC vs ROIC": null,
"Beneish M-Score": "-2.28",
"Altman Z-Score": "1.98",
"Piotroski F-Score": "7/9",
"Interest Coverage": "4.68",
"Debt-to-EBITDA": "2.55",
"Debt-to-Equity": "0.85",
"Equity-to-Asset": "0.37",
"Cash-To-Debt": "0.1"
},
{
"Future 3-5Y Total Revenue Growth Rate": 13.71,
"3-Year Book Growth Rate": 2.8,
"3-Year FCF Growth Rate": 49.9,
"3-Year EPS without NRI Growth Rate": -5.2,
"3-Year EBITDA Growth Rate": 9.0,
"3-Year Revenue Growth Rate": 9.6
}...
]
However, this solution works only because the page is structured with tables. For complex or irregular websites I prefer to use Scrapy, as we do at my job.
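Since the question asks for the result in a DataFrame, the per-table dicts can be flattened and loaded directly. A minimal sketch, with hypothetical sample values standing in for the scraped `sub_formatted` list:

```python
import pandas as pd

# hypothetical sample standing in for the scraped sub_formatted list
sub_formatted = [
    {"Cash-To-Debt": "0.1", "Debt-to-Equity": "0.85"},
    {"3-Year Revenue Growth Rate": 9.6},
]

# flatten all tables into one metric -> value mapping, then build the
# two-column Name/Current DataFrame the question asks for
flat = {k: v for table in sub_formatted for k, v in table.items()}
df = pd.DataFrame(list(flat.items()), columns=["Name", "Current"])
print(df)
```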