Web scrape to obtain table data from GuruFocus site

Question

I want to scrape specific data from the GuruFocus website.
https://www.gurufocus.com/stock/AAHTF/summary?search=AAPICO

Currently I am fetching only the number value. For example: the Financial Strength value is "4" out of 10. Now I want to fetch the sub-component data as well.

Code to fetch only the number value:

    import json
    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager

    # `names` is a list of company names to look up
    for name in names:
        start_time = time.time()
        # getting the symbol: search the site for the name
        URL = f'https://www.gurufocus.com/search?s={name}'
        driver = webdriver.Chrome(ChromeDriverManager().install())
        driver.get(URL)
        html_source = driver.page_source
        driver.close()
        soup = BeautifulSoup(html_source, 'html.parser')
        headers = soup.find_all("span")
        # saving only the first link
        link = None
        for head in headers:
            try:
                h = head.find("a").get("href")
                link = "https://www.gurufocus.com" + h
                break
            except AttributeError:  # span without a nested <a>
                pass
        try:
            # loading the link page
            driver = webdriver.Chrome(ChromeDriverManager().install())
            driver.get(link)
            html_source = driver.page_source
            driver.close()
            soup = BeautifulSoup(html_source, 'html.parser')
            # the five rank values share this class on the summary page
            headers = soup.find_all("span", class_="t-default bold")
            ratings = []
            for head in headers:
                ratings.append(int(head.get_text()))
            if len(ratings) == 0:
                continue
            ratings_dict = {
                "Financial Strength": ratings[0],
                "Growth Rank": ratings[1],
                "Momentum Rank": ratings[2],
                "Profitability Rank": ratings[3],
                "GF Value Rank": ratings[4],
            }
            print(ratings_dict)
            with open(f"output/gurufocus/{name}.json", 'w') as f:
                json.dump(str(ratings_dict), f)
            end_time = time.time()
            print("time taken for %s is: %.2f" % (name, (end_time - start_time)))
        except Exception:
            print("no data found")

Output:

  1. "{'Financial Strength': 6, 'Growth Rank': 4, 'Momentum Rank': 4, 'Profitability Rank': 7, 'GF Value Rank': 5}"

Expectation:
I want to fetch the full table data (shown in the image below), along with the rank, into a data frame.
[image: table of Financial Strength sub-components on the GuruFocus summary page]

How should I change my code to obtain the other specific data?

Answer 1

Score: 4


You can use Pandas to write a clean solution for this problem:

    import json
    from collections import ChainMap

    import pandas as pd
    import requests

    # read every HTML table on the page into a list of DataFrames
    tables = pd.read_html(
        requests.get(
            'https://www.gurufocus.com/stock/AAHTF/summary?search=AAPICO'
        ).text,
        header=0
    )

    # for each table, keep only the "Name" and "Current" columns as {Name: Current} pairs
    sub_table_values = [
        [{record["Name"]: record["Current"]} for record in json.loads(e)]
        for e in [table.to_json(orient="records") for table in tables]
    ]
    # merge each table's single-entry dicts into one dict per table
    sub_formatted = [dict(ChainMap(*a)) for a in sub_table_values]
    print(json.dumps(sub_formatted, indent=4))

Description:

  • First, I obtain all the tables and convert them to DataFrames (using pandas).
  • Then, I convert the DataFrames to JSON and extract only the desired fields (Name and Current).
  • Finally, I format the data.
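
As an illustration of the middle step, here is what to_json(orient="records") produces for a toy table (the values are taken from the output below):

    import json
    import pandas as pd

    # a toy stand-in for one of the scraped tables
    df = pd.DataFrame({
        "Name": ["Cash-To-Debt", "Debt-to-Equity"],
        "Current": ["0.1", "0.85"],
    })
    records = json.loads(df.to_json(orient="records"))
    # records == [{'Name': 'Cash-To-Debt', 'Current': '0.1'},
    #             {'Name': 'Debt-to-Equity', 'Current': '0.85'}]
    pairs = [{r["Name"]: r["Current"]} for r in records]
    # pairs == [{'Cash-To-Debt': '0.1'}, {'Debt-to-Equity': '0.85'}]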

It would return:

    [
        {
            "WACC vs ROIC": null,
            "Beneish M-Score": "-2.28",
            "Altman Z-Score": "1.98",
            "Piotroski F-Score": "7/9",
            "Interest Coverage": "4.68",
            "Debt-to-EBITDA": "2.55",
            "Debt-to-Equity": "0.85",
            "Equity-to-Asset": "0.37",
            "Cash-To-Debt": "0.1"
        },
        {
            "Future 3-5Y Total Revenue Growth Rate": 13.71,
            "3-Year Book Growth Rate": 2.8,
            "3-Year FCF Growth Rate": 49.9,
            "3-Year EPS without NRI Growth Rate": -5.2,
            "3-Year EBITDA Growth Rate": 9.0,
            "3-Year Revenue Growth Rate": 9.6
        },
        ...
    ]
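
If, as the question asks, you want these values in a DataFrame rather than printed JSON, here is a minimal sketch reusing the sub_formatted list from above (note that a metric name appearing in more than one table would be collapsed to a single row):

    import pandas as pd

    # flatten the per-table dicts into one {metric: current value} mapping
    flat = {name: value for table in sub_formatted for name, value in table.items()}
    df = pd.DataFrame(list(flat.items()), columns=["Name", "Current"])
    print(df)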

However, this solution works because the page is structured with tables. For complex or irregular websites I prefer to use Scrapy, as we do at my job.
