Web scrape to obtain table data from GuruFocus site

Question:
I want to scrape specific data from the GuruFocus website:
https://www.gurufocus.com/stock/AAHTF/summary?search=AAPICO
Currently I am fetching only the headline number values. For example, the Financial Strength value is "4" out of 10. Now I want to fetch the sub-component data as well.
Code that fetches only the number values:
import time
import json

from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# "names" is a list of search terms defined earlier
for name in names:
    start_time = time.time()
    # getting the symbol
    URL = f'https://www.gurufocus.com/search?s={name}'
    driver = webdriver.Chrome(ChromeDriverManager().install())
    driver.get(URL)
    html_source = driver.page_source
    driver.close()
    soup = BeautifulSoup(html_source, 'html.parser')
    headers = soup.find_all("span")
    # saving only the first link
    for head in headers:
        try:
            h = head.find("a").get("href")
            link = "https://www.gurufocus.com" + h
            break
        except AttributeError:
            pass
    try:
        # loading the link page
        driver = webdriver.Chrome(ChromeDriverManager().install())
        driver.get(link)
        html_source = driver.page_source
        driver.close()
        soup = BeautifulSoup(html_source, 'html.parser')
        headers = soup.find_all("span", class_="t-default bold")
        ratings = [int(head.get_text()) for head in headers]
        if len(ratings) == 0:
            continue
        ratings_dict = {
            "Financial Strength": ratings[0],
            "Growth Rank": ratings[1],
            "Momentum Rank": ratings[2],
            "Profitability Rank": ratings[3],
            "GF Value Rank": ratings[4],
        }
        print(ratings_dict)
        # dump the dict itself so the file contains real JSON,
        # not a quoted string
        with open(f"output/gurufocus/{name}.json", 'w') as f:
            json.dump(ratings_dict, f)
        end_time = time.time()
        print("time taken for %s is: %.2f" % (name, end_time - start_time))
    except Exception:
        print("no data found")
Output:
{'Financial Strength': 6, 'Growth Rank': 4, 'Momentum Rank': 4, 'Profitability Rank': 7, 'GF Value Rank': 5}
Expectation:
I want to fetch the full table data (shown in the image below), along with the ranks, into a DataFrame.
How do I need to change my code to obtain the other specific data?
Answer 1
Score: 4
You can use Pandas to write a clean solution for this problem:
import pandas as pd
import requests
import json
from collections import ChainMap

# read_html parses every <table> on the page into its own DataFrame
tables = pd.read_html(
    requests.get(
        'https://www.gurufocus.com/stock/AAHTF/summary?search=AAPICO'
    ).text,
    header=0
)

# for each table, keep only the "Name" and "Current" columns
sub_table_values = [
    [{record["Name"]: record["Current"]} for record in json.loads(e)]
    for e in [t.to_json(orient="records") for t in tables]
]
# merge each table's one-pair dicts into a single dict per table
sub_formatted = [dict(ChainMap(*a)) for a in sub_table_values]
print(json.dumps(sub_formatted, indent=4))
Description:
- First, I obtain all the tables and convert them to DataFrames (using pandas).
- Then, I convert the DataFrames to JSON and extract only the desired fields (Name and Current).
- Finally, I format the data.
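The Name/Current extraction and the ChainMap merge from the steps above can be sketched on a toy record set (hypothetical values, no network access needed):

```python
from collections import ChainMap

# toy stand-in for one parsed table's records, assuming the page's
# tables expose "Name" and "Current" columns as on GuruFocus
records = [
    {"Name": "Cash-To-Debt", "Current": "0.1"},
    {"Name": "Debt-to-Equity", "Current": "0.85"},
]

# one single-pair dict per row, then merged into one dict per table
pairs = [{r["Name"]: r["Current"]} for r in records]
merged = dict(ChainMap(*pairs))
# merged == {'Cash-To-Debt': '0.1', 'Debt-to-Equity': '0.85'}
```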
It would return:
[
{
"WACC vs ROIC": null,
"Beneish M-Score": "-2.28",
"Altman Z-Score": "1.98",
"Piotroski F-Score": "7/9",
"Interest Coverage": "4.68",
"Debt-to-EBITDA": "2.55",
"Debt-to-Equity": "0.85",
"Equity-to-Asset": "0.37",
"Cash-To-Debt": "0.1"
},
{
"Future 3-5Y Total Revenue Growth Rate": 13.71,
"3-Year Book Growth Rate": 2.8,
"3-Year FCF Growth Rate": 49.9,
"3-Year EPS without NRI Growth Rate": -5.2,
"3-Year EBITDA Growth Rate": 9.0,
"3-Year Revenue Growth Rate": 9.6
}...
]
However, this solution works only because the page is structured with tables. For complex or irregular websites I prefer to use Scrapy, as we do at my job.
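Since the question asks for the result in a DataFrame, the per-table dicts can be flattened and loaded directly. A minimal sketch, with hypothetical sample values standing in for the scraped `sub_formatted` list:

```python
import pandas as pd

# hypothetical sample standing in for the scraped sub_formatted list
sub_formatted = [
    {"Cash-To-Debt": "0.1", "Debt-to-Equity": "0.85"},
    {"3-Year Revenue Growth Rate": 9.6},
]

# flatten all tables into one metric -> value mapping, then build the
# two-column Name/Current DataFrame the question asks for
flat = {k: v for table in sub_formatted for k, v in table.items()}
df = pd.DataFrame(list(flat.items()), columns=["Name", "Current"])
print(df)
```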