英文:
How to parse a specific part of html table data using pandas
问题
我一直在学习如何使用Pandas来抓取网页,但我遇到了一个问题,无法提取
这是Pandas解析的HTML代码:
<tr data-country="Bulgaria">
<td><i aria-hidden="true" class="circle-country-flags-22 flags-22-bulgaria display-inline-block"></i>
<a title="Bulgaria Economic Calendar" href="https://www.myfxbook.com/forex-economic-calendar/bulgaria">Bulgaria</a></td>
<td>BNB</td>
<td> <a title="Bulgaria Interest Rates" href="https://www.myfxbook.com/forex-economic-calendar/bulgaria/interest-rate-decision">Bulgarian National Bank</a> </td>
<td class="green"> 2.17% </td>
<td>1.82%</td>
<td> 35bp </td>
<td data-custom-date="2023-04-28 00:00:00.0">Apr 28, 2023</td>
<td data-custom-date="2023-05-29 10:00:00.0">1 day</td>
</tr>
这是我的响应数组:
{
'Central Bank': 'Bulgarian National Bank',
'Change': '35bp',
'Country': 'Bulgaria',
'Current Rate': '2.17%',
'Last Meeting': 'Apr 28, 2023',
'Next Meeting': '1 day',
'Previous Rate': '1.82%',
'Unnamed: 1': 'BNB'
}
我特别关注的是这一行:<td data-custom-date="2023-05-29 10:00:00.0">1 day</td>
,我想将其中的"2023-05-29 10:00:00.0"解析到响应中,而不是"1 day"。
这是我迄今为止创建的代码:
import pandas as pd
import requests
import pprint
from datetime import datetime, timedelta
url = "https://www.myfxbook.com/forex-economic-calendar/interest-rates"
r = requests.get(url)
tables = pd.read_html(r.text) # this parses all the tables in webpages to a list
# Extract the first table from the list of parsed tables
parsed_table = tables[0]
# Convert DataFrame to list of dictionaries
list_of_dicts = parsed_table.to_dict(orient='records')
# Print the list of dictionaries
data = []
for row in list_of_dicts:
data.append(row)
pp = pprint.PrettyPrinter(depth=4)
pp.pprint(data)
我一直在搜索互联网,但迄今为止还没有找到解决方案,希望能在这个问题上获得帮助。
英文:
I have been learning how to scrape a web page using Pandas and I have hit a bit of a wall where I cant extract a specific piece of data that inside the <td> itself.
Here is the html which is being parsed by Pandas:
<tr data-country="Bulgaria">
<td><i aria-hidden="true" class="
circle-country-flags-22 flags-22-bulgaria display-inline-block"></i>
<a title="Bulgaria Economic Calendar" href="https://www.myfxbook.com/forex-economic-
calendar/bulgaria">Bulgaria</a></td>
<td>BNB</td>
<td> <a title="Bulgaria Interest Rates" href="https://www.myfxbook.com/forex-economic-
calendar/bulgaria/interest-rate-decision">Bulgarian National Bank</a> </td>
<td class="green"> 2.17% </td>
<td>1.82%</td>
<td> 35bp </td>
<td data-custom-date="2023-04-28 00:00:00.0">Apr 28, 2023</td>
<td data-custom-date="2023-05-29 10:00:00.0">1 day</td>
</tr>
And here is what my response array looks like:
{'Central Bank': 'Bulgarian National Bank',
'Change': '35bp',
'Country': 'Bulgaria',
'Current Rate': '2.17%',
'Last Meeting': 'Apr 28, 2023',
'Next Meeting': '1 day',
'Previous Rate': '1.82%',
'Unnamed: 1': 'BNB'}
This is the line I am specifically looking at "<td data-custom-date="2023-05-29 10:00:00.0">1 day</td>"
As I am trying to parse this "2023-05-29 10:00:00.0" into the response instead of "1 day"
Here is the code I have created for this so far:
import pandas as pd
import requests
import pprint
from datetime import datetime, timedelta
url = "https://www.myfxbook.com/forex-economic-calendar/interest-rates"
r = requests.get(url)
tables = pd.read_html(r.text) # this parses all the tables in webpages to a list
# Extract the first table from the list of parsed tables
parsed_table = tables[0]
# Convert DataFrame to list of dictionaries
list_of_dicts = parsed_table.to_dict(orient='records')
# Print the list of dictionaries
data = []
for row in list_of_dicts:
data.append(row)
pp = pprint.PrettyPrinter(depth=4)
pp.pprint(data)
I have been scouring the interwebs but have not been able to find a solution so far as to how I do this so any help would be appreciated on this one.
答案1
得分: 1
简单的解决方案是使用HTML解析器(如beautifulsoup
)并替换<td>
标签的文本。然后使用pd.read_html
获取数据框:
import pprint
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.myfxbook.com/forex-economic-calendar/interest-rates"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# 选择所有具有data-custom-date属性的标签
for tag in soup.select('[data-custom-date]'):
# 用此属性的值替换这些标签的文本
tag.string.replace_with(tag['data-custom-date'])
parsed_table = pd.read_html(str(soup))[0]
data = parsed_table.to_dict(orient="records")
pp = pprint.PrettyPrinter(depth=4)
pp.pprint(data)
打印:
[{'Central Bank': 'Bulgarian National Bank',
'Change': '35bp',
'Country': 'Bulgaria',
'Current Rate': '2.17%',
'Last Meeting': '2023-04-28 00:00:00.0',
'Next Meeting': '2023-05-29 10:00:00.0',
'Previous Rate': '1.82%',
'Unnamed: 1': 'BNB'},
{'Central Bank': 'Central Bank of Kenya',
'Change': '75bp',
'Country': 'Kenya',
'Current Rate': '9.5%',
'Last Meeting': '2023-03-29 00:00:00.0',
'Next Meeting': '2023-05-29 13:30:00.0',
'Previous Rate': '8.75%',
'Unnamed: 1': 'CBK'},
{'Central Bank': 'National Bank of the Kyrgyz Republic',
'Change': '0bp',
'Country': 'Kyrgyzstan',
'Current Rate': '13.0%',
...依此类推。
英文:
Easy solution would be using a HTML parser (such as beautifulsoup
) and replace the text of <td>
tags. Then use pd.read_html
to get the dataframe:
import pprint
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.myfxbook.com/forex-economic-calendar/interest-rates"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# select all tags with data-custom-date= attribute
for tag in soup.select('[data-custom-date]'):
# replace the text of these tags with value of this attribute
tag.string.replace_with(tag['data-custom-date'])
parsed_table = pd.read_html(str(soup))[0]
data = parsed_table.to_dict(orient="records")
pp = pprint.PrettyPrinter(depth=4)
pp.pprint(data)
Prints:
[{'Central Bank': 'Bulgarian National Bank',
'Change': '35bp',
'Country': 'Bulgaria',
'Current Rate': '2.17%',
'Last Meeting': '2023-04-28 00:00:00.0',
'Next Meeting': '2023-05-29 10:00:00.0',
'Previous Rate': '1.82%',
'Unnamed: 1': 'BNB'},
{'Central Bank': 'Central Bank of Kenya',
'Change': '75bp',
'Country': 'Kenya',
'Current Rate': '9.5%',
'Last Meeting': '2023-03-29 00:00:00.0',
'Next Meeting': '2023-05-29 13:30:00.0',
'Previous Rate': '8.75%',
'Unnamed: 1': 'CBK'},
{'Central Bank': 'National Bank of the Kyrgyz Republic',
'Change': '0bp',
'Country': 'Kyrgyzstan',
'Current Rate': '13.0%',
...and so on.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
- html-parsing
- pandas
- python
- web-scraping
评论