How to parse a specific part of HTML table data using pandas

Question

I have been learning how to scrape a web page using pandas, and I have hit a wall: I can't extract a specific piece of data that sits inside the <td> tag itself, as an attribute rather than as the cell text.

Here is the HTML that pandas is parsing:

<tr data-country="Bulgaria">
    <td><i aria-hidden="true" class="circle-country-flags-22 flags-22-bulgaria display-inline-block"></i>
    <a title="Bulgaria Economic Calendar" href="https://www.myfxbook.com/forex-economic-calendar/bulgaria">Bulgaria</a></td>
    <td>BNB</td>
    <td> <a title="Bulgaria Interest Rates" href="https://www.myfxbook.com/forex-economic-calendar/bulgaria/interest-rate-decision">Bulgarian National Bank</a> </td>
    <td class="green"> 2.17% </td>
    <td>1.82%</td>
    <td> 35bp </td>
    <td data-custom-date="2023-04-28 00:00:00.0">Apr 28, 2023</td>
    <td data-custom-date="2023-05-29 10:00:00.0">1 day</td>
</tr>

And here is what my response array looks like:

{'Central Bank': 'Bulgarian National Bank',
  'Change': '35bp',
  'Country': 'Bulgaria',
  'Current Rate': '2.17%',
  'Last Meeting': 'Apr 28, 2023',
  'Next Meeting': '1 day',
  'Previous Rate': '1.82%',
  'Unnamed: 1': 'BNB'}

This is the line I am specifically looking at:

<td data-custom-date="2023-05-29 10:00:00.0">1 day</td>

I want the response to contain the attribute value "2023-05-29 10:00:00.0" instead of the cell text "1 day".

Here is the code I have written so far:

import pandas as pd
import requests
import pprint
from datetime import datetime, timedelta

url = "https://www.myfxbook.com/forex-economic-calendar/interest-rates"

r = requests.get(url)
tables = pd.read_html(r.text) # this parses all the tables in webpages to a list
# Extract the first table from the list of parsed tables
parsed_table = tables[0]

# Convert DataFrame to list of dictionaries
list_of_dicts = parsed_table.to_dict(orient='records')

# Print the list of dictionaries

data = []

for row in list_of_dicts:
    data.append(row)

pp = pprint.PrettyPrinter(depth=4)
pp.pprint(data)

I have searched online but have not found a way to do this, so any help would be appreciated.

Answer 1

Score: 1

An easy solution is to use an HTML parser (such as BeautifulSoup) to replace the text of each <td> tag that has a data-custom-date attribute with the value of that attribute, and then use pd.read_html to get the DataFrame:

import pprint
from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.myfxbook.com/forex-economic-calendar/interest-rates"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# select all tags with a data-custom-date attribute
for tag in soup.select('[data-custom-date]'):
    # replace the text of these tags with the value of this attribute
    tag.string.replace_with(tag['data-custom-date'])

# wrap the markup in StringIO: newer pandas versions warn when
# read_html is given a literal HTML string
parsed_table = pd.read_html(StringIO(str(soup)))[0]
data = parsed_table.to_dict(orient="records")

pp = pprint.PrettyPrinter(depth=4)
pp.pprint(data)

Prints:

[{'Central Bank': 'Bulgarian National Bank',
  'Change': '35bp',
  'Country': 'Bulgaria',
  'Current Rate': '2.17%',
  'Last Meeting': '2023-04-28 00:00:00.0',
  'Next Meeting': '2023-05-29 10:00:00.0',
  'Previous Rate': '1.82%',
  'Unnamed: 1': 'BNB'},
 {'Central Bank': 'Central Bank of Kenya',
  'Change': '75bp',
  'Country': 'Kenya',
  'Current Rate': '9.5%',
  'Last Meeting': '2023-03-29 00:00:00.0',
  'Next Meeting': '2023-05-29 13:30:00.0',
  'Previous Rate': '8.75%',
  'Unnamed: 1': 'CBK'},
 {'Central Bank': 'National Bank of the Kyrgyz Republic',
  'Change': '0bp',
  'Country': 'Kyrgyzstan',
  'Current Rate': '13.0%',

...and so on.
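
As a variation on the approach above, the rows can also be built directly with BeautifulSoup, taking the data-custom-date attribute whenever a cell has one and falling back to the cell text otherwise. The following is only a minimal sketch under stated assumptions: the HTML string is a trimmed, hypothetical table modelled on the row from the question (the real page has more columns and may have header cells without text), and the helper name rows_with_custom_dates is made up for illustration. It also shows one way to turn the extracted strings into real timestamps with pd.to_datetime.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample markup, trimmed from the row shown in the question.
HTML = """
<table>
  <thead>
    <tr><th>Country</th><th>Current Rate</th><th>Last Meeting</th><th>Next Meeting</th></tr>
  </thead>
  <tbody>
    <tr data-country="Bulgaria">
      <td><a href="#">Bulgaria</a></td>
      <td class="green"> 2.17% </td>
      <td data-custom-date="2023-04-28 00:00:00.0">Apr 28, 2023</td>
      <td data-custom-date="2023-05-29 10:00:00.0">1 day</td>
    </tr>
  </tbody>
</table>
"""


def rows_with_custom_dates(html):
    """Build a DataFrame, preferring data-custom-date over the cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    records = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:  # the header row contains only <th> cells
            continue
        # Use the attribute when present, otherwise the stripped cell text.
        values = [td.get("data-custom-date", td.get_text(strip=True)) for td in cells]
        records.append(dict(zip(headers, values)))
    return pd.DataFrame(records)


df = rows_with_custom_dates(HTML)

# Convert the attribute strings into pandas timestamps.
for col in ("Last Meeting", "Next Meeting"):
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d %H:%M:%S.%f")

print(df.to_dict(orient="records"))
```

This keeps the original cell text out of the result entirely, so nothing has to be overwritten in the parsed tree. On the real page, HTML would be replaced by the response body from requests, and the column names would need to match the page's actual headers.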

huangapple
  • Posted on May 28, 2023 at 06:58:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76349343.html