May 28, 2023 06:58:38

How to parse a specific part of html table data using pandas

Question

I have been learning how to scrape a web page using pandas and have hit a wall: I can't extract a specific piece of data that is stored as an attribute on the <td> tag itself.

Here is the HTML being parsed by pandas:

<tr data-country="Bulgaria">
    <td><i aria-hidden="true" class="circle-country-flags-22 flags-22-bulgaria display-inline-block"></i>
    <a title="Bulgaria Economic Calendar" href="https://www.myfxbook.com/forex-economic-calendar/bulgaria">Bulgaria</a></td>
    <td>BNB</td>
    <td> <a title="Bulgaria Interest Rates" href="https://www.myfxbook.com/forex-economic-calendar/bulgaria/interest-rate-decision">Bulgarian National Bank</a> </td>
    <td class="green"> 2.17% </td>
    <td>1.82%</td>
    <td> 35bp </td>
    <td data-custom-date="2023-04-28 00:00:00.0">Apr 28, 2023</td>
    <td data-custom-date="2023-05-29 10:00:00.0">1 day</td>
</tr>

And here is what my response array looks like:

{'Central Bank': 'Bulgarian National Bank',
 'Change': '35bp',
 'Country': 'Bulgaria',
 'Current Rate': '2.17%',
 'Last Meeting': 'Apr 28, 2023',
 'Next Meeting': '1 day',
 'Previous Rate': '1.82%',
 'Unnamed: 1': 'BNB'}

This is the line I am specifically looking at: <td data-custom-date="2023-05-29 10:00:00.0">1 day</td>

I want "2023-05-29 10:00:00.0" (the attribute value) to end up in the response instead of "1 day" (the cell text).

Here is the code I have created for this so far:

import pandas as pd
import requests
import pprint
from datetime import datetime, timedelta

url = "https://www.myfxbook.com/forex-economic-calendar/interest-rates"
r = requests.get(url)
tables = pd.read_html(r.text)  # this parses all the tables in the web page to a list

# Extract the first table from the list of parsed tables
parsed_table = tables[0]

# Convert DataFrame to list of dictionaries
list_of_dicts = parsed_table.to_dict(orient='records')

# Print the list of dictionaries
data = []
for row in list_of_dicts:
    data.append(row)
pp = pprint.PrettyPrinter(depth=4)
pp.pprint(data)

I have been scouring the internet but have not found a way to do this so far, so any help would be appreciated.

Answer 1

Score: 1

An easy solution is to use an HTML parser (such as BeautifulSoup) to replace the text of the <td> tags with the attribute value, and then use pd.read_html to get the DataFrame:

import pprint
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.myfxbook.com/forex-economic-calendar/interest-rates"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# select all tags that carry a data-custom-date attribute
for tag in soup.select("[data-custom-date]"):
    # replace the tag's text with the attribute value; assigning to .string
    # also works when the tag has no single text child (tag.string is None)
    tag.string = tag["data-custom-date"]

# wrap the markup in StringIO: newer pandas versions deprecate passing literal HTML
parsed_table = pd.read_html(StringIO(str(soup)))[0]
data = parsed_table.to_dict(orient="records")

pp = pprint.PrettyPrinter(depth=4)
pp.pprint(data)

Prints:

[{'Central Bank': 'Bulgarian National Bank',
  'Change': '35bp',
  'Country': 'Bulgaria',
  'Current Rate': '2.17%',
  'Last Meeting': '2023-04-28 00:00:00.0',
  'Next Meeting': '2023-05-29 10:00:00.0',
  'Previous Rate': '1.82%',
  'Unnamed: 1': 'BNB'},
 {'Central Bank': 'Central Bank of Kenya',
  'Change': '75bp',
  'Country': 'Kenya',
  'Current Rate': '9.5%',
  'Last Meeting': '2023-03-29 00:00:00.0',
  'Next Meeting': '2023-05-29 13:30:00.0',
  'Previous Rate': '8.75%',
  'Unnamed: 1': 'CBK'},
 {'Central Bank': 'National Bank of the Kyrgyz Republic',
  'Change': '0bp',
  'Country': 'Kyrgyzstan',
  'Current Rate': '13.0%',
...and so on.
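
If you would rather not rewrite the markup before handing it to pandas, a possible variation is to walk the rows with BeautifulSoup yourself and prefer the attribute over the cell text. The sketch below is only an illustration under a few assumptions: it runs on a trimmed copy of the row from the question instead of the live page, it assumes the table has <thead> and <tbody> sections (the real page may not), and table_to_frame is a hypothetical helper name, not part of pandas or BeautifulSoup. It also converts the two date columns to real timestamps with pd.to_datetime.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Sample markup modelled on the row in the question (trimmed; the
# <thead>/<tbody> layout is an assumption about the real page).
HTML = """
<table>
  <thead>
    <tr><th>Country</th><th>Current Rate</th><th>Last Meeting</th><th>Next Meeting</th></tr>
  </thead>
  <tbody>
    <tr data-country="Bulgaria">
      <td><a href="#">Bulgaria</a></td>
      <td class="green"> 2.17% </td>
      <td data-custom-date="2023-04-28 00:00:00.0">Apr 28, 2023</td>
      <td data-custom-date="2023-05-29 10:00:00.0">1 day</td>
    </tr>
  </tbody>
</table>
"""


def table_to_frame(html: str) -> pd.DataFrame:
    """Hypothetical helper: build a DataFrame from the first table,
    preferring data-custom-date over the visible cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    headers = [th.get_text(strip=True) for th in table.select("thead th")]
    rows = []
    for tr in table.select("tbody tr"):
        # Tag.get() returns None when the attribute is absent,
        # so those cells fall back to their stripped text.
        cells = [
            td.get("data-custom-date") or td.get_text(strip=True)
            for td in tr.find_all("td")
        ]
        rows.append(dict(zip(headers, cells)))
    return pd.DataFrame(rows)


df = table_to_frame(HTML)

# Turn the attribute strings into real timestamps.
for column in ("Last Meeting", "Next Meeting"):
    df[column] = pd.to_datetime(df[column])

records = df.to_dict(orient="records")
print(records)
```

This keeps the original HTML untouched and gives you Timestamp objects you can compare or sort, at the cost of handling the header row yourself instead of letting pd.read_html do it.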
