How to parse a specific part of HTML table data using pandas

Question

I have been learning how to scrape a web page using pandas, and I have hit a wall: I can't extract a specific piece of data that is stored in an attribute of the <td> tag itself rather than in its text.

Here is the HTML being parsed by pandas:

    <tr data-country="Bulgaria">
    <td><i aria-hidden="true" class="circle-country-flags-22 flags-22-bulgaria display-inline-block"></i>
    <a title="Bulgaria Economic Calendar" href="https://www.myfxbook.com/forex-economic-calendar/bulgaria">Bulgaria</a></td>
    <td>BNB</td>
    <td> <a title="Bulgaria Interest Rates" href="https://www.myfxbook.com/forex-economic-calendar/bulgaria/interest-rate-decision">Bulgarian National Bank</a> </td>
    <td class="green"> 2.17% </td>
    <td>1.82%</td>
    <td> 35bp </td>
    <td data-custom-date="2023-04-28 00:00:00.0">Apr 28, 2023</td>
    <td data-custom-date="2023-05-29 10:00:00.0">1 day</td>
    </tr>

And here is what my response array looks like:

    {'Central Bank': 'Bulgarian National Bank',
     'Change': '35bp',
     'Country': 'Bulgaria',
     'Current Rate': '2.17%',
     'Last Meeting': 'Apr 28, 2023',
     'Next Meeting': '1 day',
     'Previous Rate': '1.82%',
     'Unnamed: 1': 'BNB'}

This is the line I am specifically interested in:

    <td data-custom-date="2023-05-29 10:00:00.0">1 day</td>

I want the attribute value "2023-05-29 10:00:00.0" to appear in the response instead of the cell text "1 day".

Here is the code I have created for this so far:

    import pandas as pd
    import requests
    import pprint
    from datetime import datetime, timedelta

    url = "https://www.myfxbook.com/forex-economic-calendar/interest-rates"
    r = requests.get(url)
    tables = pd.read_html(r.text)  # this parses all the tables in the web page into a list
    # Extract the first table from the list of parsed tables
    parsed_table = tables[0]
    # Convert DataFrame to list of dictionaries
    list_of_dicts = parsed_table.to_dict(orient='records')
    # Print the list of dictionaries
    data = []
    for row in list_of_dicts:
        data.append(row)
    pp = pprint.PrettyPrinter(depth=4)
    pp.pprint(data)

I have been searching the web but have not found a way to do this so far, so any help would be appreciated.

Answer 1

Score: 1

An easy solution is to use an HTML parser (such as BeautifulSoup) to replace the text of each <td> tag that has a data-custom-date attribute with that attribute's value, and then use pd.read_html to get the DataFrame:

    import pprint
    from io import StringIO

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    url = "https://www.myfxbook.com/forex-economic-calendar/interest-rates"
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    # Select all tags that have a data-custom-date attribute
    for tag in soup.select("[data-custom-date]"):
        # Replace the tag's contents with the value of this attribute
        tag.string = tag["data-custom-date"]

    # Wrap the HTML in StringIO: newer pandas versions deprecate literal HTML strings
    parsed_table = pd.read_html(StringIO(str(soup)))[0]
    data = parsed_table.to_dict(orient="records")

    pp = pprint.PrettyPrinter(depth=4)
    pp.pprint(data)

Prints:

    [{'Central Bank': 'Bulgarian National Bank',
      'Change': '35bp',
      'Country': 'Bulgaria',
      'Current Rate': '2.17%',
      'Last Meeting': '2023-04-28 00:00:00.0',
      'Next Meeting': '2023-05-29 10:00:00.0',
      'Previous Rate': '1.82%',
      'Unnamed: 1': 'BNB'},
     {'Central Bank': 'Central Bank of Kenya',
      'Change': '75bp',
      'Country': 'Kenya',
      'Current Rate': '9.5%',
      'Last Meeting': '2023-03-29 00:00:00.0',
      'Next Meeting': '2023-05-29 13:30:00.0',
      'Previous Rate': '8.75%',
      'Unnamed: 1': 'CBK'},
     {'Central Bank': 'National Bank of the Kyrgyz Republic',
      'Change': '0bp',
      'Country': 'Kyrgyzstan',
      'Current Rate': '13.0%',
    ...and so on.
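
To try the idea offline, the following sketch applies it to a trimmed copy of the Bulgaria row from the question, using only Python's standard library instead of BeautifulSoup and pandas. The `DateAttrTableParser` class, the `SAMPLE_HTML` snippet, and its header row are illustrative assumptions, not part of the original answer. It keeps the `data-custom-date` value whenever a cell has one, then turns those strings into `datetime` objects. The page does not state a time zone, so the values are left naive.

```python
from datetime import datetime
from html.parser import HTMLParser

# Trimmed copy of the row from the question. The header row is an assumption,
# added only so the cells can be given column names.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Last Meeting</th><th>Next Meeting</th></tr>
  <tr data-country="Bulgaria">
    <td>Bulgaria</td>
    <td data-custom-date="2023-04-28 00:00:00.0">Apr 28, 2023</td>
    <td data-custom-date="2023-05-29 10:00:00.0">1 day</td>
  </tr>
</table>
"""

DATE_FORMAT = "%Y-%m-%d %H:%M:%S.%f"  # %f also matches the single digit after the dot


class DateAttrTableParser(HTMLParser):
    """Collect table rows, preferring data-custom-date over the visible cell text."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row currently being read
        self._cell = None       # text fragments of the cell currently being read
        self._override = None   # attribute value that replaces the cell text

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            self._override = dict(attrs).get("data-custom-date")

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = "".join(self._cell).strip()
            self._row.append(self._override or text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = DateAttrTableParser()
parser.feed(SAMPLE_HTML)

# First row holds the column names, the rest are data rows
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]

# Convert the extracted attribute strings into datetime objects
for record in records:
    for column in ("Last Meeting", "Next Meeting"):
        record[column] = datetime.strptime(record[column], DATE_FORMAT)

print(records[0]["Next Meeting"])  # 2023-05-29 10:00:00
```

The resulting list of dictionaries has the same shape as the `data` list above, so it could be passed to `pd.DataFrame(records)` if a DataFrame is needed.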

huangapple
  • Posted on May 28, 2023, 06:58:38
  • When reposting, please keep this article's link: https://go.coder-hub.com/76349343.html