Pandas read_html() with table containing html elements

Question

I have the following HTML table:

    <table>
      <thead>
        <th> X1 </th>
        <th> X2 </th>
      </thead>
      <tbody>
        <tr>
          <td>Test</td>
          <td><span style="..."> Test2 </span> </td>
        </tr>
      </tbody>
    </table>

I want to parse this table into a DataFrame using pd.read_html(), which currently gives the following output:

X1 X2
Test Test2

However, I would prefer the following output (preserving HTML elements within a cell):

X1 X2
Test <span style="..."> Test2 </span>

Is this possible with pd.read_html()?

I couldn't find a solution in the read_html() docs, and the alternative would be manual parsing.
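For reference, the manual-parsing alternative mentioned above could look roughly like the sketch below. It uses only html.parser from the Python standard library. The names CellMarkupParser and parse_table are hypothetical, and the sketch assumes that every <td> sits inside an explicitly closed <tr> and that tables are not nested.

```python
from html.parser import HTMLParser


class CellMarkupParser(HTMLParser):
    """Hypothetical helper: collect table cells, keeping their inner HTML."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.header = []       # inner HTML of each <th>
        self.rows = []         # one list of inner-HTML strings per body row
        self._row = None       # cells of the <tr> currently open, if any
        self._cell = None      # markup fragments of the cell currently open
        self._cell_tag = None  # "td" or "th"

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th") and self._cell is None:
            self._cell, self._cell_tag = [], tag
        elif self._cell is not None:
            # Inside a cell: keep the tag exactly as it appeared in the source.
            self._cell.append(self.get_starttag_text())
        elif tag == "tr":
            self._row = []

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, without an end tag.
        if self._cell is not None:
            self._cell.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._cell is not None:
            if tag == self._cell_tag:
                self._close_cell()
            else:
                self._cell.append(f"</{tag}>")
        elif tag == "tr" and self._row is not None:
            if self._row:  # rows that held only <th> cells stay out of rows
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_entityref(self, name):
        if self._cell is not None:
            self._cell.append(f"&{name};")

    def handle_charref(self, name):
        if self._cell is not None:
            self._cell.append(f"&#{name};")

    def _close_cell(self):
        text = "".join(self._cell).strip()
        if self._cell_tag == "th":
            self.header.append(text)
        elif self._row is not None:
            self._row.append(text)
        self._cell = None
        self._cell_tag = None


def parse_table(html):
    """Return (header, rows) with the inner HTML of every cell preserved."""
    parser = CellMarkupParser()
    parser.feed(html)
    parser.close()
    return parser.header, parser.rows
```

The returned values can then be passed to pd.DataFrame(rows, columns=header) to build the DataFrame.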

Answer 1

Score: 0

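You could monkey-patch the private ._text_getter() method of pandas' lxml-based parser if you really wanted to. Since it is a private method, this may break between pandas versions.

Something like: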

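    from io import StringIO

    import lxml.html
    import pandas as pd

    html = """
    <table>
    <thead>
    <th> X1 </th>
    <th> X2 </th>
    </thead>
    <tbody>
    <tr>
    <td>Test</td>
    <td><span style="..."> Test2 </span> </td>
    </tr>
    </tbody>
    </table>
    """

    def custom_text_getter(self, obj):
        # Look at the cell's child nodes (text nodes and elements).
        nodes = obj.xpath("node()")
        if not nodes:
            return ""  # empty cell
        # Only the first child node is kept.
        result = nodes[0]
        # Serialize an element back to markup instead of extracting its text.
        if isinstance(result, lxml.html.HtmlElement):
            result = lxml.html.tostring(result, encoding="unicode")
        return result

    # Replace the private text getter of pandas' lxml-based parser.
    pd.io.html._LxmlFrameParser._text_getter = custom_text_getter

    # flavor="lxml" makes sure the patched parser is the one being used;
    # StringIO avoids passing literal HTML, which newer pandas versions deprecate.
    print(pd.read_html(StringIO(html), flavor="lxml")[0])

Output:

         X1                                X2
    0  Test  <span style="..."> Test2 </span>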
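
Note that assigning to pd.io.html._LxmlFrameParser._text_getter changes the method for the rest of the process, so every later read_html() call is affected. One way to limit the patch to a single call is unittest.mock.patch.object from the standard library, which restores the original attribute when the with block exits. The sketch below is only an illustration: FrameParser is a hypothetical stand-in class so the example runs without pandas, and raw_getter and parse_with_getter are made-up names.

```python
from unittest import mock


class FrameParser:
    """Hypothetical stand-in for pd.io.html._LxmlFrameParser."""

    def _text_getter(self, obj):
        # Default behavior: return the stripped text of a cell.
        return obj.strip()

    def parse(self, cells):
        return [self._text_getter(cell) for cell in cells]


def raw_getter(self, obj):
    # Replacement getter: return the cell content untouched.
    return obj


def parse_with_getter(cells, getter):
    # The patch is active only inside the with block; the original
    # _text_getter is restored afterwards, even if parse() raises.
    with mock.patch.object(FrameParser, "_text_getter", getter):
        return FrameParser().parse(cells)
```

With pandas, the same pattern would wrap the read_html() call in mock.patch.object(pd.io.html._LxmlFrameParser, "_text_getter", custom_text_getter).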

huangapple
  • Posted on 2023-02-14 00:38:08
  • When reposting, please keep the link to this post: https://go.coder-hub.com/75438753.html