Pandas read_html() with a table containing HTML elements

Question

I have the following HTML table:

<table>
 <thead>
   <th> X1 </th>
   <th> X2 </th>
</thead>
<tbody>
   <tr>
    <td>Test</td>
    <td><span style="..."> Test2 </span> </td>
  </tr>
</tbody>
</table>

I want to parse it into a DataFrame with pd.read_html().
The current output is:

X1 X2
Test Test2

However, I would prefer the following output (preserving HTML elements within a cell):

X1 X2
Test <span style="..."> Test2 </span>

Is this possible with pd.read_html()?

I couldn't find a solution in the read_html() docs, and the alternative would be manual parsing.

Answer 1

Score: 0

You could monkey-patch the lxml parser's private ._text_getter() method if you really wanted to.

Something like:

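from io import StringIO

import lxml.html
import pandas as pd

html = """
<table> 
<thead> 
<th> X1 </th>
<th> X2 </th>
</thead>
<tbody> 
<tr>   
<td>Test</td>   
<td><span style="..."> Test2 </span> </td>
</tr>
</tbody>
</table>
"""

def custom_text_getter(self, obj):
    # Look at the cell's child nodes: each is a text node or an element.
    nodes = obj.xpath("node()")
    if not nodes:
        return ""  # empty cell: nothing to index into
    # Only the first node is kept, as in the original approach.
    result = nodes[0]
    # Serialize an element back to HTML instead of reducing it to its text.
    if isinstance(result, lxml.html.HtmlElement):
        result = lxml.html.tostring(result, encoding="unicode")
    return result

# _text_getter is private pandas API, so this patch may break between versions.
pd.io.html._LxmlFrameParser._text_getter = custom_text_getter

# Newer pandas versions deprecate passing literal HTML, so wrap it in StringIO;
# flavor="lxml" makes sure the patched parser is the one that runs.
print(
    pd.read_html(StringIO(html), flavor="lxml")[0]
)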
     X1                                X2
0  Test  <span style="..."> Test2 </span>

huangapple
  • Posted on February 14, 2023, 00:38:08
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75438753.html