Pandas read_html() with table containing html elements
Question
I have the following HTML table:
<table>
<thead>
<th> X1 </th>
<th> X2 </th>
</thead>
<tbody>
<tr>
<td>Test</td>
<td><span style="..."> Test2 </span> </td>
</tr>
</tbody>
</table>
I want to parse it into a DataFrame using pd.read_html().
The output is as follows:
X1 | X2 |
---|---|
Test | Test2 |
However, I would prefer the following output (preserving HTML elements within a cell):
X1 | X2 |
---|---|
Test | <span style="..."> Test2 </span> |
Is this possible with pd.read_html()?
I couldn't find a solution in the read_html() docs, and the alternative would be manual parsing.
Answer 1
Score: 0
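You could monkey-patch the private ._text_getter() method of pandas' lxml-based parser
if you really wanted to. It is not public API, so it may break between pandas versions.
Something like: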
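from io import StringIO

import lxml.html
import pandas as pd

html = """
<table>
<thead>
<th> X1 </th>
<th> X2 </th>
</thead>
<tbody>
<tr>
<td>Test</td>
<td><span style="..."> Test2 </span> </td>
</tr>
</tbody>
</table>
"""


def custom_text_getter(self, obj):
    # Look at the child nodes of the cell: text nodes and/or elements.
    nodes = obj.xpath("node()")
    if not nodes:
        # Empty cell: nothing to serialize.
        return ""
    # Only the first child node is kept, as in the original idea.
    result = nodes[0]
    if isinstance(result, lxml.html.HtmlElement):
        # Serialize the element back to markup instead of extracting its text.
        result = lxml.html.tostring(result, encoding="unicode")
    return result


# Replace the private text getter of the lxml parser (only affects the lxml flavor).
pd.io.html._LxmlFrameParser._text_getter = custom_text_getter

# Recent pandas versions deprecate passing literal HTML, so wrap it in StringIO.
print(pd.read_html(StringIO(html))[0])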
X1 X2
0 Test <span style="..."> Test2 </span>
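If patching pandas internals feels too fragile, the manual parsing mentioned in the question can be kept fairly small. The following is only a minimal sketch using the standard library's html.parser; the names CellHTMLParser and parse_table_keep_html are made up for illustration, and it assumes a single, non-nested table whose header cells are th elements:

```python
from html.parser import HTMLParser


class CellHTMLParser(HTMLParser):
    """Hypothetical helper: collect table cells, keeping their inner HTML."""

    def __init__(self):
        # Default convert_charrefs=True: entities in text are decoded.
        super().__init__()
        self.header = []       # inner HTML of each <th>
        self.rows = []         # one list of <td> inner HTML per <tr>
        self._cell = None      # fragments of the current cell, or None
        self._cell_tag = None  # "td" or "th" while inside a cell

    def handle_starttag(self, tag, attrs):
        if self._cell is not None:
            # Inside a cell: keep the tag exactly as written in the source.
            self._cell.append(self.get_starttag_text())
        elif tag in ("td", "th"):
            self._cell, self._cell_tag = [], tag
        elif tag == "tr":
            self.rows.append([])

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept once, without an end tag.
        if self._cell is not None:
            self._cell.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._cell is None:
            return
        if tag == self._cell_tag:
            text = "".join(self._cell).strip()
            if tag == "th":
                self.header.append(text)
            else:
                if not self.rows:  # <td> that appeared without a <tr>
                    self.rows.append([])
                self.rows[-1].append(text)
            self._cell = None
        else:
            self._cell.append(f"</{tag}>")

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def parse_table_keep_html(html):
    """Return (header, rows) with the inner HTML of each cell preserved."""
    parser = CellHTMLParser()
    parser.feed(html)
    parser.close()
    # Drop rows that held no <td> cells (for example a header-only <tr>).
    return parser.header, [row for row in parser.rows if row]


html = """
<table>
<thead>
<th> X1 </th>
<th> X2 </th>
</thead>
<tbody>
<tr>
<td>Test</td>
<td><span style="..."> Test2 </span> </td>
</tr>
</tbody>
</table>
"""

header, rows = parse_table_keep_html(html)
print(header)
print(rows)
```

The result can then be handed to pandas with pd.DataFrame(rows, columns=header). Unlike the monkey-patch above, this keeps every child node of a cell rather than only the first one, but it makes no attempt to handle nested tables, colspan or rowspan.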