How do I split cells in a Pandas DataFrame that contain combined data from HTML tables?

Question


Is pandas combining multiple cells into one cell? Is this a problem with the website or with the function?

import pandas as pd

tables = pd.read_html('http://www.ufcstats.com/fight-details/451525a1abe30a91')

This returns four tables, but within those tables the actual data is combined. For example, 39 of 102 26 of 73 should be two cells, but it comes back as one.

I tried extracting the tables individually and then splitting them:

import re

df = tables[0].values.tolist()
print(type(df[0][0]))
print(re.split(r"\s{2,}", df[0][0]))

This achieves nothing: it just returns the cell unchanged, without even an error. I am unsure what is wrong.

Is there a way to iterate over the entire DataFrame and split wherever a double space is present, without changing its structure?

Answer 1

Score: 2


I'm not sure whether pandas can parse nested data/paragraphs inside each <td> tag.

Either way, as a good start, here is one way to parse the first table with BeautifulSoup:

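# pip install beautifulsoup4 html5lib requests
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "http://www.ufcstats.com/fight-details/451525a1abe30a91"
soup = BeautifulSoup(requests.get(url).text, "html5lib")

table = soup.find("table")

headers = [col.text.strip() for col in table.find("thead").find_all("th")]

# Assumption: each fighter's value sits in its own <p> inside the <td>.
data = [
    [p.text.strip() for p in cell.find_all("p")]
    for row in table.find_all("tr") for cell in row.find_all("td")
]

# Transpose so each fighter becomes a row, then label the columns.
df = pd.DataFrame(data).T.set_axis(headers, axis=1)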
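To see the idea without hitting the website, here is a minimal offline sketch. The HTML snippet is a hypothetical two-column sample, not the real page, and it assumes each `<td>` wraps one `<p>` per fighter. It also swaps in Python's built-in `html.parser` so that `html5lib` is not needed.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample mimicking the assumed layout:
# each <td> holds one <p> per fighter.
html = """
<table>
  <thead><tr><th>Fighter</th><th>Sig. str.</th></tr></thead>
  <tbody>
    <tr>
      <td><p>Jessica Andrade</p><p>Angela Hill</p></td>
      <td><p>131 of 322</p><p>89 of 209</p></td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
headers = [th.text.strip() for th in table.find("thead").find_all("th")]

# One inner list per <td>; each inner list has one entry per <p>.
data = [
    [p.text.strip() for p in cell.find_all("p")]
    for row in table.find_all("tr")
    for cell in row.find_all("td")
]

# Each inner list is a column of the final table, hence the transpose.
sample_df = pd.DataFrame(data).T.set_axis(headers, axis=1)
print(sample_df)
```

Because every `<td>` contributes one list, the transpose is what turns "one list per column" into "one row per fighter".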

Output:

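| Fighter         |   KD | Sig. str.   | Sig. str. %   | Total str.   | Td     | Td %   |   Sub. att |   Rev. | Ctrl   |
|:----------------|-----:|:------------|:--------------|:-------------|:-------|:-------|-----------:|-------:|:-------|
| Jessica Andrade |    0 | 131 of 322  | 40%           | 148 of 342   | 2 of 3 | 66%    |          0 |      0 | 1:35   |
| Angela Hill     |    1 | 89 of 209   | 42%           | 99 of 220    | 0 of 0 | ---    |          0 |      0 | 0:06   |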

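With read_html, we get:

| Fighter                     | KD   | Sig. str.             | Sig. str. %   | Total str.            | Td             | Td %     | Sub. att   | Rev.   | Ctrl       |
|:----------------------------|:-----|:----------------------|:--------------|:----------------------|:---------------|:---------|:-----------|:-------|:-----------|
| Jessica Andrade Angela Hill | 0  1 | 131 of 322  89 of 209 | 40%  42%      | 148 of 342  99 of 220 | 2 of 3  0 of 0 | 66%  --- | 0  0       | 0  0   | 1:35  0:06 |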
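If you would rather stay with `read_html`, one possible approach is to split every cell on runs of whitespace and then explode the resulting lists into rows. The sketch below uses a small hand-built DataFrame (a hypothetical stand-in for `tables[0]`, with a few of the columns) and assumes the two fighters' values in each cell are separated by two or more spaces, as the question describes.

```python
import pandas as pd

# Hand-built stand-in for tables[0]; assumes two or more spaces
# separate the two fighters' values inside each cell.
combined = pd.DataFrame({
    "KD": ["0  1"],
    "Sig. str.": ["131 of 322  89 of 209"],
    "Td %": ["66%  ---"],
    "Ctrl": ["1:35  0:06"],
})

# Split each cell on 2+ whitespace characters (a multi-character
# pattern is treated as a regular expression by Series.str.split).
split_cells = combined.apply(lambda col: col.str.split(r"\s{2,}"))

# Explode all columns together so each fighter gets its own row.
per_fighter = split_cells.explode(list(split_cells.columns), ignore_index=True)
print(per_fighter)
```

Exploding several columns at once needs every cell in a row to split into the same number of parts. A cell whose values are separated by only a single space (which may be the case for the fighter names, and would explain why the `re.split` call in the question returned its input unchanged) will not split this way and needs separate handling.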

huangapple
  • Published on May 24, 2023 at 20:52:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76323758.html