How do I split cells in Pandas dataframe that have combined data from HTML tables?
Question
Pandas combining multiple cells into one cell? Website or function problem?
import pandas as pd
tables=pd.read_html('http://www.ufcstats.com/fight-details/451525a1abe30a91')
This returns four tables, but within those tables the actual data is being combined. For example, 39 of 102 26 of 73 should be two cells, but it comes back as one.
I tried to extract the tables individually and then split the cells:
import re
rows = tables[0].values.tolist()
print(type(rows[0][0]))
# raw string so that \s is passed to the regex engine unchanged
print(re.split(r"\s{2,}", rows[0][0]))
This achieves nothing. It just returns the cell unchanged, and there is not even an error, so I'm unsure what is wrong.
Is there a way to iterate over the entire dataframe and split wherever a double space is present, without changing its structure?
Answer 1
Score: 2
I'm not sure if pandas can parse nested data/paragraphs inside each <td> tag.
But anyway, as a good start, here is one way to parse the first table with BeautifulSoup:
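# pip install beautifulsoup4 html5lib requests
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "http://www.ufcstats.com/fight-details/451525a1abe30a91"
soup = BeautifulSoup(requests.get(url).text, "html5lib")
table = soup.find("table")  # first table on the page
headers = [col.text.strip() for col in table.find("thead").find_all("th")]
# Assumption: each <td> holds one <p> per fighter, so collect them as a list per cell.
data = [
    [p.text.strip() for p in cell.find_all("p")]
    for row in table.find_all("tr") for cell in row.find_all("td")
]
# data has one list per column, so transpose to get one row per fighter.
df = pd.DataFrame(data).T.set_axis(headers, axis=1)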
Output:
| Fighter | KD | Sig. str. | Sig. str. % | Total str. | Td | Td % | Sub. att | Rev. | Ctrl |
|---|---|---|---|---|---|---|---|---|---|
| Jessica Andrade | 0 | 131 of 322 | 40% | 148 of 342 | 2 of 3 | 66% | 0 | 0 | 1:35 |
| Angela Hill | 1 | 89 of 209 | 42% | 99 of 220 | 0 of 0 | --- | 0 | 0 | 0:06 |
With read_html, we get:
| Fighter | KD | Sig. str. | Sig. str. % | Total str. | Td | Td % | Sub. att | Rev. | Ctrl |
|---|---|---|---|---|---|---|---|---|---|
| Jessica Andrade Angela Hill | 0 1 | 131 of 322 89 of 209 | 40% 42% | 148 of 342 99 of 220 | 2 of 3 0 of 0 | 66% --- | 0 0 | 0 0 | 1:35 0:06 |
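As for splitting every cell of a dataframe on double spaces without changing the column layout, here is a minimal sketch. It assumes the cells still contain runs of two or more spaces. If the whitespace has already been collapsed to single spaces by the time you see the dataframe (which would explain why the re.split attempt in the question returns the cell unchanged), there is nothing left to split on, and reading the <p> tags directly is the safer route. The `combined` dataframe and the `split_cells` helper are made up for illustration and are not part of pandas.

```python
import pandas as pd

# Hypothetical stand-in for one scraped table: both fighters' values
# share a cell, separated by two or more spaces.
combined = pd.DataFrame(
    {
        "Fighter": ["Jessica Andrade  Angela Hill"],
        "KD": ["0  1"],
        "Sig. str.": ["131 of 322  89 of 209"],
    }
)


def split_cells(df, pattern=r"\s{2,}"):
    """Split every cell on `pattern` and give each piece its own row."""
    # Cast to str first so numeric columns can be split as well.
    # A multi-character pattern is treated as a regular expression.
    lists = df.astype(str).apply(lambda col: col.str.split(pattern))
    # Exploding several columns at once needs pandas >= 1.3 and requires
    # every cell in a row to split into the same number of pieces.
    return lists.explode(list(lists.columns), ignore_index=True)


split = split_cells(combined)
print(split)
```

The columns stay the same and each original row becomes one row per fighter. If the cells in a row split into different numbers of pieces, `explode` raises a ValueError instead of silently misaligning the values.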