May 24, 2023 20:52:41

How do I split cells in a pandas DataFrame that have combined data from HTML tables?

Question

Pandas is combining multiple cells into one cell. Is this a problem with the website or with the function?

import pandas as pd
tables = pd.read_html('http://www.ufcstats.com/fight-details/451525a1abe30a91')

This says there are four tables, but inside those tables the actual data is being combined. For example, 39 of 102 26 of 73 should be two cells, but it is one.

I tried to extract the tables individually and then split them:

import re
df = tables[0].values.tolist()
print(type(df[0][0]))
print(re.split(r"\s{2,}", df[0][0]))

This achieves nothing. It just returns the cell unchanged, and there is not even an error, so I am unsure what is wrong.

Is there a way to iterate over the entire dataframe and split wherever a double space is present, without changing its structure?
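A possible reason for the unchanged result, sketched with hard-coded strings rather than the live page: `re.split` only cuts where the pattern matches, so a cell that contains nothing but single spaces comes back as a one-element list. `df[0][0]` is the first cell of the first row, which is presumably the fighter-name cell. The helper `split_cell` and both sample strings are assumptions for illustration, not values read from the site.

```python
import re


def split_cell(text: str) -> list[str]:
    """Split one cell wherever two or more whitespace characters occur in a row."""
    return re.split(r"\s{2,}", text.strip())


# Hypothetical cell values, modelled on the cells quoted in this thread.
name_cell = "Jessica Andrade Angela Hill"  # single spaces only
stat_cell = "131 of 322  89 of 209"        # two spaces between the two values

print(split_cell(name_cell))  # ['Jessica Andrade Angela Hill']
print(split_cell(stat_cell))  # ['131 of 322', '89 of 209']
```

If that is what happens here, trying the same split on a statistics cell such as `df[0][2]` would be a quick way to check.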

Answer 1

Score: 2

I'm not sure whether pandas can parse nested data/paragraphs inside each <td> tag.

Either way, as a starting point, here is one way to parse the first table with BeautifulSoup:

# pip install beautifulsoup4 html5lib requests
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "http://www.ufcstats.com/fight-details/451525a1abe30a91"
soup = BeautifulSoup(requests.get(url).text, "html5lib")
table = soup.find("table")  # first table on the page
headers = [col.text.strip() for col in table.find("thead").find_all("th")]
# Each <td> is assumed to hold one <p> per fighter; collect them separately.
data = [
    [p.text.strip() for p in cell.find_all("p")]
    for row in table.find_all("tr") for cell in row.find_all("td")
]
# data has one entry per table column, so transpose before labelling.
df = pd.DataFrame(data).T.set_axis(headers, axis=1)

Output:

| Fighter         |   KD | Sig. str.   | Sig. str. %   | Total str.   | Td     | Td %   |   Sub. att |   Rev. | Ctrl   |
|:----------------|-----:|:------------|:--------------|:-------------|:-------|:-------|-----------:|-------:|:-------|
| Jessica Andrade |    0 | 131 of 322  | 40%           | 148 of 342   | 2 of 3 | 66%    |          0 |      0 | 1:35   |
| Angela Hill     |    1 | 89 of 209   | 42%           | 99 of 220    | 0 of 0 | ---    |          0 |      0 | 0:06   |

With read_html, we get:

| Fighter                     | KD   | Sig. str.             | Sig. str. %   | Total str.            | Td             | Td %     | Sub. att   | Rev.   | Ctrl       |
|:----------------------------|:-----|:----------------------|:--------------|:----------------------|:---------------|:---------|:-----------|:-------|:-----------|
| Jessica Andrade Angela Hill | 0  1 | 131 of 322  89 of 209 | 40%  42%      | 148 of 342  99 of 220 | 2 of 3  0 of 0 | 66%  --- | 0  0       | 0  0   | 1:35  0:06 |
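For anyone who would rather stay with read_html, here is a minimal sketch of the split the question asks about. It assumes the statistics cells are separated by two or more spaces, as in the read_html output above, and it needs pandas 1.3 or later for multi-column `explode`. The function name `split_cells` and the sample frame `raw` are hypothetical; `raw` is typed in by hand from the values above rather than fetched from the site. The Fighter column is left out because the two names appear to be joined by a single space, so this pattern cannot separate them.

```python
import re

import pandas as pd


def split_cells(df: pd.DataFrame, pattern: str = r"\s{2,}") -> pd.DataFrame:
    """Split every cell on `pattern` and return one row per resulting piece.

    Every cell in a row must split into the same number of pieces,
    otherwise DataFrame.explode raises a ValueError.
    """
    # Turn each cell into a list of its pieces.
    parts = df.astype(str).apply(
        lambda col: col.map(lambda s: re.split(pattern, s.strip()))
    )
    # Explode all columns together so the pieces stay aligned row by row.
    return parts.explode(list(parts.columns), ignore_index=True)


# Hand-typed sample standing in for a few columns of tables[0] (an assumption).
raw = pd.DataFrame(
    {
        "KD": ["0  1"],
        "Sig. str.": ["131 of 322  89 of 209"],
        "Td %": ["66%  ---"],
        "Ctrl": ["1:35  0:06"],
    }
)

tidy = split_cells(raw)
print(tidy)
```

On the real data this would be called as `split_cells(tables[0].drop(columns="Fighter"))`, with the fighter names added back separately, for example from the BeautifulSoup result. All values come back as strings, so numeric columns still need converting afterwards.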
