How do I split cells in a Pandas DataFrame that contain combined data from HTML tables?

Question


Pandas is combining multiple cells into one cell. Is this a problem with the website or with the function?

    import pandas as pd
    tables = pd.read_html('http://www.ufcstats.com/fight-details/451525a1abe30a91')

This returns four tables, but within those tables the actual data is combined. For example, "39 of 102 26 of 73" should be two cells, but it is one.

I tried to extract the tables individually and then split them:

    import re

    # Convert the first table to a nested list and try to split its first cell
    df = tables[0].values.tolist()
    print(type(df[0][0]))
    print(re.split(r"\s{2,}", df[0][0]))

This achieves nothing: it just returns the cell unchanged. I'm not sure what is wrong, since there isn't even an error.
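A short sketch of why the call can return the cell unchanged without raising: `re.split` returns the whole string as a one-element list whenever the pattern never matches. One possible explanation, which is an assumption and not confirmed anywhere in this post, is that the cell text contains only single spaces by the time it reaches the DataFrame. The two sample strings below are made up for illustration.

```python
import re

PATTERN = r"\s{2,}"  # two or more consecutive whitespace characters

# Hypothetical cell contents (not taken from the real page)
single_spaced = "39 of 102 26 of 73"    # only single spaces between tokens
double_spaced = "39 of 102  26 of 73"   # a two-space separator between values

no_match = re.split(PATTERN, single_spaced)
matched = re.split(PATTERN, double_spaced)

print(no_match)  # ['39 of 102 26 of 73']
print(matched)   # ['39 of 102', '26 of 73']
```

Printing `repr(df[0][0])` on the real data would show which of the two cases applies.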

Is there a way to iterate over the entire DataFrame and split wherever a double space is present, without changing its structure?
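One way this could look, as a minimal sketch only: split every cell into a list, then explode all columns together so each combined row becomes one row per value while the columns stay the same. The frame `combined` and the helper `split_rows` are hypothetical, and the sketch assumes a two-space separator actually survives in the cells and that every cell in a row splits into the same number of parts (otherwise `explode` raises a `ValueError`). Exploding several columns at once needs pandas 1.3 or newer.

```python
import re

import pandas as pd

# Hypothetical one-row frame mimicking the combined cells; the two-space
# separator is an assumption about the data, not something the page guarantees.
combined = pd.DataFrame({
    "Fighter": ["Jessica Andrade  Angela Hill"],
    "KD": ["0  1"],
    "Sig. str.": ["131 of 322  89 of 209"],
})


def split_rows(df, pattern=r"\s{2,}"):
    """Split every cell on `pattern` and give each part its own row."""
    # Turn each cell into a list of parts, column by column.
    parts = df.apply(lambda col: col.map(lambda cell: re.split(pattern, str(cell))))
    # Explode all columns in parallel so the parts stay aligned across columns.
    return parts.explode(list(parts.columns), ignore_index=True)


split = split_rows(combined)
print(split)
```

The result keeps the original column labels and only the number of rows changes. All values come back as strings, so numeric columns would still need converting afterwards.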

Answer 1

Score: 2


I'm not sure whether pandas can parse the nested data (paragraphs) inside each <td> tag.

In any case, as a starting point, here is one way to parse the first table with BeautifulSoup:

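    # pip install beautifulsoup4 html5lib requests
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    url = "http://www.ufcstats.com/fight-details/451525a1abe30a91"
    soup = BeautifulSoup(requests.get(url).text, "html5lib")
    table = soup.find("table")  # first table on the page
    headers = [col.text.strip() for col in table.find("thead").find_all("th")]
    # Assumption: each <td> holds one <p> per fighter; collect them per cell.
    data = [
        [p.text.strip() for p in cell.find_all("p")]
        for row in table.find_all("tr") for cell in row.find_all("td")
    ]
    # data has one entry per table column, so transpose before labelling.
    df = pd.DataFrame(data).T.set_axis(headers, axis=1)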

Output:

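| Fighter         |   KD | Sig. str.  | Sig. str. % | Total str. | Td     | Td % | Sub. att | Rev. | Ctrl |
|:----------------|-----:|:-----------|:------------|:-----------|:-------|:-----|---------:|-----:|:-----|
| Jessica Andrade |    0 | 131 of 322 | 40%         | 148 of 342 | 2 of 3 | 66%  |        0 |    0 | 1:35 |
| Angela Hill     |    1 | 89 of 209  | 42%         | 99 of 220  | 0 of 0 | ---  |        0 |    0 | 0:06 |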

With read_html, we get:

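| Fighter                     | KD  | Sig. str.            | Sig. str. % | Total str.           | Td            | Td %    | Sub. att | Rev. | Ctrl      |
|:----------------------------|:----|:---------------------|:------------|:---------------------|:--------------|:--------|:---------|:-----|:----------|
| Jessica Andrade Angela Hill | 0 1 | 131 of 322 89 of 209 | 40% 42%     | 148 of 342 99 of 220 | 2 of 3 0 of 0 | 66% --- | 0 0      | 0 0  | 1:35 0:06 |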
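The question mentions four tables, while the snippet above handles only the first. A possible generalization, sketched here under stated assumptions, wraps the same per-`<p>` extraction in a function and applies it to every table. `SAMPLE_HTML` is a made-up, heavily trimmed stand-in for the page markup and `table_to_frame` is a hypothetical helper. The sketch assumes every table has a `<thead>` and a `<tbody>`, one `<td>` per header in each body row, and one `<p>` per fighter in each cell; the real page may differ.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample markup (not copied from the real page)
SAMPLE_HTML = """
<table>
  <thead><tr><th>Fighter</th><th>KD</th></tr></thead>
  <tbody><tr>
    <td><p>Jessica Andrade</p><p>Angela Hill</p></td>
    <td><p>0</p><p>1</p></td>
  </tr></tbody>
</table>
"""


def table_to_frame(table):
    """Build a DataFrame from one <table>, one output row per <p> in a cell."""
    headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
    frames = []
    for row in table.find("tbody").find_all("tr"):
        # One list per cell, holding the text of each <p> inside it.
        columns = [
            [p.get_text(strip=True) for p in td.find_all("p")]
            for td in row.find_all("td")
        ]
        # Cells are columns, so transpose to get one row per fighter.
        frames.append(pd.DataFrame(columns).T)
    return pd.concat(frames, ignore_index=True).set_axis(headers, axis=1)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
frames = [table_to_frame(t) for t in soup.find_all("table")]
print(frames[0])
```

The built-in `html.parser` is used here only to keep the sketch free of extra dependencies. With the live page, `SAMPLE_HTML` would be replaced by the downloaded HTML.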

Posted by huangapple on May 24, 2023 at 20:52:41