March 3, 2023, 20:22:59

How to organise DataFrame columns

Question

I am trying to organise DataFrame columns according to specific rules, but I don't know how.

For example, I have a DataFrame related to chemistry as shown below.
Each row shows the number of chemical bonds in a chemical compound.

   OH  HO  CaO  OCa  OO  NaMg  MgNa
0   2   3    2    0   1     1     1
1   0   2    3    4   5     2     0
2   1   2    3    0   0     0     0

In chemistry, an OH (oxygen-hydrogen) bond is the same as an HO (hydrogen-oxygen) bond, and a CaO (calcium-oxygen) bond is the same as an OCa (oxygen-calcium) bond. Thus, I'd like to merge the equivalent columns and organise the DataFrame as shown below.

   OH  CaO  OO  NaMg
0   5    2   1     2
1   2    7   5     2
2   3    3   0     0

I’m struggling because:

there are many kinds of chemical bonds in my real DataFrame, so it is impossible to organise them one by one (there are more than 3,000 columns, and I don't know which bond types exist or which are duplicates).
the number of letters differs between element symbols, and some symbols include lowercase letters (e.g. hydrogen: H, one uppercase letter; calcium: Ca, two letters, one uppercase and one lowercase).

I searched online for the same question and tried writing code myself, but I could not find a solution. I would like to know what code would solve my problem.

Answer 1

Score: 6

You can use str.findall to extract the individual elements, then sort them to normalise each pair. Using frozenset is not a good solution because for OO the second O would be lost.

Now you can group by these sorted tuples and apply sum:

# Modified from https://www.johndcook.com/blog/2016/02/04/regular-expression-to-match-a-chemical-element/
pat = r'(A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airuv]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|U(?:u[opst])?|V|W|Xe|Yb?|Z[nr])'
# Split each column name into element symbols and sort them, so 'OH' and 'HO' share one key
grp = df.columns.str.findall(pat).map(lambda x: tuple(sorted(x)))
# Sum the columns that share a key, then join each key back into a column label
out = df.groupby(grp, axis=1).sum().rename(columns=''.join)

Output:

>>> out
   CaO  HO  MgNa  OO
0    2   5     2   1
1    7   2     2   5
2    3   3     0   0
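
To see why frozenset loses information for OO while a sorted tuple does not, here is a minimal sketch using only Python's re module. It assumes the simplified pattern [A-Z][a-z]* instead of the full element regex above, and the helper name split_elements is made up for illustration.

```python
import re

def split_elements(name):
    # Assumption: one uppercase letter followed by optional lowercase letters
    # is enough to split the symbols; the full element regex is stricter.
    return re.findall(r"[A-Z][a-z]*", name)

pairs = ["OH", "HO", "OO"]

# A set collapses repeated elements, so 'OO' ends up with a single 'O'.
as_sets = [frozenset(split_elements(p)) for p in pairs]

# A sorted tuple ignores the order but keeps both elements.
as_tuples = [tuple(sorted(split_elements(p))) for p in pairs]

print(as_sets)
print(as_tuples)
```

With the set-based key, a label rebuilt from the key for OO would contain only one O, which is why the answer switches to sorted tuples.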

Answer 2

Score: 4

Another approach using a regex and sorted:

import re
sorter = lambda x: ''.join(sorted(re.findall('[A-Z][a-z]*', x)))
out = (df.groupby(df.columns.map(sorter), axis=1, sort=False)
         .sum()
       )

Output:

   HO  CaO  OO  MgNa
0   5    2   1     2
1   2    7   5     2
2   3    3   0     0
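
Note that the axis=1 argument of groupby is deprecated in recent pandas releases (since pandas 2.1). A possible variant, sketched here under that assumption, renames the columns to a common label and groups the transposed frame instead. It uses the same [A-Z][a-z]* pattern; the names canonical and first_seen are illustrative, and keeping the first spelling seen for each bond is a choice made here to match the layout the question asks for.

```python
import re
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    [[2, 3, 2, 0, 1, 1, 1],
     [0, 2, 3, 4, 5, 2, 0],
     [1, 2, 3, 0, 0, 0, 0]],
    columns=["OH", "HO", "CaO", "OCa", "OO", "NaMg", "MgNa"],
)

def canonical(name: str) -> str:
    """Order-independent key: element symbols sorted and re-joined."""
    return "".join(sorted(re.findall(r"[A-Z][a-z]*", name)))

# Remember the first spelling seen for each key, so labels stay as in the input.
first_seen = {}
for col in df.columns:
    first_seen.setdefault(canonical(col), col)

out = (
    df.rename(columns=lambda c: first_seen[canonical(c)])  # equivalent bonds now share a label
      .T.groupby(level=0, sort=False).sum()                # sum rows with the same label
      .T                                                   # back to one row per compound
)
print(out)
```

On the sample data this gives the columns OH, CaO, OO and NaMg in their original order. With 3,000 columns the transposes should still be cheap, but that has not been measured here.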

Answer 3

Score: 1

Another possible solution:

df.columns = (pd.DataFrame
              .from_records([[''.join(sorted(x)), x] for x in df.columns])
              .groupby(0)[1].transform('first').to_list())
df.stack().groupby(level=[0,1]).sum().unstack()

Output:

   CaO  NaMg  OH  OO
0    2     2   5   1
1    7     2   2   5
2    3     0   3   0
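
One caveat worth checking before using this on 3,000 columns: sorted(x) here sorts the individual characters of the column name, not the element symbols, so two different bonds made of the same letters would be merged. A small sketch of the difference follows; the column names NaC and CaN are hypothetical and do not appear in the question's data.

```python
import re

def char_key(name):
    # Key as built in the answer above: sort the raw characters.
    return "".join(sorted(name))

def element_key(name):
    # Alternative key: split into element symbols first, then sort those.
    return "".join(sorted(re.findall(r"[A-Z][a-z]*", name)))

# Hypothetical labels: sodium-carbon and calcium-nitrogen
a, b = "NaC", "CaN"

print(char_key(a) == char_key(b))        # same letters, so the keys collide
print(element_key(a) == element_key(b))  # different elements, so the keys differ
```

For the sample data in the question both keys give the same grouping, so this only matters if such letter-sharing pairs exist in the real columns.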
