How to organise DataFrame columns

Question

I am trying to organise DataFrame columns according to a specific rule, but I don't know how.

For example, I have a chemistry-related DataFrame as shown below.
Each row gives the number of chemical bonds of each type in one chemical compound.

   OH  HO  CaO  OCa  OO  NaMg  MgNa
0   2   3    2    0   1     1     1
1   0   2    3    4   5     2     0
2   1   2    3    0   0     0     0

In chemistry, an OH (Oxygen-Hydrogen) bond is the same as an HO (Hydrogen-Oxygen) bond, and a CaO (Calcium-Oxygen) bond is the same as an OCa (Oxygen-Calcium) bond. I would therefore like to merge such columns by summing them, as shown below.

   OH  CaO  OO  NaMg
0   5    2   1     2
1   2    7   5     2
2   3    3   0     0

I'm struggling because:

  • my real DataFrame contains a wide variety of chemical bonds, so I cannot sort them out one by one (there are more than 3,000 columns, and I don't know which bond types exist or which of them are duplicates);
  • element symbols differ in length, and some include a lowercase letter
    (e.g. Hydrogen: H, one uppercase letter; Calcium: Ca, one uppercase letter followed by one lowercase letter).

I searched online for similar questions and tried writing code myself, but could not find a solution. I would like to know what code would solve this problem.
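
For reproducibility, the example frame from the question can be built as in the sketch below. The values are copied from the table above; the variable name `df` is an assumption, chosen because the answers refer to the frame by that name.

```python
import pandas as pd

# Example data copied from the question; each row is one compound,
# each column is the count of one bond type.
df = pd.DataFrame(
    {
        "OH": [2, 0, 1],
        "HO": [3, 2, 2],
        "CaO": [2, 3, 3],
        "OCa": [0, 4, 0],
        "OO": [1, 5, 0],
        "NaMg": [1, 2, 0],
        "MgNa": [1, 0, 0],
    }
)

print(df)
```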

Answer 1

Score: 6

You can use str.findall to extract the individual elements and then sort them, so that both orderings of a pair map to the same key. A frozenset is not a good choice here, because for OO the second O would be lost.
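
The difference between the two kinds of key can be seen in a small sketch. It uses the simplified pattern `[A-Z][a-z]*` rather than the full element pattern, and the single-element name `"O"` is a hypothetical column added only to show the collision.

```python
import re

def elements(name):
    # Simplified tokenizer (an assumption): one uppercase letter
    # followed by any number of lowercase letters.
    return re.findall(r"[A-Z][a-z]*", name)

# With a frozenset, the repeated O in "OO" collapses, so "OO" and the
# hypothetical column "O" would end up in the same group.
set_oo = frozenset(elements("OO"))
set_o = frozenset(elements("O"))
print(set_oo == set_o)      # True

# A sorted tuple keeps duplicates, so the two names stay distinct,
# while "OH" and "HO" still share one key.
tuple_oo = tuple(sorted(elements("OO")))
tuple_o = tuple(sorted(elements("O")))
print(tuple_oo == tuple_o)  # False
print(tuple(sorted(elements("OH"))) == tuple(sorted(elements("HO"))))  # True
```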

Now you can group by these sorted tuples and apply sum:

# Modified from https://www.johndcook.com/blog/2016/02/04/regular-expression-to-match-a-chemical-element/
pat = r'(A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airuv]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|U(?:u[opst])?|V|W|Xe|Yb?|Z[nr])'

# Split each column name into element symbols and sort them, so 'OH' and 'HO' share one key
grp = df.columns.str.findall(pat).map(lambda x: tuple(sorted(x)))
# Sum the columns that share a key, then join each key back into a single label
out = df.groupby(grp, axis=1).sum().rename(columns=''.join)

Output:

>>> out
   CaO  HO  MgNa  OO
0    2   5     2   1
1    7   2     2   5
2    3   3     0   0
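
Note that `groupby(..., axis=1)` is deprecated in pandas 2.1 and later. A possible equivalent, sketched here under the assumption that the simplified pattern `[A-Z][a-z]*` is enough for the column names at hand, is to transpose, group the rows, and transpose back. The helper name `canonical` is hypothetical.

```python
import re
import pandas as pd

df = pd.DataFrame(
    {"OH": [2, 0, 1], "HO": [3, 2, 2], "CaO": [2, 3, 3], "OCa": [0, 4, 0],
     "OO": [1, 5, 0], "NaMg": [1, 2, 0], "MgNa": [1, 0, 0]}
)

def canonical(name):
    # Split into element symbols, sort them, and join them back,
    # so that 'OCa' and 'CaO' both become 'CaO'.
    return "".join(sorted(re.findall(r"[A-Z][a-z]*", name)))

# One key per original column, in column order.
keys = df.columns.map(canonical)

# Transpose so the columns become rows, sum the rows that share a key,
# then transpose back.
out = df.T.groupby(keys).sum().T

print(out)
#    CaO  HO  MgNa  OO
# 0    2   5     2   1
# 1    7   2     2   5
# 2    3   3     0   0
```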

Answer 2

Score: 4

Another approach using a regex and sorted:

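import re

# Split a column name into element symbols (one uppercase letter plus any
# lowercase letters), sort them, and join them back: 'OH' -> 'HO'
sorter = lambda x: ''.join(sorted(re.findall(r'[A-Z][a-z]*', x)))

out = (df.groupby(df.columns.map(sorter), axis=1, sort=False)
         .sum()
       )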

Output:

   HO  CaO  OO  MgNa
0   5    2   1     2
1   2    7   5     2
2   3    3   0     0
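
With several thousand columns it may be worth checking that the pattern really covers every column name before grouping. The sketch below is one way to do that; the helper `split_bond` and the rule that the tokens must rejoin to the original name are assumptions, not part of the answer above.

```python
import re

ELEMENT = re.compile(r"[A-Z][a-z]*")

def split_bond(name):
    """Split a bond label into element symbols, or raise if part of it is unmatched."""
    tokens = ELEMENT.findall(name)
    # If the tokens do not rebuild the full name, something in the label
    # (a digit, a dash, a leading lowercase letter, ...) was silently skipped.
    if "".join(tokens) != name:
        raise ValueError(f"cannot fully tokenize column name {name!r}")
    return tokens

print(split_bond("CaO"))   # ['Ca', 'O']
print(split_bond("NaMg"))  # ['Na', 'Mg']

try:
    split_bond("caO")      # leading lowercase 'ca' is not matched
except ValueError as exc:
    print(exc)
```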

Answer 3

Score: 1

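Another possible solution. Note that it sorts the individual characters of each column name rather than its element symbols, so distinct bonds made of the same letters (for example NaC and CaN) would be merged: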

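import pandas as pd

# Relabel each column with the first column name that has the same sorted characters
df.columns = (pd.DataFrame
              .from_records([[''.join(sorted(x)), x] for x in df.columns])
              .groupby(0)[1].transform('first').to_list())
# Sum the values that now share a (row, column label) pair
df.stack().groupby(level=[0,1]).sum().unstack()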

Output:

   CaO  NaMg  OH  OO
0    2     2   5   1
1    7     2   2   5
2    3     0   3   0
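
To keep the first-seen column name as the label (OH, CaO, OO, NaMg, as in the question's desired output) while still comparing element symbols rather than single characters, a sketch along the following lines might work. The names `canonical` and `first_seen` are hypothetical, and the simplified pattern `[A-Z][a-z]*` is assumed to match every symbol in the column names.

```python
import re
import pandas as pd

df = pd.DataFrame(
    {"OH": [2, 0, 1], "HO": [3, 2, 2], "CaO": [2, 3, 3], "OCa": [0, 4, 0],
     "OO": [1, 5, 0], "NaMg": [1, 2, 0], "MgNa": [1, 0, 0]}
)

def canonical(name):
    # Order-independent key built from element symbols, not characters.
    return "".join(sorted(re.findall(r"[A-Z][a-z]*", name)))

# Remember the first original column name seen for each key.
first_seen = {}
for col in df.columns:
    first_seen.setdefault(canonical(col), col)

# Label every column with the first-seen name of its group.
labels = df.columns.map(lambda c: first_seen[canonical(c)])

# sort=False keeps the groups in order of first appearance.
out = df.T.groupby(labels, sort=False).sum().T

print(out)
#    OH  CaO  OO  NaMg
# 0   5    2   1     2
# 1   2    7   5     2
# 2   3    3   0     0
```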

huangapple
  • Posted on 3 March 2023, 20:22:59
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75627029.html