How to organise DataFrame columns
Question
I am trying to reorganise DataFrame columns according to a specific rule, but I don't know how.
For example, I have a chemistry-related DataFrame like the one below.
Each row gives the number of chemical bonds of each type in one compound.
   OH  HO  CaO  OCa  OO  NaMg  MgNa
0   2   3    2    0   1     1     1
1   0   2    3    4   5     2     0
2   1   2    3    0   0     0     0
In chemistry, an OH (oxygen-hydrogen) bond is the same as an HO (hydrogen-oxygen) bond, and a CaO (calcium-oxygen) bond is the same as an OCa (oxygen-calcium) bond. I would therefore like to merge such columns so that the DataFrame looks like this:
   OH  CaO  OO  NaMg
0   5    2   1     2
1   2    7   5     2
2   3    3   0     0
I'm struggling because:
- my real DataFrame contains a wide variety of chemical bonds, so I cannot merge the columns one by one (there are more than 3,000 columns, and I don't know in advance which bond types are present or which ones are duplicates);
- element symbols differ in length, and some include a lowercase letter (e.g. hydrogen is H, a single uppercase letter, while calcium is Ca, an uppercase letter followed by a lowercase one).
I searched online for a similar question and tried writing code myself, but could not find a solution. How can I do this?
Answer 1
Score: 6
You can use str.findall to extract the individual element symbols from each column name, then sort them so that both orderings of a pair map to the same key. A frozenset is not a good fit here because, for OO, the second O would be lost.
Now you can group the columns by these sorted tuples and apply sum:
# Pattern modified from https://www.johndcook.com/blog/2016/02/04/regular-expression-to-match-a-chemical-element/
pat = r'(A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airuv]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|U(?:u[opst])?|V|W|Xe|Yb?|Z[nr])'
# Split each column name into element symbols and sort them, so OH and HO share one key
grp = df.columns.str.findall(pat).map(lambda x: tuple(sorted(x)))
# Sum the columns that share a key, then join each key back into a single label
out = df.groupby(grp, axis=1).sum().rename(columns=''.join)
Output:
>>> out
   CaO  HO  MgNa  OO
0    2   5     2   1
1    7   2     2   5
2    3   3     0   0
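To see why a sorted tuple is preferred over a frozenset as the grouping key, here is a minimal sketch in plain Python. The simplified symbol pattern and the helper names set_key and tuple_key are assumptions made for illustration; they are not part of the answer above.

```python
import re

# Simplified symbol pattern (an assumption for this sketch): one uppercase
# letter followed by any lowercase letters. It does not validate real elements.
SYMBOL = re.compile(r"[A-Z][a-z]*")

def set_key(name):
    """Order-insensitive key that also drops repeated symbols."""
    return frozenset(SYMBOL.findall(name))

def tuple_key(name):
    """Order-insensitive key that keeps repeated symbols."""
    return tuple(sorted(SYMBOL.findall(name)))

# "O" is a hypothetical single-symbol column, included only to show the collision.
for name in ["OH", "HO", "OO", "O"]:
    print(name, sorted(set_key(name)), tuple_key(name))
```

Both keys treat OH and HO as the same bond, but only the tuple key keeps OO distinct from a lone O.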
Answer 2
Score: 4
Another approach using a regex and sorted:
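import re
# Split a name into symbols (an uppercase letter plus any lowercase letters), sort them, and rejoin
sorter = lambda x: ''.join(sorted(re.findall(r'[A-Z][a-z]*', x)))
# sort=False keeps the groups in order of first appearance
out = (df.groupby(df.columns.map(sorter), axis=1, sort=False)
         .sum()
      )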
Output:
   HO  CaO  OO  MgNa
0   5    2   1     2
1   2    7   5     2
2   3    3   0     0
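Note that grouping with axis=1 has been deprecated since pandas 2.1. A possible way to express the same idea without it is to transpose, group the rows, and transpose back. The sketch below rebuilds the example data from the question; the helper name bond_key is an assumption introduced here, not something from the answers.

```python
import re
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame(
    [[2, 3, 2, 0, 1, 1, 1],
     [0, 2, 3, 4, 5, 2, 0],
     [1, 2, 3, 0, 0, 0, 0]],
    columns=["OH", "HO", "CaO", "OCa", "OO", "NaMg", "MgNa"],
)

def bond_key(name):
    """Split a column name into element symbols and join them in sorted order."""
    return "".join(sorted(re.findall(r"[A-Z][a-z]*", name)))

# One canonical key per column, aligned with df.columns.
keys = df.columns.map(bond_key)

# Transpose so the bond names become the row index, group those rows by key,
# sum each group, and transpose back. sort=False keeps first-appearance order.
out = df.T.groupby(keys, sort=False).sum().T
print(out)
```

The transpose copies the data, so on a very wide frame it may be worth checking memory use before adopting this form.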
Answer 3
Score: 1
Another possible solution:
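import pandas as pd

# Key each column on its sorted characters and relabel it with the first spelling seen for that key
df.columns = (pd.DataFrame
    .from_records([[''.join(sorted(x)), x] for x in df.columns])
    .groupby(0)[1].transform('first').to_list())
# Columns now carry duplicate labels: stack, sum per (row, label), and unstack again
out = df.stack().groupby(level=[0, 1]).sum().unstack()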
Output:
   CaO  NaMg  OH  OO
0    2     2   5   1
1    7     2   2   5
2    3     0   3   0
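One thing to watch with a key built from sorted characters, as opposed to sorted element symbols, is that two different bonds made of the same letters end up with the same key. A small sketch of the difference, using NaC and CaN as hypothetical column names that do not appear in the question's data (char_key and element_key are likewise names invented here):

```python
import re

def char_key(name):
    """Key built from the sorted characters of the name."""
    return "".join(sorted(name))

def element_key(name):
    """Key built from the sorted element symbols of the name."""
    return "".join(sorted(re.findall(r"[A-Z][a-z]*", name)))

# (NaMg, MgNa) is the same bond written two ways; (NaC, CaN) are different
# bonds that happen to share the letters C, N and a.
for a, b in [("NaMg", "MgNa"), ("NaC", "CaN")]:
    print(a, b,
          "char keys equal:", char_key(a) == char_key(b),
          "element keys equal:", element_key(a) == element_key(b))
```

For the columns shown in the question the two keys agree, so this only matters if the real data contains such letter-sharing pairs.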