How can I use transform (or other) instead of merge and temporary columns for special indexing/ranking?

Question

I have the following Python code, which produces the correct output, but I sense it could be more Pythonic by removing the temporary columns and the merge, perhaps with transform or some other clever approach. The data is presorted by category, then subcategory. Each group will have a sub_cat value of 'c'. The idea is to output a rank column in which the 'c' row is always zero and the other rows count up or down from it by integers, resetting at each group. The cat_index and c_idx columns (c_idx being the index of 'c' within the group) will be discarded.

Here is my working code:

import pandas as pd
import io

data = '''
category,sub_cat
A,a
A,c
A,d
B,c
B,d
D,a
D,b
D,c
G,b
G,c
G,e
G,f
G,h
'''

df = pd.read_csv(io.StringIO(data), sep=',')
df['cat_index'] = df.groupby('category').cumcount()
c_idx = df.groupby('category').apply(lambda x: x.loc[x['sub_cat'] == 'c', 'cat_index'].values[0]).to_frame('c_idx')
df = df.merge(c_idx, how='left', on='category')
df['rank'] = df['cat_index'] - df['c_idx']

And here is the correct output:
[Image: output table, not preserved in this copy]

Again, 'rank' is the desired output from the inputs 'category' and 'sub_cat'. The columns 'cat_index' and 'c_idx' are temporary and will be dropped.

Any ideas? Let me know. Thanks.

Answer 1

Score: 1

Try:

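import numpy as np
# Per group: 0..n-1 minus the position of 'c' in the group (assumes the default RangeIndex)
df['rank'] = df.groupby('category')['sub_cat'].transform(lambda x: np.arange(len(x)) - ((x == 'c').idxmax() - x.index[0]))
print(df)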

Prints:

   category sub_cat  rank
0         A       a    -1
1         A       c     0
2         A       d     1
3         B       c     0
4         B       d     1
5         D       a    -2
6         D       b    -1
7         D       c     0
8         G       b    -1
9         G       c     0
10        G       e     1
11        G       f     2
12        G       h     3
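
If you would rather avoid the Python-level lambda, here is a minimal sketch of a variant built only from cumcount and transform. It assumes, as the question states, that every group contains a 'c' row. The names pos and c_pos are just illustrative. Because it works from cumcount rather than from index labels, it should not depend on the frame having the default RangeIndex.

```python
import io

import pandas as pd

data = """category,sub_cat
A,a
A,c
A,d
B,c
B,d
D,a
D,b
D,c
G,b
G,c
G,e
G,f
G,h
"""

df = pd.read_csv(io.StringIO(data))

# Position of each row within its category: 0, 1, 2, ...
pos = df.groupby("category").cumcount()

# Keep the position only on the 'c' rows (NaN elsewhere), then broadcast
# the first non-null value of each group back to every row of that group.
c_pos = pos.where(df["sub_cat"].eq("c")).groupby(df["category"]).transform("first")

# Assumption: every group has a 'c' row, so c_pos has no NaN and the cast is safe.
df["rank"] = (pos - c_pos).astype(int)
print(df)
```

This keeps no temporary columns on the frame; pos and c_pos are standalone Series that can simply go out of scope.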

huangapple
  • Posted on June 8, 2023, 03:49:39
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76426683.html