Python pandas: keep the first column's order unchanged while sorting the second column in ascending order

Question

Hi, I want to keep the order of the infoid column unchanged, but sort the dates in increasing (ascending) order within each infoid.
Is that possible?

statisticsdate  infoid 
20230108      46726004
20230106      46726004
20230108      46725082
20230107      46725082
20230108      46725081
20230108      46724162
20230108      46720662 

should be like:

statisticsdate  infoid 
20230106      46726004
20230108      46726004
20230107      46725082
20230108      46725082
20230108      46725081
20230108      46724162
20230108      46720662 

Answer 1

Score: 2

If it is acceptable to sort infoid in descending order (the sample already is) and statisticsdate in ascending order, DataFrame.sort_values alone is enough:

df = df.sort_values(['infoid', 'statisticsdate'], ascending=[False, True], ignore_index=True)
print(df)
   statisticsdate    infoid
0        20230106  46726004
1        20230108  46726004
2        20230107  46725082
3        20230108  46725082
4        20230108  46725081
5        20230108  46724162
6        20230108  46720662
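
To try this out, here is a minimal sketch that rebuilds the sample frame from the question and runs the same sort_values call. The variable names out and original_ids are only for illustration. Note that sorting infoid in descending order reproduces the requested layout only because the sample ids already appear in descending order; that is a property of this data, not something the call guarantees.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "statisticsdate": [20230108, 20230106, 20230108, 20230107,
                       20230108, 20230108, 20230108],
    "infoid": [46726004, 46726004, 46725082, 46725082,
               46725081, 46724162, 46720662],
})

# Remember the original infoid order so it can be compared afterwards.
original_ids = df["infoid"].tolist()

# infoid descending, statisticsdate ascending within each infoid.
out = df.sort_values(["infoid", "statisticsdate"],
                     ascending=[False, True], ignore_index=True)

print(out)
# True here only because the sample ids were already in descending order.
print(out["infoid"].tolist() == original_ids)
```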

A more general idea is to sort within each group using a custom lambda function, passing sort=False so that groupby does not reorder the groups. This solution is slow for larger DataFrames:

df['statisticsdate'] = (df.groupby('infoid', sort=False, group_keys=False)['statisticsdate']
                          .apply(lambda x: x.sort_values())
                          .to_numpy())
print(df)
   statisticsdate    infoid
0        20230106  46726004
1        20230108  46726004
2        20230107  46725082
3        20230108  46725082
4        20230108  46725081
5        20230108  46724162
6        20230108  46720662

Or you can convert infoid to an ordered Categorical, with categories in order of first appearance, and sort by both columns. ignore_index=True is needed so that the sorted values are assigned by position instead of being realigned to their original index labels (this assumes df has the default RangeIndex):

df['statisticsdate'] = (df.assign(infoid=pd.Categorical(df['infoid'],
                                                        ordered=True,
                                                        categories=df['infoid'].unique()))
                          .sort_values(['infoid', 'statisticsdate'], ignore_index=True)['statisticsdate'])
print(df)
   statisticsdate    infoid
0        20230106  46726004
1        20230108  46726004
2        20230107  46725082
3        20230108  46725082
4        20230108  46725081
5        20230108  46724162
6        20230108  46720662
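
The same "order of first appearance" idea can also be sketched with pd.factorize, which numbers each distinct value by when it is first seen. This is not part of the original answer; the small frame and the helper column name _grp are made up for illustration, and the ids are deliberately not in descending order to show that the result does not depend on that. It assumes every infoid forms one contiguous block.

```python
import pandas as pd

# Hypothetical data: ids are neither ascending nor descending.
df = pd.DataFrame({
    "statisticsdate": [20230108, 20230106, 20230108, 20230107, 20230108],
    "infoid": [30, 30, 10, 10, 20],
})

# factorize labels each id by first appearance: 30 -> 0, 10 -> 1, 20 -> 2.
first_seen = pd.factorize(df["infoid"])[0]

# Sort by (first-appearance label, date), then write the dates back by position.
df["statisticsdate"] = (df.assign(_grp=first_seen)
                          .sort_values(["_grp", "statisticsdate"])["statisticsdate"]
                          .to_numpy())

print(df)
```

Using to_numpy() drops the index before assignment, so the values land by position and no index realignment can undo the sort.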

If the same infoid can appear in several separate runs, it is necessary to convert infoid to labels of consecutive groups and then sort by both columns:

print(df)
   statisticsdate    infoid
0        20230108  46726004
1        20230106  46726004
2        20230108  46725082
3        20230107  46725082
4        20230108  46725081
5        20230108  46724162
6        20230108  46720662
7        20230108  46726004 <- not sorted group 46726004
8        20230106  46726004

df['statisticsdate'] = (df.assign(infoid=df['infoid'].ne(df['infoid'].shift()).cumsum())
              .sort_values(['infoid', 'statisticsdate'], ignore_index=True)['statisticsdate'])
print(df)
   statisticsdate    infoid
0        20230106  46726004
1        20230108  46726004
2        20230107  46725082
3        20230108  46725082
4        20230108  46725081
5        20230108  46724162
6        20230108  46720662
7        20230106  46726004
8        20230108  46726004
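
To see why this works, the sketch below prints the intermediate run labels for the nine-row sample above. The names starts and run_id and the helper column run are only for illustration; the sort itself is the same call as in the answer.

```python
import pandas as pd

# Nine-row sample from above: infoid 46726004 appears in two separate runs.
df = pd.DataFrame({
    "statisticsdate": [20230108, 20230106, 20230108, 20230107, 20230108,
                       20230108, 20230108, 20230108, 20230106],
    "infoid": [46726004, 46726004, 46725082, 46725082, 46725081,
               46724162, 46720662, 46726004, 46726004],
})

# True wherever infoid differs from the previous row
# (the first row is compared with NaN, so it is True as well).
starts = df["infoid"].ne(df["infoid"].shift())

# The cumulative sum turns each run of equal ids into its own label.
run_id = starts.cumsum()
print(run_id.tolist())  # [1, 1, 2, 2, 3, 4, 5, 6, 6]

# Sort dates inside each run; ignore_index=True makes the assignment positional.
df["statisticsdate"] = (df.assign(run=run_id)
                          .sort_values(["run", "statisticsdate"], ignore_index=True)["statisticsdate"])
print(df)
```

Because the second run of 46726004 gets its own label (6), its dates are sorted separately from the first run instead of being merged with it.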

Answer 2

Score: 2

This should do the trick (here the columns are named id and date):

>>> df.groupby('id').transform(lambda x: x.sort_values()).join(df['id'])

       date        id
0  20230106  46726004
1  20230108  46726004
2  20230107  46725082
3  20230108  46725082
4  20230108  46725081
5  20230108  46724162
6  20230108  46720662

Or without a join:

>>> df.set_index('id').groupby('id').transform(lambda x: x.sort_values()).reset_index()

         id      date
0  46726004  20230106
1  46726004  20230108
2  46725082  20230107
3  46725082  20230108
4  46725081  20230108
5  46724162  20230108
6  46720662  20230108

huangapple
  • Posted on January 9, 2023 at 14:32:37
  • If you repost, please keep the link to this article: https://go.coder-hub.com/75053839.html