Python pandas: keep the first column's order unchanged while sorting the second column in ascending order
Question
Hi, I want to keep the order of the infoid column unchanged but sort the dates in increasing (ascending) order within each infoid.
Is that possible?
statisticsdate infoid
20230108 46726004
20230106 46726004
20230108 46725082
20230107 46725082
20230108 46725081
20230108 46724162
20230108 46720662
The result should look like this:
statisticsdate infoid
20230106 46726004
20230108 46726004
20230107 46725082
20230108 46725082
20230108 46725081
20230108 46724162
20230108 46720662
Answer 1
Score: 2
If it is acceptable to sort infoid in descending order and statisticsdate in ascending order, DataFrame.sort_values alone is enough:
df = df.sort_values(['infoid', 'statisticsdate'], ascending=[False, True], ignore_index=True)
print(df)
statisticsdate infoid
0 20230106 46726004
1 20230108 46726004
2 20230107 46725082
3 20230108 46725082
4 20230108 46725081
5 20230108 46724162
6 20230108 46720662
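For anyone who wants to reproduce this, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the question's sample, and the names `original_ids` and `out` are only illustrative. Note that this shortcut matches the requested result only because infoid already happens to appear in descending order in the sample; whether that holds for the real data is an assumption.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "statisticsdate": [20230108, 20230106, 20230108, 20230107,
                       20230108, 20230108, 20230108],
    "infoid": [46726004, 46726004, 46725082, 46725082,
               46725081, 46724162, 46720662],
})

original_ids = df["infoid"].tolist()

# infoid descending, statisticsdate ascending within each infoid.
out = df.sort_values(["infoid", "statisticsdate"],
                     ascending=[False, True], ignore_index=True)

# The shortcut is only valid if it left the infoid column exactly as it was.
print(out["infoid"].tolist() == original_ids)
print(out)
```

If the check prints False, the infoid order was changed and one of the group-based approaches is the safer choice.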
The first idea is to sort within each group using a custom lambda function, passing sort=False so that groupby keeps the groups in their original order. This solution is slow on larger DataFrames:
df['statisticsdate'] = (df.groupby('infoid', sort=False, group_keys=False)['statisticsdate']
                          .apply(lambda x: x.sort_values())
                          .to_numpy())
print(df)
statisticsdate infoid
0 20230106 46726004
1 20230108 46726004
2 20230107 46725082
3 20230108 46725082
4 20230108 46725081
5 20230108 46724162
6 20230108 46720662
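If the per-group lambda turns out to be too slow, one possible vectorized alternative (not part of the original answer) is to number the groups with pd.factorize and sort once with np.lexsort. This is only a sketch: it assumes each infoid occupies one contiguous block of rows, as in the sample, and the names `codes`, `dates` and `order` are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "statisticsdate": [20230108, 20230106, 20230108, 20230107,
                       20230108, 20230108, 20230108],
    "infoid": [46726004, 46726004, 46725082, 46725082,
               46725081, 46724162, 46720662],
})

# Integer code per infoid, numbered in order of first appearance.
codes, _ = pd.factorize(df["infoid"])

dates = df["statisticsdate"].to_numpy()

# np.lexsort treats the LAST key as the primary one:
# sort by group code first, then by date within each group.
order = np.lexsort((dates, codes))

# Positional assignment; relies on the contiguous-block assumption above.
df["statisticsdate"] = dates[order]
print(df)
```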
Or you can convert infoid to an ordered Categorical (with the categories in order of first appearance) and sort by both columns:
df['statisticsdate'] = (df.assign(infoid=pd.Categorical(df['infoid'],
                                                        ordered=True,
                                                        categories=df['infoid'].unique()))
                          .sort_values(['infoid', 'statisticsdate'])['statisticsdate']
                          # assign by position; a Series would be realigned on the index, undoing the sort
                          .to_numpy())
print(df)
statisticsdate infoid
0 20230106 46726004
1 20230108 46726004
2 20230107 46725082
3 20230108 46725082
4 20230108 46725081
5 20230108 46724162
6 20230108 46720662
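One detail to watch with this kind of pattern is that assigning a Series to a column aligns on the index rather than on position, so a sorted Series that still carries its original index labels is put straight back into its old order. The sketch below illustrates this on a two-row frame; the names `aligned` and `positional` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "statisticsdate": [20230108, 20230106],
    "infoid": [46726004, 46726004],
})

# Sorted values, but the index labels travel with them: index is now [1, 0].
sorted_dates = df["statisticsdate"].sort_values()

aligned = df.copy()
# Series assignment aligns on index labels, so the row order is unchanged.
aligned["statisticsdate"] = sorted_dates

positional = df.copy()
# A NumPy array has no index, so values are assigned by position.
positional["statisticsdate"] = sorted_dates.to_numpy()

print(aligned["statisticsdate"].tolist())     # [20230108, 20230106]
print(positional["statisticsdate"].tolist())  # [20230106, 20230108]
```

Resetting the index of the sorted result (for example with ignore_index=True in sort_values) has the same effect as converting to a NumPy array.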
If the same infoid can appear again later in a separate run of rows, it is necessary to convert infoid to consecutive group numbers first and then sort by both columns:
print(df)
statisticsdate infoid
0 20230108 46726004
1 20230106 46726004
2 20230108 46725082
3 20230107 46725082
4 20230108 46725081
5 20230108 46724162
6 20230108 46720662
7 20230108 46726004 <- second, unsorted run of 46726004
8 20230106 46726004
df['statisticsdate'] = (df.assign(infoid=df['infoid'].ne(df['infoid'].shift()).cumsum())
                          .sort_values(['infoid', 'statisticsdate'], ignore_index=True)['statisticsdate'])
print(df)
statisticsdate infoid
0 20230106 46726004
1 20230108 46726004
2 20230107 46725082
3 20230108 46725082
4 20230108 46725081
5 20230108 46724162
6 20230108 46720662
7 20230106 46726004
8 20230108 46726004
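The ne(shift()).cumsum() idiom used here gives every consecutive run of equal values its own integer label, so a value that reappears later is treated as a new group. A minimal sketch on a shortened, made-up id series (the names `ids`, `starts` and `run_id` are illustrative):

```python
import pandas as pd

ids = pd.Series([46726004, 46726004, 46725082, 46725082, 46726004, 46726004],
                name="infoid")

# True wherever the value differs from the previous row, i.e. at the start of a run.
# The first row is compared with NaN, so it always starts a run.
starts = ids.ne(ids.shift())

# The running total of run starts yields one integer label per consecutive run.
run_id = starts.cumsum()

print(run_id.tolist())  # [1, 1, 2, 2, 3, 3]
```

Because the labels increase from top to bottom, sorting by the label first keeps the runs in their original order and only reorders rows within each run.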
Answer 2
Score: 2
This should do the trick (here the columns are named date and id):
>>> df.groupby('id').transform(lambda x: x.sort_values()).join(df['id'])
date id
0 20230106 46726004
1 20230108 46726004
2 20230107 46725082
3 20230108 46725082
4 20230108 46725081
5 20230108 46724162
6 20230108 46720662
Or without a join:
>>> df.set_index('id').groupby('id').transform(lambda x: x.sort_values()).reset_index()
id date
0 46726004 20230106
1 46726004 20230108
2 46725082 20230107
3 46725082 20230108
4 46725081 20230108
5 46724162 20230108
6 46720662 20230108