How to merge two rows in a pandas DataFrame and save their indexes in a new column
Question
I want to merge rows in my input df_unique if the list in the one_one_3first column of one row is the same as the list in zero_zero_3first of another row, and vice versa (zero_zero_3first the same as one_one_3first), like rows 0 and 1 in the input df.
After merging, I want a new column holding the list of indexes of the merged rows, and the genes_count column updated with the sum over the merged rows.
To do that, I've created the columns one_zero and zero_one so that I can group rows under the desired conditions:
# create columns to be able to group rows
df_unique['one_zero'] = df_unique['one_one_3first'] + df_unique['zero_zero_3first']
df_unique['zero_one'] = df_unique['zero_zero_3first'] + df_unique['one_one_3first']
Here is my input df_unique with the created columns one_zero and zero_one used to group rows:
   one_one_3first               zero_zero_3first             genes_count  one_zero                                                zero_one
0  ['P1-12', 'P1-25', 'P1-28']  ['P1-22', 'P1-89', 'P1-92']  16           ['P1-12', 'P1-25', 'P1-28']['P1-22', 'P1-89', 'P1-92']  ['P1-22', 'P1-89', 'P1-92']['P1-12', 'P1-25', 'P1-28']
1  ['P1-22', 'P1-89', 'P1-92']  ['P1-12', 'P1-25', 'P1-28']  22           ['P1-22', 'P1-89', 'P1-92']['P1-12', 'P1-25', 'P1-28']  ['P1-12', 'P1-25', 'P1-28']['P1-22', 'P1-89', 'P1-92']
2  ['P1-26', 'P1-6', 'P1-92']   ['P1-12', 'P1-25', 'P1-28']  3            ['P1-26', 'P1-6', 'P1-92']['P1-12', 'P1-25', 'P1-28']   ['P1-12', 'P1-25', 'P1-28']['P1-26', 'P1-6', 'P1-92']
3  ['P1-12', 'P1-26', 'P1-89']  ['P1-25', 'P1-88', 'P1-90']  4            ['P1-12', 'P1-26', 'P1-89']['P1-25', 'P1-88', 'P1-90']  ['P1-25', 'P1-88', 'P1-90']['P1-12', 'P1-26', 'P1-89']
I grouped the rows under the desired conditions, but the last three lines, which compute the sum in the genes_count column, don't work correctly: the order of the output records differs from the input, and the genes count in the updated column for non-merged rows (e.g. 1 and 2) is incorrect:
# group rows
group = df_unique[['one_zero', 'zero_one']].apply(frozenset, axis=1)
df_unique_final = df_unique.groupby(group, as_index=False).first()
# try to update the genes_count column with the sum for grouped rows
genes_count_in_df_unique_final = df_unique.groupby(group, as_index=False, sort=False).agg({'genes_count': 'sum'})
df_unique_final = df_unique_final.drop(columns=['genes_count', 'one_zero', 'zero_one'])
df_unique_final['genes_count'] = genes_count_in_df_unique_final['genes_count']
At the moment the output looks like this:
   one_one_3first               zero_zero_3first             genes_count
0  ['P1-12', 'P1-25', 'P1-28']  ['P1-22', 'P1-89', 'P1-92']  38
1  ['P1-12', 'P1-90', 'P1-95']  ['P1-26', 'P1-88', 'P1-92']  3
2  ['P1-22', 'P1-6', 'P1-92']   ['P1-28', 'P1-88', 'P1-90']  4
So, my questions are:
- What should I change to keep the records in the output in the same order as in the input, so that the genes_count column gets the correct value for every row?
- How can I save the indexes of the grouped rows in a new column?
The final output should look like this:
   one_one_3first               zero_zero_3first             genes_count  idxs_list
0  ['P1-12', 'P1-25', 'P1-28']  ['P1-22', 'P1-89', 'P1-92']  38           0,1
1  ['P1-26', 'P1-6', 'P1-92']   ['P1-12', 'P1-25', 'P1-28']  3            2
2  ['P1-12', 'P1-26', 'P1-89']  ['P1-25', 'P1-88', 'P1-90']  4            3
I'd be grateful for any advice!
Answer 1
Score: 1
To keep the indexes, the easiest approach is to turn the index into a regular column and then do whatever you need:
df = df.reset_index(drop=False).rename(columns={'index': "original_index"})
Once you have made all the changes you need, you can simply sort by that column with df.sort_values("original_index").
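As a rough illustration of this idea, here is a minimal sketch on a toy frame. The data, the non-default index values, and the descending sort used as the "change" are all made up for the example; only reset_index, rename, and sort_values come from the answer itself.

```python
import pandas as pd

# Toy frame with a non-default index; the values are illustrative only.
df = pd.DataFrame({"genes_count": [16, 22, 3, 4]}, index=[10, 11, 12, 13])

# Move the (unnamed) index into a regular column so it survives later changes.
df = df.reset_index(drop=False).rename(columns={"index": "original_index"})

# Stand-in for "whatever you need": any operation that reorders the rows.
shuffled = df.sort_values("genes_count", ascending=False)

# Restore the input order by sorting on the saved index column.
restored = shuffled.sort_values("original_index").reset_index(drop=True)
print(restored)
```

Note that reset_index names the new column "index" only when the original index has no name; for a named index the rename key would have to be adjusted.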
Answer 2
Score: 0
I dealt with it this way:
# create column 'df_unique_index' to be able to keep row indexes after merging
df_unique = df_unique.reset_index(drop=False).rename(columns={'index': 'df_unique_index'})
# merge records with the same 1/1 and 0/0 samples crosswise, keeping the indexes of merged rows and summing the gene counts
group = df_unique[['one_zero', 'zero_one']].apply(frozenset, axis=1)
df_unique_final_1 = (
    df_unique.groupby(group)
    .agg({'df_unique_index': list, 'one_one_3first': 'first',
          'zero_zero_3first': 'first', 'genes_count': 'sum'})
    .reset_index(drop=True)
    .rename(columns={'one_one_3first': 'one_one', 'zero_zero_3first': 'zero_zero'})
    .reindex(columns=['one_one', 'zero_zero', 'df_unique_index', 'genes_count'])
)
and obtained the desired result:
   one_one                zero_zero              df_unique_index  genes_count
0  [P1-12, P1-25, P1-28]  [P1-22, P1-89, P1-92]  [0, 1]           38
1  [P1-12, P1-90, P1-95]  [P1-26, P1-88, P1-92]  [538]            1
2  [P1-22, P1-6, P1-92]   [P1-28, P1-88, P1-90]  [539, 812]       9
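For the ordering part of the question, one possible variation is sketched below on the four sample rows from the question. It is not the code from this answer. It assumes the list-like cells are stored as strings (as the printed one_zero column suggests; real Python lists are unhashable and would first need converting, e.g. to tuples), it builds the frozenset key directly from the two original columns instead of from one_zero/zero_one, and it passes sort=False so that groups come out in order of first appearance. The idxs_list column name is taken from the desired output in the question.

```python
import pandas as pd

# The four sample rows from the question; cells are assumed to be strings.
df_unique = pd.DataFrame({
    "one_one_3first": [
        "['P1-12', 'P1-25', 'P1-28']",
        "['P1-22', 'P1-89', 'P1-92']",
        "['P1-26', 'P1-6', 'P1-92']",
        "['P1-12', 'P1-26', 'P1-89']",
    ],
    "zero_zero_3first": [
        "['P1-22', 'P1-89', 'P1-92']",
        "['P1-12', 'P1-25', 'P1-28']",
        "['P1-12', 'P1-25', 'P1-28']",
        "['P1-25', 'P1-88', 'P1-90']",
    ],
    "genes_count": [16, 22, 3, 4],
})

# Order-independent key: the unordered pair of the two cell values,
# so (a, b) and (b, a) fall into the same group.
group = df_unique[["one_one_3first", "zero_zero_3first"]].apply(frozenset, axis=1)

result = (
    df_unique.reset_index()          # expose the row index as column "index"
    .groupby(group, sort=False)      # sort=False keeps first-appearance order
    .agg(
        one_one_3first=("one_one_3first", "first"),
        zero_zero_3first=("zero_zero_3first", "first"),
        genes_count=("genes_count", "sum"),
        idxs_list=("index", list),   # collect the original row indexes
    )
    .reset_index(drop=True)
)
print(result)
```

Doing everything in a single agg call avoids assigning a column from a second, differently ordered groupby, which is where the mismatched genes_count values in the question can come from (the first groupby there uses the default sort=True, the second sort=False). On the sample rows this should give genes_count values 38, 3, 4 with index lists [0, 1], [2], [3].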