How to merge two rows in pandas dataframe and save its indexes in a new column

Question

I want to merge rows in my input df_unique if the list in the one_one_3first column of one row is the same as the list in the zero_zero_3first column of another row, and vice versa (the first row's zero_zero_3first is the same as the other row's one_one_3first), as with rows 0 and 1 in the input df.

After merging, I want a new column holding the list of indexes of the merged rows, and the genes_count column updated with the sum over the merged rows.

To do that, I've created columns one_zero and zero_one to be able to group rows under desired conditions:

    # create columns to be able to group rows
    df_unique['one_zero'] = df_unique['one_one_3first'] + df_unique['zero_zero_3first']
    df_unique['zero_one'] = df_unique['zero_zero_3first'] + df_unique['one_one_3first']

Here is my input df_unique with the created columns one_zero and zero_one used to group rows:

       one_one_3first               zero_zero_3first             genes_count  one_zero                                                zero_one
    0  ['P1-12', 'P1-25', 'P1-28']  ['P1-22', 'P1-89', 'P1-92']  16           ['P1-12', 'P1-25', 'P1-28']['P1-22', 'P1-89', 'P1-92']  ['P1-22', 'P1-89', 'P1-92']['P1-12', 'P1-25', 'P1-28']
    1  ['P1-22', 'P1-89', 'P1-92']  ['P1-12', 'P1-25', 'P1-28']  22           ['P1-22', 'P1-89', 'P1-92']['P1-12', 'P1-25', 'P1-28']  ['P1-12', 'P1-25', 'P1-28']['P1-22', 'P1-89', 'P1-92']
    2  ['P1-26', 'P1-6', 'P1-92']   ['P1-12', 'P1-25', 'P1-28']  3            ['P1-26', 'P1-6', 'P1-92']['P1-12', 'P1-25', 'P1-28']   ['P1-12', 'P1-25', 'P1-28']['P1-26', 'P1-6', 'P1-92']
    3  ['P1-12', 'P1-26', 'P1-89']  ['P1-25', 'P1-88', 'P1-90']  4            ['P1-12', 'P1-26', 'P1-89']['P1-25', 'P1-88', 'P1-90']  ['P1-25', 'P1-88', 'P1-90']['P1-12', 'P1-26', 'P1-89']

I grouped the rows under the desired conditions, but the last three lines, which calculate the sum in the genes_count column, don't work correctly (the order of the output records differs from the input, and the updated genes_count for non-merged rows, e.g. 1 and 2, is incorrect):

    # group rows
    group = df_unique[['one_zero', 'zero_one']].apply(frozenset, axis=1)
    df_unique_final = df_unique.groupby(group, as_index=False).first()
    # ?try to update genes_count column with the sum for grouped rows?
    genes_count_in_df_unique_final = df_unique.groupby(group, as_index=False, sort=False).agg({'genes_count': 'sum'})
    df_unique_final = df_unique_final.drop(columns=['genes_count', 'one_zero', 'zero_one'])
    df_unique_final['genes_count'] = genes_count_in_df_unique_final['genes_count']

At the moment, the output looks like this:

       one_one_3first               zero_zero_3first             genes_count
    0  ['P1-12', 'P1-25', 'P1-28']  ['P1-22', 'P1-89', 'P1-92']  38
    1  ['P1-12', 'P1-90', 'P1-95']  ['P1-26', 'P1-88', 'P1-92']  3
    2  ['P1-22', 'P1-6', 'P1-92']   ['P1-28', 'P1-88', 'P1-90']  4

So, my questions are:

  1. What should I change to keep the records in the same order in the output as in the input, so that the genes_count column holds the correct value for every row?

  2. How can I save the indexes of the grouped rows in a new column?

The final output should look like this:

       one_one_3first               zero_zero_3first             genes_count  idxs_list
    0  ['P1-12', 'P1-25', 'P1-28']  ['P1-22', 'P1-89', 'P1-92']  38           0,1
    1  ['P1-26', 'P1-6', 'P1-92']   ['P1-12', 'P1-25', 'P1-28']  3            2
    2  ['P1-12', 'P1-26', 'P1-89']  ['P1-25', 'P1-88', 'P1-90']  4            3

I'd be grateful for any advice!

Answer 1

Score: 1

To keep the indexes, the easiest approach is to turn the index into a regular column, then do whatever you need.

    df = df.reset_index(drop=False).rename(columns={'index': 'original_index'})

Then, once you have made all the changes you need, you can simply sort by that column with df.sort_values("original_index").
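
As a rough illustration of that idea (not taken from the original answer), here is a minimal sketch on a made-up frame. The column `key` and the sample values are hypothetical; the sketch assumes a single grouping column and uses the smallest original index of each group to restore the input order after `groupby` has sorted the groups by key.

```python
import pandas as pd

# Hypothetical toy data: rows with index 10 and 12 share the same key.
df = pd.DataFrame(
    {"key": ["b", "a", "b", "c"], "genes_count": [16, 3, 22, 4]},
    index=[10, 11, 12, 13],
)

# Turn the index into a regular column so it survives the aggregation.
df = df.reset_index(drop=False).rename(columns={"index": "original_index"})

# groupby sorts groups by key by default ("a", "b", "c"), which changes
# the row order; sorting by the kept index column restores it.
out = (
    df.groupby("key", as_index=False)
    .agg(
        original_index=("original_index", "min"),
        genes_count=("genes_count", "sum"),
    )
    .sort_values("original_index")
    .reset_index(drop=True)
)

print(out)
```

Passing `sort=False` to `groupby` is another way to keep groups in order of first appearance, in which case the final `sort_values` is not needed.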

Answer 2

Score: 0

I solved it this way:

    # create column 'df_unique_index' to be able to keep row indexes after merging
    df_unique = df_unique.reset_index(drop=False).rename(columns={'index': 'df_unique_index'})
    # merge records with the same 1/1 and 0/0 samples crosswise, keeping the
    # indexes of merged rows and counting the sum of genes
    group = df_unique[['one_zero', 'zero_one']].apply(frozenset, axis=1)
    df_unique_final_1 = df_unique.groupby(group).agg({
        'df_unique_index': lambda x: list(x),
        'one_one_3first': 'first',
        'zero_zero_3first': 'first',
        'genes_count': 'sum',
    }).reset_index()
    df_unique_final_1 = (
        df_unique_final_1.drop(columns=['index'])
        .rename(columns={'one_one_3first': 'one_one', 'zero_zero_3first': 'zero_zero'})
        .reindex(columns=['one_one', 'zero_zero', 'df_unique_index', 'genes_count'])
    )

This gave the desired result:

       one_one                zero_zero              df_unique_index  genes_count
    0  [P1-12, P1-25, P1-28]  [P1-22, P1-89, P1-92]  [0, 1]           38
    1  [P1-12, P1-90, P1-95]  [P1-26, P1-88, P1-92]  [538]            1
    2  [P1-22, P1-6, P1-92]   [P1-28, P1-88, P1-90]  [539, 812]       9
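
For reference, here is a self-contained sketch of the same approach run on the four sample rows from the question. It is only one possible variant, under some assumptions: the two list columns are taken to hold strings (as the printed input suggests), and the helper `pair_key` and the column name `original_index` are illustrative names that do not appear in the original posts. Instead of the frozenset it builds an order-independent string key, and it passes `sort=False` so that groups keep the order of their first appearance.

```python
import pandas as pd

# Sample rows from the question (list columns assumed to be strings).
df_unique = pd.DataFrame(
    {
        "one_one_3first": [
            "['P1-12', 'P1-25', 'P1-28']",
            "['P1-22', 'P1-89', 'P1-92']",
            "['P1-26', 'P1-6', 'P1-92']",
            "['P1-12', 'P1-26', 'P1-89']",
        ],
        "zero_zero_3first": [
            "['P1-22', 'P1-89', 'P1-92']",
            "['P1-12', 'P1-25', 'P1-28']",
            "['P1-12', 'P1-25', 'P1-28']",
            "['P1-25', 'P1-88', 'P1-90']",
        ],
        "genes_count": [16, 22, 3, 4],
    }
)


def pair_key(row):
    """Hypothetical helper: same key for (a, b) and (b, a)."""
    pair = sorted([str(row["one_one_3first"]), str(row["zero_zero_3first"])])
    return "|".join(pair)


# Keep the original index as a regular column before grouping.
df = df_unique.reset_index().rename(columns={"index": "original_index"})
key = df.apply(pair_key, axis=1)

merged = (
    df.groupby(key, sort=False)  # sort=False keeps first-appearance order
    .agg(
        one_one_3first=("one_one_3first", "first"),
        zero_zero_3first=("zero_zero_3first", "first"),
        genes_count=("genes_count", "sum"),
        idxs_list=("original_index", lambda s: ",".join(map(str, s))),
    )
    .reset_index(drop=True)
)

print(merged)
```

On this sample, rows 0 and 1 are merged into one record with genes_count 38 and idxs_list "0,1", while rows 2 and 3 stay separate, which matches the output requested in the question.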

huangapple
  • Published on July 3, 2023 at 22:02:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76605514.html