How to merge two rows in pandas dataframe and save its indexes in a new column

Question

I want to merge rows in my input df_unique if the value in the one_one_3first column of one row equals the value in the zero_zero_3first column of another row, and vice versa (its zero_zero_3first equals the other row's one_one_3first), as with rows 0 and 1 in the input df.

After merging, I want to receive a list of indexes of merged rows in a new column and update the genes_count column with the sum for merged rows.

To do that, I've created columns one_zero and zero_one to be able to group rows under desired conditions:

# create columns to be able to group rows
df_unique['one_zero'] = df_unique['one_one_3first'] + df_unique['zero_zero_3first']
df_unique['zero_one'] = df_unique['zero_zero_3first'] + df_unique['one_one_3first']

Here is my input df_unique with the created columns one_zero and zero_one used to group rows:

                  one_one_3first             zero_zero_3first  genes_count                                                one_zero                                                zero_one
0    ['P1-12', 'P1-25', 'P1-28']  ['P1-22', 'P1-89', 'P1-92']           16  ['P1-12', 'P1-25', 'P1-28']['P1-22', 'P1-89', 'P1-92']  ['P1-22', 'P1-89', 'P1-92']['P1-12', 'P1-25', 'P1-28']
1    ['P1-22', 'P1-89', 'P1-92']  ['P1-12', 'P1-25', 'P1-28']           22  ['P1-22', 'P1-89', 'P1-92']['P1-12', 'P1-25', 'P1-28']  ['P1-12', 'P1-25', 'P1-28']['P1-22', 'P1-89', 'P1-92']
2     ['P1-26', 'P1-6', 'P1-92']  ['P1-12', 'P1-25', 'P1-28']            3   ['P1-26', 'P1-6', 'P1-92']['P1-12', 'P1-25', 'P1-28']   ['P1-12', 'P1-25', 'P1-28']['P1-26', 'P1-6', 'P1-92']
3    ['P1-12', 'P1-26', 'P1-89']  ['P1-25', 'P1-88', 'P1-90']            4  ['P1-12', 'P1-26', 'P1-89']['P1-25', 'P1-88', 'P1-90']  ['P1-25', 'P1-88', 'P1-90']['P1-12', 'P1-26', 'P1-89']

I grouped the rows under the desired conditions, but the last three lines, which calculate the sum in the genes_count column, don't work correctly: the order of the output records differs from the input, and the genes count in the updated column is incorrect for non-merged rows (e.g. rows 1 and 2):

# group rows
group = df_unique[['one_zero', 'zero_one']].apply(frozenset, axis=1)
df_unique_final = df_unique.groupby(group, as_index=False).first()

# try to update the genes_count column with the sum for grouped rows
genes_count_in_df_unique_final = df_unique.groupby(group, as_index=False, sort=False).agg({'genes_count': 'sum'})
df_unique_final = df_unique_final.drop(columns=['genes_count', 'one_zero', 'zero_one'])
df_unique_final['genes_count'] = genes_count_in_df_unique_final['genes_count']

At the moment, the output looks like this:

                  one_one_3first             zero_zero_3first  genes_count
0    ['P1-12', 'P1-25', 'P1-28']  ['P1-22', 'P1-89', 'P1-92']           38
1    ['P1-12', 'P1-90', 'P1-95']  ['P1-26', 'P1-88', 'P1-92']            3
2     ['P1-22', 'P1-6', 'P1-92']  ['P1-28', 'P1-88', 'P1-90']            4

So, my questions are:

  1. What should I change to keep the records in the output in the same order as in the input, so that the genes_count column holds the correct value for every row?

and

  2. How can I save the indexes of the grouped rows in a new column?

The final output should look like this:

                  one_one_3first             zero_zero_3first  genes_count    idxs_list
0    ['P1-12', 'P1-25', 'P1-28']  ['P1-22', 'P1-89', 'P1-92']           38          0,1
1     ['P1-26', 'P1-6', 'P1-92']  ['P1-12', 'P1-25', 'P1-28']            3            2
2    ['P1-12', 'P1-26', 'P1-89']  ['P1-25', 'P1-88', 'P1-90']            4            3

I would be grateful for any advice!

Answer 1

Score: 1

To keep the indexes, the easiest way is to turn the index into a regular column and then do whatever you want.

df = df.reset_index(drop=False).rename(columns={'index': "original_index"})

Then, once you have made all the changes you need, you can restore the original order by sorting on that column with df.sort_values("original_index")
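Applied to the question's sample rows, this suggestion could look like the sketch below. It assumes the list-like cells are stored as plain strings (lists would not be hashable inside a frozenset), and the names key, merged and first_index are illustrative, not from the original post.

```python
import pandas as pd

# Rows copied from the question's sample; cells are assumed to be plain strings.
df_unique = pd.DataFrame(
    {
        "one_one_3first": [
            "['P1-12', 'P1-25', 'P1-28']",
            "['P1-22', 'P1-89', 'P1-92']",
            "['P1-26', 'P1-6', 'P1-92']",
            "['P1-12', 'P1-26', 'P1-89']",
        ],
        "zero_zero_3first": [
            "['P1-22', 'P1-89', 'P1-92']",
            "['P1-12', 'P1-25', 'P1-28']",
            "['P1-12', 'P1-25', 'P1-28']",
            "['P1-25', 'P1-88', 'P1-90']",
        ],
        "genes_count": [16, 22, 3, 4],
    }
)

# Keep the original index as a regular column, as suggested above.
df = df_unique.reset_index(drop=False).rename(columns={"index": "original_index"})

# A frozenset ignores order, so (a, b) and (b, a) map to the same key.
key = df[["one_one_3first", "zero_zero_3first"]].apply(frozenset, axis=1)

# Compute every output column in a single agg call so the rows stay aligned.
merged = df.groupby(key, sort=False).agg(
    first_index=("original_index", "min"),
    one_one_3first=("one_one_3first", "first"),
    zero_zero_3first=("zero_zero_3first", "first"),
    genes_count=("genes_count", "sum"),
    idxs_list=("original_index", lambda s: ",".join(map(str, s))),
)

# Restore the input order using the smallest original index in each group.
merged = (
    merged.sort_values("first_index")
    .drop(columns="first_index")
    .reset_index(drop=True)
)
print(merged)
```

One likely reason the question's snippet misaligns is that its two groupby calls use different sort settings (the default sort=True for first() and sort=False for the sum), so the two results need not list the groups in the same order. Doing all aggregations in one call sidesteps that.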

Answer 2

Score: 0

I dealt with it this way:

# create column 'df_unique_index' to be able to keep row indexes after merging
df_unique = df_unique.reset_index(drop=False).rename(columns={'index': "df_unique_index"})

# merge records whose 1/1 and 0/0 samples match crosswise, keeping the indexes of the merged rows and summing genes_count
group = df_unique[['one_zero', 'zero_one']].apply(frozenset, axis=1)

df_unique_final_1 = df_unique.groupby(group).agg({'df_unique_index': (lambda x: list(x)), 'one_one_3first': 'first',
                                                  'zero_zero_3first': 'first','genes_count': 'sum'}).reset_index()

df_unique_final_1 = df_unique_final_1.drop(columns=['index']).rename(columns={'one_one_3first': 'one_one',
                                                                              'zero_zero_3first': 'zero_zero'}).\
    reindex(columns=['one_one', 'zero_zero', 'df_unique_index', 'genes_count'])

and obtained the desired result:

                   one_one              zero_zero df_unique_index  genes_count
0    [P1-12, P1-25, P1-28]  [P1-22, P1-89, P1-92]          [0, 1]           38
1    [P1-12, P1-90, P1-95]  [P1-26, P1-88, P1-92]           [538]            1
2     [P1-22, P1-6, P1-92]  [P1-28, P1-88, P1-90]      [539, 812]            9
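A variant of the same idea that skips the one_zero and zero_one helper columns is sketched below. It assumes the cells are plain strings, uses the four sample rows from the question rather than the full data behind the output above, and builds the key by sorting each row's pair of values with NumPy; the names cols, pair and result are illustrative.

```python
import numpy as np
import pandas as pd

# Rows copied from the question's sample; cells are assumed to be plain strings.
df_unique = pd.DataFrame(
    {
        "one_one_3first": [
            "['P1-12', 'P1-25', 'P1-28']",
            "['P1-22', 'P1-89', 'P1-92']",
            "['P1-26', 'P1-6', 'P1-92']",
            "['P1-12', 'P1-26', 'P1-89']",
        ],
        "zero_zero_3first": [
            "['P1-22', 'P1-89', 'P1-92']",
            "['P1-12', 'P1-25', 'P1-28']",
            "['P1-12', 'P1-25', 'P1-28']",
            "['P1-25', 'P1-88', 'P1-90']",
        ],
        "genes_count": [16, 22, 3, 4],
    }
)

cols = ["one_one_3first", "zero_zero_3first"]

# Sort the two values within each row so (a, b) and (b, a) give the same pair.
pair = np.sort(df_unique[cols].to_numpy(dtype=object), axis=1)

result = (
    df_unique.reset_index()  # exposes the original index as an "index" column
    .groupby([pair[:, 0], pair[:, 1]], sort=False)  # sort=False keeps input order
    .agg(
        one_one=("one_one_3first", "first"),
        zero_zero=("zero_zero_3first", "first"),
        df_unique_index=("index", list),
        genes_count=("genes_count", "sum"),
    )
    .reset_index(drop=True)
)
print(result)
```

Passing sort=False keeps the groups in order of first appearance, which is what the question asked for; the default sort=True orders them by key instead.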

huangapple
  • This article was posted on July 3, 2023 at 22:02:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76605514.html