July 3, 2023, 22:02:34

How to merge two rows in pandas dataframe and save its indexes in a new column

Question

I want to merge rows in my input df_unique IF the list from one_one_3first column is the same as in zero_zero_3first AND inversely too (zero_zero_3first the same as one_one_3first) --> like the 0 and 1 row in the input df.

After merging, I want to receive a list of indexes of merged rows in a new column and update the genes_count column with the sum for merged rows.

To do that, I've created columns one_zero and zero_one to be able to group rows under desired conditions:

# create columns to be able to group rows
df_unique['one_zero'] = df_unique['one_one_3first'] + df_unique['zero_zero_3first']
df_unique['zero_one'] = df_unique['zero_zero_3first'] + df_unique['one_one_3first']

Here is my input df_unique, with the columns one_zero and zero_one created to group the rows:

                  one_one_3first             zero_zero_3first  genes_count                                                one_zero                                                zero_one
0    ['P1-12', 'P1-25', 'P1-28']  ['P1-22', 'P1-89', 'P1-92']           16  ['P1-12', 'P1-25', 'P1-28']['P1-22', 'P1-89', 'P1-92']  ['P1-22', 'P1-89', 'P1-92']['P1-12', 'P1-25', 'P1-28']
1    ['P1-22', 'P1-89', 'P1-92']  ['P1-12', 'P1-25', 'P1-28']           22  ['P1-22', 'P1-89', 'P1-92']['P1-12', 'P1-25', 'P1-28']  ['P1-12', 'P1-25', 'P1-28']['P1-22', 'P1-89', 'P1-92']
2     ['P1-26', 'P1-6', 'P1-92']  ['P1-12', 'P1-25', 'P1-28']            3   ['P1-26', 'P1-6', 'P1-92']['P1-12', 'P1-25', 'P1-28']   ['P1-12', 'P1-25', 'P1-28']['P1-26', 'P1-6', 'P1-92']
3    ['P1-12', 'P1-26', 'P1-89']  ['P1-25', 'P1-88', 'P1-90']            4  ['P1-12', 'P1-26', 'P1-89']['P1-25', 'P1-88', 'P1-90']  ['P1-25', 'P1-88', 'P1-90']['P1-12', 'P1-26', 'P1-89']

I grouped the rows under the desired conditions, but the last three lines, which compute the sum in the genes_count column, don't work correctly: the order of the output records differs from the input, and the genes_count values in the updated column for the non-merged rows (e.g. rows 1 and 2) are incorrect:

# group rows
group = df_unique[['one_zero', 'zero_one']].apply(frozenset, axis=1)
df_unique_final = df_unique.groupby(group, as_index=False).first()
# ?try to update genes_count column with the sum for grouped rows?
genes_count_in_df_unique_final = df_unique.groupby(group, as_index=False, sort=False).agg({'genes_count': 'sum'})
df_unique_final = df_unique_final.drop(columns=['genes_count', 'one_zero', 'zero_one'])
df_unique_final['genes_count'] = genes_count_in_df_unique_final['genes_count']

At the moment, the output looks like this:

                  one_one_3first             zero_zero_3first  genes_count
0    ['P1-12', 'P1-25', 'P1-28']  ['P1-22', 'P1-89', 'P1-92']           38
1    ['P1-12', 'P1-90', 'P1-95']  ['P1-26', 'P1-88', 'P1-92']            3
2     ['P1-22', 'P1-6', 'P1-92']  ['P1-28', 'P1-88', 'P1-90']            4

So, my questions are:

What should I change to keep the records in the output in the same order as in the input, so that the genes_count column holds the correct value for every row?

and

How can I save the indexes of the grouped rows in a new column,

so that the final output looks like this:

                  one_one_3first             zero_zero_3first  genes_count    idxs_list
0    ['P1-12', 'P1-25', 'P1-28']  ['P1-22', 'P1-89', 'P1-92']           38          0,1
1     ['P1-26', 'P1-6', 'P1-92']  ['P1-12', 'P1-25', 'P1-28']            3            2
2    ['P1-12', 'P1-26', 'P1-89']  ['P1-25', 'P1-88', 'P1-90']            4            3

I'd be grateful for any advice!

Answer 1

Score: 1

To keep the indexes, the easiest way is to turn the index into a regular column and then do whatever you need.

df = df.reset_index(drop=False).rename(columns={'index': "original_index"})

Then, once you have made all the changes you need, you can simply sort by that column with df.sort_values("original_index").
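
As a rough illustration of this idea on the four sample rows from the question, here is a minimal sketch. It assumes the list-like cells are stored as strings (the concatenated one_zero values in the question suggest so), and it groups on a frozenset of the two original columns, which is equivalent to the asker's one_zero/zero_one key. The names key and merged are placeholders, not from the question.

```python
import pandas as pd

# Sample rows copied from the question; cells are assumed to be strings.
a = "['P1-12', 'P1-25', 'P1-28']"
b = "['P1-22', 'P1-89', 'P1-92']"
c = "['P1-26', 'P1-6', 'P1-92']"
d = "['P1-12', 'P1-26', 'P1-89']"
e = "['P1-25', 'P1-88', 'P1-90']"
df = pd.DataFrame({
    "one_one_3first": [a, b, c, d],
    "zero_zero_3first": [b, a, a, e],
    "genes_count": [16, 22, 3, 4],
})

# Keep the original row labels as a regular column.
df = df.reset_index(drop=False).rename(columns={"index": "original_index"})

# Unordered pair of the two columns: rows 0 and 1 share the same key.
key = df[["one_one_3first", "zero_zero_3first"]].apply(frozenset, axis=1)

merged = (
    df.groupby(key)
      .agg({"original_index": "first",      # first original label in each group
            "one_one_3first": "first",
            "zero_zero_3first": "first",
            "genes_count": "sum"})
      .sort_values("original_index")        # restore the input order
      .reset_index(drop=True)
)
print(merged)
```

Because every aggregate is computed in a single agg call, the summed genes_count cannot drift out of alignment with the other columns, whatever order groupby returns the groups in.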

Answer 2

Score: 0

I solved it this way:

# create column 'df_unique_index' to be able to keep row indexes after merging
df_unique = df_unique.reset_index(drop=False).rename(columns={'index': "df_unique_index"})
# merge records with the same 1/1 and 0/0 samples crosswise with keeping indexes of merged rows and counting the sum of genes
group = df_unique[['one_zero', 'zero_one']].apply(frozenset, axis=1)
df_unique_final_1 = df_unique.groupby(group).agg({'df_unique_index': (lambda x: list(x)), 'one_one_3first': 'first',
                                                  'zero_zero_3first': 'first', 'genes_count': 'sum'}).reset_index()
df_unique_final_1 = df_unique_final_1.drop(columns=['index']).rename(columns={'one_one_3first': 'one_one',
                                                                              'zero_zero_3first': 'zero_zero'}).\
    reindex(columns=['one_one', 'zero_zero', 'df_unique_index', 'genes_count'])

and obtained the desired result:

                   one_one              zero_zero df_unique_index  genes_count
0    [P1-12, P1-25, P1-28]  [P1-22, P1-89, P1-92]          [0, 1]           38
1    [P1-12, P1-90, P1-95]  [P1-26, P1-88, P1-92]           [538]            1
2     [P1-22, P1-6, P1-92]  [P1-28, P1-88, P1-90]      [539, 812]            9
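
A possible variation on this approach, sketched under a few assumptions: the cells are strings, the data are the four sample rows from the question's input, and the merged indexes are joined into a comma-separated string to match the idxs_list column the question asks for. Passing sort=False to groupby keeps the groups in order of first appearance. The question's snippet mixes a default-sorted groupby with a sort=False one and then assigns by position, which is a likely cause of the misaligned genes_count values. The name result is a placeholder.

```python
import pandas as pd

# Sample rows copied from the question; cells are assumed to be strings.
rows = [
    ("['P1-12', 'P1-25', 'P1-28']", "['P1-22', 'P1-89', 'P1-92']", 16),
    ("['P1-22', 'P1-89', 'P1-92']", "['P1-12', 'P1-25', 'P1-28']", 22),
    ("['P1-26', 'P1-6', 'P1-92']", "['P1-12', 'P1-25', 'P1-28']", 3),
    ("['P1-12', 'P1-26', 'P1-89']", "['P1-25', 'P1-88', 'P1-90']", 4),
]
df_unique = pd.DataFrame(
    rows, columns=["one_one_3first", "zero_zero_3first", "genes_count"]
)

# Helper columns from the question, then the index kept as a column.
df_unique["one_zero"] = df_unique["one_one_3first"] + df_unique["zero_zero_3first"]
df_unique["zero_one"] = df_unique["zero_zero_3first"] + df_unique["one_one_3first"]
df_unique = df_unique.reset_index(drop=False).rename(
    columns={"index": "df_unique_index"}
)

group = df_unique[["one_zero", "zero_one"]].apply(frozenset, axis=1)

result = (
    df_unique.groupby(group, sort=False)   # groups stay in input order
    .agg(
        one_one_3first=("one_one_3first", "first"),
        zero_zero_3first=("zero_zero_3first", "first"),
        genes_count=("genes_count", "sum"),
        # join the original row labels of each group, e.g. "0,1"
        idxs_list=("df_unique_index", lambda s: ",".join(map(str, s))),
    )
    .reset_index(drop=True)
)
print(result)
```

If a real list of indexes is preferred over a string, the lambda can return list(s) instead, as in the answer above.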
