2023-06-16 04:29:32

Python: How to create a unique dataframe from multiple dataframes with the same dimensions

Question

Given an empty dataframe with assigned column names:

import pandas as pd
colnames = ('ACCT', 'CTAT', 'AAAT', 'ATCG') * 3
df = pd.DataFrame(columns=colnames)

I want to loop over dataframes that have the structure below (two are shown for demonstration):

sample_df = pd.DataFrame()
sample_df['tetran'] = colnames
sample_df['Frequency'] = (423, 512, 25, 123, 632, 124, 614, 73, 14, 75, 311, 155)
conids = ("cl1_42", "cl1_41", "cl2_31")
rep_conids = [val for val in conids for _ in range(4)]
sample_df['contig_id'] = rep_conids
sample_df_2 = pd.DataFrame()
sample_df_2['tetran'] = colnames
sample_df_2['Frequency'] = (724, 132, 4, 102, 423, 402, 616, 734, 153, 751, 31, 55)
conids_2 = ("se1_51", "se1_21", "se2_53")
rep_conids_2 = [val for val in conids_2 for _ in range(4)]
sample_df_2['contig_id'] = rep_conids_2

The objective is:

Add each 'Frequency' value from each 'sample_df' to the corresponding 'tetran' column of 'df', and add a new column containing sample_df['contig_id'].

There are multiple 'sample_df' dataframes, so the desired output looks like this:

index	ACCT	CTAT	AAAT	ATCG
cl1_42	423	512	25	123
cl1_41	632	124	614	73
cl2_31	14	75	311	155
se1_51	724	132	4	102
se1_21	423	402	616	734
se2_53	153	751	31	55

I know how to do this in R, but I need it done in Python, so I cannot include what I tried because it is in R.

Thanks for your time

Answer 1

Score: 1

First, concatenate your dataframes with pd.concat, then pivot the result:

out = (pd.concat([sample_df, sample_df_2])
         .pivot(index='contig_id', columns='tetran', values='Frequency'))
print(out)
# Output
tetran     AAAT  ACCT  ATCG  CTAT
contig_id                        
cl1_41      614   632    73   124
cl1_42       25   423   123   512
cl2_31      311    14   155    75
se1_21      616   423   734   402
se1_51        4   724   102   132
se2_53       31   153    55   751
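
For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. It rebuilds the two sample frames from the question and applies the same concat-then-pivot step. The `make_sample` helper and the `TETRANS` and `long_df` names are illustrative assumptions, not part of the original code.

```python
import pandas as pd

TETRANS = ["ACCT", "CTAT", "AAAT", "ATCG"]


def make_sample(contig_ids, frequencies):
    """Hypothetical helper: build a long-format frame shaped like sample_df."""
    return pd.DataFrame({
        # The four tetranucleotides repeat once per contig.
        "tetran": TETRANS * len(contig_ids),
        "Frequency": list(frequencies),
        # Each contig id is repeated once per tetranucleotide.
        "contig_id": [cid for cid in contig_ids for _ in TETRANS],
    })


sample_df = make_sample(
    ["cl1_42", "cl1_41", "cl2_31"],
    [423, 512, 25, 123, 632, 124, 614, 73, 14, 75, 311, 155],
)
sample_df_2 = make_sample(
    ["se1_51", "se1_21", "se2_53"],
    [724, 132, 4, 102, 423, 402, 616, 734, 153, 751, 31, 55],
)

# Stack the long-format frames, then spread 'tetran' into columns.
long_df = pd.concat([sample_df, sample_df_2], ignore_index=True)
out = long_df.pivot(index="contig_id", columns="tetran", values="Frequency")
print(out)
```

Note that pivot returns both the row labels and the column labels in sorted order, which is why the result does not follow the order in which the contigs were listed.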

If you don't want the data to be sorted, use pivot_table with sort=False (the sort parameter requires pandas 1.3 or later):

out = (pd.concat([sample_df, sample_df_2])
         .pivot_table(index='contig_id', columns='tetran', values='Frequency', sort=False))
print(out)
# Output
tetran     ACCT  CTAT  AAAT  ATCG
contig_id                        
cl1_42      423   512    25   123
cl1_41      632   124   614    73
cl2_31       14    75   311   155
se1_51      724   132     4   102
se1_21      423   402   616   734
se2_53      153   751    31    55
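
Since the question mentions looping over many such dataframes, the following sketch shows one possible way to do that: collect the frames in a list, concatenate once, pivot, and then restore the first-appearance order with reindex instead of relying on sort=False. The `raw` list is a stand-in assumption for however the per-sample frames are actually produced, and `frames`, `long_df`, and `wide` are illustrative names.

```python
import pandas as pd

TETRANS = ["ACCT", "CTAT", "AAAT", "ATCG"]

# Assumed stand-in for the real data source: (contig ids, frequencies) per sample.
raw = [
    (["cl1_42", "cl1_41", "cl2_31"],
     [423, 512, 25, 123, 632, 124, 614, 73, 14, 75, 311, 155]),
    (["se1_51", "se1_21", "se2_53"],
     [724, 132, 4, 102, 423, 402, 616, 734, 153, 751, 31, 55]),
]

# Collect every long-format frame first; concatenating once is cheaper
# than growing a dataframe inside the loop.
frames = []
for contig_ids, freqs in raw:
    frames.append(pd.DataFrame({
        "tetran": TETRANS * len(contig_ids),
        "Frequency": freqs,
        "contig_id": [cid for cid in contig_ids for _ in TETRANS],
    }))

long_df = pd.concat(frames, ignore_index=True)

wide = (
    long_df.pivot(index="contig_id", columns="tetran", values="Frequency")
    # pivot sorts labels; reindex puts rows and columns back in input order.
    .reindex(index=long_df["contig_id"].unique(),
             columns=long_df["tetran"].unique())
    # Drop the 'tetran' label that pivot leaves on the column axis.
    .rename_axis(columns=None)
)
print(wide)
```

One design point to keep in mind: pivot raises a ValueError if any (contig_id, tetran) pair occurs more than once, whereas pivot_table aggregates duplicates (using the mean by default). If contig ids can repeat across samples, that difference determines which of the two is appropriate.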

Useful link: How can I pivot a dataframe?
