英文:
Python: How to create a unique dataframe from multiple dataframes with the same dimensions
问题
给定一个带有指定列名的空数据框:
colnames = ('ACCT', 'CTAT', 'AAAT', 'ATCG')*3
df = pd.DataFrame(columns=colnames)
我想要循环遍历具有以下结构的数据框(以2个示例为例):
sample_df = pd.DataFrame()
sample_df['tetran'] = colnames
sample_df['Frequency'] = (423, 512, 25, 123, 632, 124, 614, 73, 14, 75, 311, 155)
conids = ("cl1_42", "cl1_41", "cl2_31")
rep_conids = [val for val in conids for _ in range(4)]
sample_df['contig_id'] = rep_conids
sample_df_2 = pd.DataFrame()
sample_df_2['tetran'] = colnames
sample_df_2['Frequency'] = (724, 132, 4, 102, 423, 402, 616, 734, 153, 751, 31, 55)
conids_2 = ("se1_51", "se1_21", "se2_53")
rep_conids_2 = [val for val in conids_2 for _ in range(4)]
sample_df_2['contig_id'] = rep_conids_2
目标是:
- 将'sample_df'中的每个'Frequency'值添加到'df'的相应'tetraN'值中,并添加一个新列作为'sample_df'的['contig_id']。
有多个'sample_df'数据框,所以期望输出如下:
index | ACCT | CTAT | AAAT | ATCG |
---|---|---|---|---|
cl1_42 | 423 | 512 | 25 | 123 |
cl1_41 | 632 | 124 | 614 | 73 |
cl2_31 | 14 | 75 | 311 | 155 |
se1_51 | 724 | 132 | 4 | 102 |
se1_21 | 423 | 402 | 616 | 734 |
se2_53 | 153 | 751 | 31 | 55 |
英文:
given an empty dataframe with assigned column names :
colnames = ('ACCT', 'CTAT', 'AAAT', 'ATCG')*3
df = pd.DataFrame(columns=colnames)
I want to loop over dataframes which have the below structure: (giving 2 for demostration)
sample_df = pd.DataFrame()
sample_df['tetran'] = colnames
sample_df['Frequency'] = (423, 512, 25, 123,632,124,614,73,14,75,311,155)
conids = ("cl1_42", "cl1_41", "cl2_31")
rep_conids = [val for val in conids for _ in range(4)]
sample_df['contig_id'] = rep_conids
sample_df_2 = pd.DataFrame()
sample_df_2['tetran'] = colnames
sample_df_2['Frequency'] = (724, 132, 4, 102,423,402,616,734,153,751,31,55)
conids_2 = ("se1_51", "se1_21", "se2_53")
rep_conids_2 = [val for val in conids_2 for _ in range(4)]
sample_df_2['contig_id'] = rep_conids_2
The objective is:
- Add each 'Frequency' value from the 'sample_df's to the corresponding 'tetraN' value of the 'df' and add a new column to be the sample_df['contig_id']
There are multiple 'sample_df' dataframes , so this is the idea of the desired output:
index | ACCT | CTAT | AAAT | ATCG |
---|---|---|---|---|
cl1_42 | 423 | 512 | 25 | 123 |
cl1_41 | 632 | 124 | 614 | 73 |
cl2_31 | 14 | 75 | 311 | 155 |
se1_51 | 724 | 132 | 4 | 102 |
se1_21 | 423 | 402 | 616 | 734 |
se2_53 | 153 | 751 | 31 | 55 |
I know how to do this in R but I need this to be done in python so I cannot add here what I tried due it is in R.
Thanks for your time
答案1
得分: 1
out = (pd.concat([sample_df, sample_df_2])
.pivot(index='contig_id', columns='tetran', values='Frequency'))
print(out)
# 输出
tetran AAAT ACCT ATCG CTAT
contig_id
cl1_41 614 632 73 124
cl1_42 25 423 123 512
cl2_31 311 14 155 75
se1_21 616 423 734 402
se1_51 4 724 102 132
se2_53 31 153 55 751
如果你不想数据排序,可以使用 pivot_table
:
out = (pd.concat([sample_df, sample_df_2])
.pivot_table(index='contig_id', columns='tetran', values='Frequency', sort=False))
print(out)
# 输出
tetran ACCT CTAT AAAT ATCG
contig_id
cl1_42 423 512 25 123
cl1_41 632 124 614 73
cl2_31 14 75 311 155
se1_51 724 132 4 102
se1_21 423 402 616 734
se2_53 153 751 31 55
有用的链接: 如何进行数据框的数据透视?
英文:
First, concat
your dataframes then pivot
them:
out = (pd.concat([sample_df, sample_df_2])
.pivot(index='contig_id', columns='tetran', values='Frequency'))
print(out)
# Output
tetran AAAT ACCT ATCG CTAT
contig_id
cl1_41 614 632 73 124
cl1_42 25 423 123 512
cl2_31 311 14 155 75
se1_21 616 423 734 402
se1_51 4 724 102 132
se2_53 31 153 55 751
If you don't want the data to be sorted, use pivot_table
:
out = (pd.concat([sample_df, sample_df_2])
.pivot_table(index='contig_id', columns='tetran', values='Frequency', sort=False))
print(out)
# Output
tetran ACCT CTAT AAAT ATCG
contig_id
cl1_42 423 512 25 123
cl1_41 632 124 614 73
cl2_31 14 75 311 155
se1_51 724 132 4 102
se1_21 423 402 616 734
se2_53 153 751 31 55
Useful link: How can I pivot a dataframe?
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论