英文:
Python program to generate a single species matrix file from multiple sample-wise species count files (using Pandas?)
问题
有一个函数可以使用数据框来连接这两个数据文件,使得标题包含所有样本文件的物种,但不重复,行数是样本数量,每行显示该样本中该物种的读数,如果该样种在该样本中不存在则为0吗?
对于上面的示例,我想要物种矩阵如下所示:
Sample ID | Bacillus subtilis | Escherichia coli | Vibrio cholerae serotype 1 | Homo sapiens | Yersinia pestis | Danio rerio
Sample_1 | 464 | 116 | 62 | 0 | 0 | 0
Sample_2 | 364 | 0 | 0 | 120 | 16 | 16
我不太熟悉Pandas,所以这是我迄今为止尝试过的代码:
import pandas as pd
import numpy as np
import glob
path = "/content/"
sc_files = glob.glob(path + "*.tsv")
df_sc = []
for file in sc_files:
df_sample = pd.read_csv(file, sep = '\t')
df_sample = df_sample.set_index("Sample ID")
df_sample = df_sample.transpose()
df_sample = df_sample[~df_sample.index.duplicated(keep='first')]
df_sc.append(df_sample)
df_matrix = pd.concat(df_sc, axis = 1).fillna(0)
这是我得到的输出:
Sample ID | Bacillus subtilis | Escherichia coli | Vibrio cholerae serotype 1 | Bacillus subtilis | Homo sapiens | Yersinia pestis | Danio rerio
Sample_1 | 464.0 | 116.0 | 62.0 | 0.0 | 0.0 | 0.0 | 0.0
Sample_2 | 0.0 | 0.0 | 0.0 | 364.0 | 120.0 | 16.0 | 16.0
我应该如何做才能使所有包含特定物种名称(在这种情况下是Bacillus subtilis)的样本计数出现在同一列中?
我尝试删除
df_sample = df_sample[~df_sample.index.duplicated(keep='first')]
但无论我保留它还是删除它,结果都是相同的。
英文:
Suppose I have 2 files that list the species count. Like this:
sample1_sc.tsv
Sample_1 | Sample ID
464 | Bacillus subtilis
116 | Escherichia coli
62 | Vibrio cholerae serotype 1
sample2_sc.tsv
Sample_2 | Sample ID
364 | Bacillus subtilis
120 | Homo sapiens
16 | Yersinia pestis
16 | Danio rerio
is there a function to concat both data files using dataframes such that the header contains the Species of all the sample files without repetition and the no. of rows are the no. samples, each row showing the read count of the species in that sample and 0 if the species is not in the sample?
For the above ex, I want the species matrix to look like this:
Sample ID | Bacillus subtilis | Escherichia coli | Vibrio cholerae serotype 1 | Homo sapiens | Yersinia pestis | Danio rerio
Sample_1 | 464 | 116 | 62 | 0 | 0 | 0
Sample_2 | 364 | 0 | 0 | 120 | 16 | 16
I am not as familiar with Pandas, so here's the code I have tried so far:
import pandas as pd
import numpy as np
import glob
path = "/content/"
sc_files = glob.glob(path + "*.tsv")
df_sc = []
for file in sc_files:
df_sample = pd.read_csv(file, sep = '\t')
df_sample = df_sample.set_index("SampleID")
df_sample = df_sample.transpose()
df_sample = df_sample[~df_sample.index.duplicated(keep='first')]
df_sc.append(df_sample)
df_matrix = pd.concat(df_sc, axis = 1).fillna(0)
And this is the output I'm getting:
SampleID | Bacillus subtilis | Escherichia coli | Vibrio cholerae serotype 1 | Bacillus subtilis | Homo sapiens | Yersinia pestis | Danio rerio
Sample_1 | 464.0 | 116.0 | 62.0 | 0.0 | 0.0 | 0.0 | 0.0
Sample_2 | 0.0 | 0.0 | 0.0 | 364.0 | 120.0 | 16.0 | 16.0
How do I make it so all the samples' count containing a specific species name (Bacillus subtilis in this case) appear in the same column?
I tried to remove the
df_sample = df_sample[~df_sample.index.duplicated(keep='first')]
but whether I keep it or remove it, it gives me the same result?
答案1
得分: 1
我会这样使用concat
(小心轴的方向!):
dfs = [df1, df2]
out = (pd.concat([d.set_index('Sample ID') for d in dfs], axis=1)
.fillna(0, downcast='infer').T
.rename_axis(index='Sample ID', columns=None).reset_index()
)
输出:
Sample ID Bacillus subtilis Escherichia coli Vibrio cholerae serotype 1 Homo sapiens Yersinia pestis Danio rerio
Sample_1 464 116 62 0 0 0
Sample_2 364 0 0 120 16 16
英文:
I would use concat
this way, (be careful of the axis!):
dfs = [df1, df2]
out = (pd.concat([d.set_index('Sample ID') for d in dfs], axis=1)
.fillna(0, downcast='infer').T
.rename_axis(index='Sample ID', columns=None).reset_index()
)
Output:
Sample ID Bacillus subtilis Escherichia coli Vibrio cholerae serotype 1 Homo sapiens Yersinia pestis Danio rerio
Sample_1 464 116 62 0 0 0
Sample_2 364 0 0 120 16 16
答案2
得分: 0
你也可以使用 pd.concat()
和 pd.pivot_table()
。
df = pd.concat([df1, df2])
df = (df.pivot_table(index='Sample ID', columns='Species', values='Count', fill_value=0)
.reset_index())
Species Sample ID Bacillus subtilis Danio rerio Escherichia coli Homo sapiens Vibrio cholerae serotype 1 Yersinia pestis
0 Sample_1 464 0 116 0 62 0
1 Sample_2 364 16 0 120 0 16
注意 输出中的列的顺序发生了变化,这是因为 df.pivot_table
默认按字母顺序对列进行排序。
英文:
You can also use pd.concat()
and pd.pivot_table()
df = pd.concat([df1, df2])
df = (df.pivot_table(index='Sample ID', columns='Species', values='Count', fill_value=0)
.reset_index())
Species Sample ID Bacillus subtilis Danio rerio Escherichia coli Homo sapiens Vibrio cholerae serotype 1 Yersinia pestis
0 Sample_1 464 0 116 0 62 0
1 Sample_2 364 16 0 120 0 16
Note There is a change of columns order
in the output because df.pivot_table
by default sorts the columns alphabetically.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论