Python program to generate a single species matrix file from multiple sample-wise species count files (using Pandas?)

huangapple go评论65阅读模式
英文:

Python program to generate a single species matrix file from multiple sample-wise species count files (using Pandas?)

问题

有一个函数可以使用数据框来连接这两个数据文件,使得标题包含所有样本文件的物种,但不重复,行数是样本数量,每行显示该样本中该物种的读数,如果该样种在该样本中不存在则为0吗?

对于上面的示例,我想要物种矩阵如下所示:

Sample ID | Bacillus subtilis | Escherichia coli | Vibrio cholerae serotype 1 | Homo sapiens | Yersinia pestis | Danio rerio
Sample_1 | 464 | 116 | 62 | 0 | 0 | 0
Sample_2 | 364 | 0 | 0 | 120 | 16 | 16

我不太熟悉Pandas,所以这是我迄今为止尝试过的代码:

import pandas as pd
import numpy as np
import glob

path = "/content/"
sc_files = glob.glob(path + "*.tsv")
df_sc = []

for file in sc_files:
  df_sample = pd.read_csv(file, sep = '\t')
  df_sample = df_sample.set_index("Sample ID")
  df_sample = df_sample.transpose()
  df_sample = df_sample[~df_sample.index.duplicated(keep='first')]
  df_sc.append(df_sample)

df_matrix = pd.concat(df_sc, axis = 1).fillna(0)

这是我得到的输出:

Sample ID | Bacillus subtilis | Escherichia coli | Vibrio cholerae serotype 1 | Bacillus subtilis | Homo sapiens | Yersinia pestis | Danio rerio
Sample_1 | 464.0 | 116.0 | 62.0 | 0.0 | 0.0 | 0.0 | 0.0
Sample_2 | 0.0 | 0.0 | 0.0 | 364.0 | 120.0 | 16.0 | 16.0

我应该如何做才能使所有包含特定物种名称(在这种情况下是Bacillus subtilis)的样本计数出现在同一列中?

我尝试删除

df_sample = df_sample[~df_sample.index.duplicated(keep='first')]

但无论我保留它还是删除它,结果都是相同的。

英文:

Suppose I have 2 files that list the species count. Like this:

sample1_sc.tsv

Sample_1 | Sample ID
464 | Bacillus subtilis
116 | Escherichia coli
62 | Vibrio cholerae serotype 1

sample2_sc.tsv

Sample_2 | Sample ID
364 | Bacillus subtilis
120 | Homo sapiens
16 | Yersinia pestis
16 | Danio rerio

is there a function to concat both data files using dataframes such that the header contains the Species of all the sample files without repetition and the no. of rows are the no. samples, each row showing the read count of the species in that sample and 0 if the species is not in the sample?

For the above ex, I want the species matrix to look like this:

Sample ID | Bacillus subtilis | Escherichia coli | Vibrio cholerae serotype 1 | Homo sapiens | Yersinia pestis | Danio rerio
Sample_1 | 464 | 116 | 62 | 0 | 0 | 0
Sample_2 | 364 | 0 | 0 | 120 | 16 | 16

I am not as familiar with Pandas, so here's the code I have tried so far:

import pandas as pd
import numpy as np
import glob

path = "/content/"
sc_files = glob.glob(path + "*.tsv")
df_sc = []

for file in sc_files:
  df_sample = pd.read_csv(file, sep = '\t')
  df_sample = df_sample.set_index("SampleID")
  df_sample = df_sample.transpose()
  df_sample = df_sample[~df_sample.index.duplicated(keep='first')]
  df_sc.append(df_sample)

df_matrix = pd.concat(df_sc, axis = 1).fillna(0)

And this is the output I'm getting:

SampleID | Bacillus subtilis | Escherichia coli | Vibrio cholerae serotype 1 | Bacillus subtilis | Homo sapiens | Yersinia pestis | Danio rerio
Sample_1 | 464.0 | 116.0 | 62.0 | 0.0 | 0.0 | 0.0 | 0.0
Sample_2 | 0.0 | 0.0 | 0.0 | 364.0 | 120.0 | 16.0 | 16.0

How do I make it so all the samples' count containing a specific species name (Bacillus subtilis in this case) appear in the same column?

I tried to remove the
df_sample = df_sample[~df_sample.index.duplicated(keep='first')]

but whether I keep it or remove it, it gives me the same result?

答案1

得分: 1

我会这样使用concat(小心轴的方向!):

dfs = [df1, df2]

out = (pd.concat([d.set_index('Sample ID') for d in dfs], axis=1)
         .fillna(0, downcast='infer').T
         .rename_axis(index='Sample ID', columns=None).reset_index()
       )

输出:

Sample ID  Bacillus subtilis  Escherichia coli  Vibrio cholerae serotype 1  Homo sapiens  Yersinia pestis  Danio rerio
Sample_1                 464               116                          62             0                0            0
Sample_2                 364                 0                           0           120               16           16
英文:

I would use concat this way, (be careful of the axis!):

dfs = [df1, df2]

out = (pd.concat([d.set_index('Sample ID') for d in dfs], axis=1)
         .fillna(0, downcast='infer').T
         .rename_axis(index='Sample ID', columns=None).reset_index()
       )

Output:

Sample ID  Bacillus subtilis  Escherichia coli  Vibrio cholerae serotype 1  Homo sapiens  Yersinia pestis  Danio rerio
Sample_1                 464               116                          62             0                0            0
Sample_2                 364                 0                           0           120               16           16

答案2

得分: 0

你也可以使用 pd.concat()pd.pivot_table()

df = pd.concat([df1, df2])
df = (df.pivot_table(index='Sample ID', columns='Species', values='Count', fill_value=0)
        .reset_index())

Species Sample ID  Bacillus subtilis  Danio rerio  Escherichia coli  Homo sapiens  Vibrio cholerae serotype 1  Yersinia pestis
0        Sample_1                464            0               116             0                          62                0
1        Sample_2                364           16                 0           120                           0               16

注意 输出中的列的顺序发生了变化,这是因为 df.pivot_table 默认按字母顺序对列进行排序。

英文:

You can also use pd.concat() and pd.pivot_table()

df = pd.concat([df1, df2])
df = (df.pivot_table(index='Sample ID', columns='Species', values='Count', fill_value=0)
        .reset_index())

Species Sample ID  Bacillus subtilis  Danio rerio  Escherichia coli  Homo sapiens  Vibrio cholerae serotype 1  Yersinia pestis
0        Sample_1                464            0               116             0                          62                0
1        Sample_2                364           16                 0           120                           0               16

Note There is a change of columns order in the output because df.pivot_table by default sorts the columns alphabetically.

huangapple
  • 本文由 发表于 2023年3月9日 17:54:19
  • 转载请务必保留本文链接:https://go.coder-hub.com/75682920.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定