March 9, 2023 17:54:19

Python program to generate a single species matrix file from multiple sample-wise species count files (using Pandas?)

Question

Suppose I have two files that list species counts, like this:

sample1_sc.tsv

Sample_1 | Sample ID
464 | Bacillus subtilis
116 | Escherichia coli
62 | Vibrio cholerae serotype 1

sample2_sc.tsv

Sample_2 | Sample ID
364 | Bacillus subtilis
120 | Homo sapiens
16 | Yersinia pestis
16 | Danio rerio

Is there a function to concatenate both data files using DataFrames such that the header contains the species from all sample files without repetition, the number of rows equals the number of samples, and each row shows the read count of each species in that sample (0 if the species is absent from the sample)?

For the example above, I want the species matrix to look like this:

Sample ID | Bacillus subtilis | Escherichia coli | Vibrio cholerae serotype 1 | Homo sapiens | Yersinia pestis | Danio rerio
Sample_1 | 464 | 116 | 62 | 0 | 0 | 0
Sample_2 | 364 | 0 | 0 | 120 | 16 | 16

I am not very familiar with pandas, so here is the code I have tried so far:

import pandas as pd
import numpy as np
import glob
path = "/content/"
sc_files = glob.glob(path + "*.tsv")
df_sc = []
for file in sc_files:
  df_sample = pd.read_csv(file, sep = '\t')
  df_sample = df_sample.set_index("Sample ID")
  df_sample = df_sample.transpose()
  df_sample = df_sample[~df_sample.index.duplicated(keep='first')]
  df_sc.append(df_sample)
df_matrix = pd.concat(df_sc, axis = 1).fillna(0)

And this is the output I'm getting:

Sample ID | Bacillus subtilis | Escherichia coli | Vibrio cholerae serotype 1 | Bacillus subtilis | Homo sapiens | Yersinia pestis | Danio rerio
Sample_1 | 464.0 | 116.0 | 62.0 | 0.0 | 0.0 | 0.0 | 0.0
Sample_2 | 0.0 | 0.0 | 0.0 | 364.0 | 120.0 | 16.0 | 16.0

How can I make the counts from all samples for a given species (Bacillus subtilis in this case) appear in the same column?

I tried removing the line

df_sample = df_sample[~df_sample.index.duplicated(keep='first')]

but the result is the same whether I keep it or remove it.

Answer 1

Score: 1

I would use concat this way (be careful with the axis!):

dfs = [df1, df2]
out = (pd.concat([d.set_index('Sample ID') for d in dfs], axis=1)
         .fillna(0, downcast='infer').T
         .rename_axis(index='Sample ID', columns=None).reset_index()
       )

Output:

Sample ID  Bacillus subtilis  Escherichia coli  Vibrio cholerae serotype 1  Homo sapiens  Yersinia pestis  Danio rerio
Sample_1                 464               116                          62             0                0            0
Sample_2                 364                 0                           0           120               16           16
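
The attempt in the question transposes each frame first and then concatenates with axis=1, which places the frames side by side instead of aligning them on species names. As a rough sketch of how the approach above could be wrapped for the question's files, the following indexes each frame by its 'Sample ID' column (which holds the species names), concatenates along axis=1 so pandas aligns the rows by species, and only then transposes. The helper name build_species_matrix and the in-memory stand-ins for the two TSV files are illustrative assumptions, not part of the original answer. It also swaps downcast='infer' for fillna(0).astype(int), since I believe the downcast argument is deprecated in newer pandas releases.

```python
import io
import pandas as pd


def build_species_matrix(frames):
    """Combine per-sample count frames into one sample-by-species matrix.

    Assumes each frame has a 'Sample ID' column holding species names and
    one count column named after the sample (as in the question's files).
    """
    indexed = [frame.set_index("Sample ID") for frame in frames]
    # Rows are species, columns are samples; species missing from a sample become NaN.
    wide = pd.concat(indexed, axis=1)
    # Replace NaN with 0, restore integer counts, then put samples on the rows.
    matrix = wide.fillna(0).astype(int).T
    return matrix.rename_axis(index="Sample ID", columns=None).reset_index()


# In-memory stand-ins for sample1_sc.tsv and sample2_sc.tsv (hypothetical demo data
# copied from the question).
sample1 = pd.read_csv(io.StringIO(
    "Sample_1\tSample ID\n"
    "464\tBacillus subtilis\n"
    "116\tEscherichia coli\n"
    "62\tVibrio cholerae serotype 1\n"), sep="\t")
sample2 = pd.read_csv(io.StringIO(
    "Sample_2\tSample ID\n"
    "364\tBacillus subtilis\n"
    "120\tHomo sapiens\n"
    "16\tYersinia pestis\n"
    "16\tDanio rerio\n"), sep="\t")

species_matrix = build_species_matrix([sample1, sample2])
print(species_matrix)
```

With real files, the list of frames could be built the same way as in the question, for example [pd.read_csv(f, sep='\t') for f in glob.glob(path + "*.tsv")].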

Answer 2

Score: 0

You can also use pd.concat() and pivot_table(). This assumes df1 and df2 are in long format, with 'Sample ID', 'Species', and 'Count' columns:

df = pd.concat([df1, df2])
df = (df.pivot_table(index='Sample ID', columns='Species', values='Count', fill_value=0)
        .reset_index())

Species Sample ID  Bacillus subtilis  Danio rerio  Escherichia coli  Homo sapiens  Vibrio cholerae serotype 1  Yersinia pestis
0        Sample_1                464            0               116             0                          62                0
1        Sample_2                364           16                 0           120                           0               16

Note: the column order in the output changes because pivot_table sorts the columns alphabetically by default.
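
The files in the question are not in that long format, so a reshaping step is needed before the pivot. Below is a minimal sketch, assuming each frame has a 'Sample ID' column of species names plus one count column named after the sample. The helper to_long, the melt step, the reindex that restores first-seen species order, and the choice of aggfunc="sum" (so repeated sample/species rows would be added rather than averaged) are my own assumptions, not part of the original answer.

```python
import pandas as pd


def to_long(frame):
    """Reshape one per-sample frame to long format: Species, Sample ID, Count."""
    # In the question's files the 'Sample ID' column actually holds species names.
    renamed = frame.rename(columns={"Sample ID": "Species"})
    return renamed.melt(id_vars="Species", var_name="Sample ID", value_name="Count")


# Hypothetical demo frames mirroring the two files in the question.
df1 = pd.DataFrame({
    "Sample ID": ["Bacillus subtilis", "Escherichia coli", "Vibrio cholerae serotype 1"],
    "Sample_1": [464, 116, 62],
})
df2 = pd.DataFrame({
    "Sample ID": ["Bacillus subtilis", "Homo sapiens", "Yersinia pestis", "Danio rerio"],
    "Sample_2": [364, 120, 16, 16],
})

long_df = pd.concat([to_long(df1), to_long(df2)], ignore_index=True)

# Remember the order in which species first appear, to undo the alphabetical sort.
species_order = long_df["Species"].drop_duplicates().tolist()

matrix = (long_df.pivot_table(index="Sample ID", columns="Species",
                              values="Count", fill_value=0, aggfunc="sum")
          .reindex(columns=species_order)
          .astype(int)
          .rename_axis(columns=None)
          .reset_index())
print(matrix)
```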
