How to count unique genes at each timepoint across a time-course analysis?

Question

I have time-course data with a list of genes (rows) at each timepoint (columns).

I need a quick way to count the unique (novel) genes at each subsequent timepoint.

For example, comparing timepoint 3 to timepoint 2, which genes are novel in timepoint 3? Then, for timepoint 4, which genes are novel compared to timepoints 2 and 3? And so on.

I have 14 timepoints and multiple datasets, so I need an efficient way to calculate how many genes are novel at each timepoint.

This is a tiny sample of the data:

    X1           X2           X3           X4
1   LOC115711925 LOC115694843 LOC115696797 LOC115721738
2   LOC115697141 LOC115695410 LOC115705991 LOC115698757
3   LOC115695663 LOC115695505 LOC115720646 LOC115704937
4   LOC115697811 LOC115695663 LOC115709480 LOC115724472
5   LOC115710226 LOC115695751 LOC115707388 LOC115702544
6   LOC115699430 LOC115695753 LOC115711243 LOC115705803
7   LOC115719329 LOC115695880 LOC115701282 LOC115711243
8   LOC115709251 LOC115695882 LOC115695751 LOC115698778
9   LOC115716776 LOC115695990 LOC115698262 LOC115707330
10  LOC115707556 LOC115696236 LOC115715294 LOC115718803
11  LOC115717016 LOC115696976 LOC115720841 LOC115720837
12  LOC115703186 LOC115696984 LOC115698132 LOC115719149
13  LOC115715930 LOC115696989 LOC115702328 LOC115712227
14  LOC115719149 LOC115697003 LOC115720788 LOC115724518
15  LOC115694843 LOC115697717 LOC115712291 LOC115701008
16  LOC115702383 LOC115697737 LOC115717255 LOC115700185
17  LOC115718171 LOC115697757 LOC115720540 LOC115699220
18  LOC115716727 LOC115697813 LOC115709300 LOC115707967
19  LOC115721947 LOC115697989 LOC115710741 LOC115705222
20  LOC115707802 LOC115698069 LOC115699007 LOC115716814
21  LOC115707848 LOC115698103 LOC115718118 LOC115712507

I have tried to do this manually in Excel, but the process is time-consuming and prone to human error.

I would be very thankful for any help.

Answer 1

Score: 2

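In base R you could do the following (the counts shown match the modified data frame posted in the other answer, not the sample in the question):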

aggregate(.~ind, subset(stack(df), !duplicated(values)), length)
  ind values
1  X1     21
2  X2     19
3  X3     18
4  X4     17
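
To make the one-liner easier to follow, here is a minimal step-by-step sketch on a toy data frame. The gene names and the `toy`, `long`, `first_seen`, `novel` and `novel_genes` variables are made up for illustration and are not from the original post. It relies on `stack()` returning the columns in order, so the first occurrence of a gene is always at its earliest timepoint.

```r
# Toy data: three timepoints, three genes each (hypothetical gene names)
toy <- data.frame(
  T1 = c("a", "b", "c"),
  T2 = c("b", "d", "e"),
  T3 = c("a", "e", "f"),
  stringsAsFactors = FALSE
)

# 1. Reshape to long format: one row per (gene, timepoint),
#    ordered column by column, so earlier timepoints come first
long <- stack(toy)

# 2. Keep only the first occurrence of each gene,
#    i.e. the timepoint at which it first appears
first_seen <- long[!duplicated(long$values), ]

# 3. Count first appearances per timepoint
novel <- table(first_seen$ind)
novel

# 4. If the gene IDs themselves are needed, split them by timepoint
novel_genes <- split(first_seen$values, first_seen$ind)
novel_genes
```

Here `table()` is used instead of `aggregate()` because it keeps a timepoint with zero novel genes in the result, whereas `aggregate()` should simply omit such a group.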

If you do not want to take X1 into account, drop it when stacking:

aggregate(.~ind, subset(stack(df, -1), !duplicated(values)), length)
  ind values
1  X2     21
2  X3     18
3  X4     18
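
Since the question mentions multiple datasets, the same idea could be wrapped in a small function and applied to a list of data frames. This is only a sketch: `count_novel`, its `skip_first` argument and the `datasets` list are hypothetical names, and the NA/empty-string filter assumes that shorter gene lists are padded that way.

```r
# Hypothetical helper: number of genes first seen at each timepoint (column)
count_novel <- function(df, skip_first = FALSE) {
  if (skip_first) df <- df[-1]
  long <- stack(df)
  # Drop padding (NA or empty strings) in case the gene lists differ in length
  long <- long[!is.na(long$values) & long$values != "", ]
  first_seen <- long[!duplicated(long$values), ]
  table(first_seen$ind)
}

# Made-up datasets standing in for the real ones
datasets <- list(
  exp1 = data.frame(X1 = c("g1", "g2", "g3"), X2 = c("g2", "g4", NA),
                    stringsAsFactors = FALSE),
  exp2 = data.frame(X1 = c("g1", "g2"), X2 = c("g2", "g3"), X3 = c("g1", "g4"),
                    stringsAsFactors = FALSE)
)

# One table of novel-gene counts per dataset
results <- lapply(datasets, count_novel)
results
```

Calling `count_novel(datasets$exp2, skip_first = TRUE)` would ignore the first timepoint, in the same way as `stack(df, -1)` above.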

Answer 2

Score: 0

I have added some duplicates to the data frame. With the original data frame, the output would have been:

X2 X3 X4 
19 20 20

We could use purrr's map2 and map_int combined with setdiff:

library(purrr)
library(dplyr)

# For each column, keep the genes that are absent from the previous column,
# then count them
map2(df[-1], df[-ncol(df)], setdiff) %>% 
  map_int(., length)

Output:

X2 X3 X4 
19 18 18 
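
Note that the `map2` call pairs each column only with the one immediately before it. If "novel" should mean absent from all earlier timepoints, as the question describes, the earlier columns can be accumulated with `union` first. The sketch below uses base R (`Reduce` and `mapply`) on made-up toy data to stay dependency-free; `toy`, `adjacent`, `seen_before` and `cumulative` are illustrative names, not part of the answer.

```r
# Toy data: three timepoints, three genes each (hypothetical gene names)
toy <- data.frame(
  T1 = c("a", "b", "c"),
  T2 = c("b", "d", "e"),
  T3 = c("a", "e", "f"),
  stringsAsFactors = FALSE
)
cols <- as.list(toy)
n <- length(cols)

# Adjacent comparison: genes absent from the previous timepoint only
adjacent <- mapply(function(cur, prev) length(setdiff(cur, prev)),
                   cols[-1], cols[-n])

# Running union of everything seen up to each earlier timepoint
seen_before <- Reduce(union, cols[-n], accumulate = TRUE)

# Cumulative comparison: genes absent from all earlier timepoints
cumulative <- mapply(function(cur, seen) length(setdiff(cur, seen)),
                     cols[-1], seen_before)

adjacent
cumulative
```

In this toy case gene "a" reappears at T3 after being absent at T2, so the adjacent comparison counts it as novel at T3 while the cumulative one does not.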

Modified data:

df <- structure(list(X1 = c("LOC115711925", "LOC115697141", "LOC115695663", 
"LOC115697811", "LOC115710226", "LOC115699430", "LOC115719329", 
"LOC115709251", "LOC115716776", "LOC115707556", "LOC115717016", 
"LOC115703186", "LOC115715930", "LOC115719149", "LOC115694843", 
"LOC115702383", "LOC115718171", "LOC115716727", "LOC115721947", 
"LOC115707802", "LOC115707848"), X2 = c("LOC115711925", "LOC115695410", 
"LOC115695505", "LOC115695663", "LOC115695751", "LOC115695753", 
"LOC115695880", "LOC115695882", "LOC115695990", "LOC115696236", 
"LOC115696976", "LOC115696984", "LOC115696989", "LOC115697003", 
"LOC115697717", "LOC115697737", "LOC115697757", "LOC115697813", 
"LOC115697989", "LOC115698069", "LOC115698103"), X3 = c("LOC115696797", 
"LOC115705991", "LOC115720646", "LOC115709480", "LOC115707388", 
"LOC115711243", "LOC115711925", "LOC115695751", "LOC115698262", 
"LOC115711925", "LOC115720841", "LOC115698132", "LOC115702328", 
"LOC115720788", "LOC115712291", "LOC115717255", "LOC115720540", 
"LOC115709300", "LOC115710741", "LOC115699007", "LOC115718118"
), X4 = c("LOC115721738", "LOC115698757", "LOC115704937", "LOC115724472", 
"LOC115702544", "LOC115705803", "LOC115711243", "LOC115698778", 
"LOC115707330", "LOC115718803", "LOC115711925", "LOC115719149", 
"LOC115712227", "LOC115711925", "LOC115701008", "LOC115700185", 
"LOC115699220", "LOC115707967", "LOC115705222", "LOC115716814", 
"LOC115712507")), class = "data.frame", row.names = c("1", "2", 
"3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", 
"15", "16", "17", "18", "19", "20", "21"))

huangapple
  • Posted on March 9, 2023, 13:25:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75680727.html