extracting unique pairs from large correlation matrix
Question
I have a fairly large dataset (~50K entries) which I use to generate a correlation matrix. This works well, using "only" ~20GB of RAM.

Then I want to extract only the unique pairwise combinations from it and convert them into a data frame. This is where I run into issues: either too much RAM usage or overflowing the indexing variable(s). I know there are >2B combinations, so I am aware the result grows considerably in size, but still...

I have tried different ways to achieve this, but with no success.

Mock data:

```R
df = matrix(runif(1), nrow = 50000, ncol = 50000,
            dimnames = list(seq(1, 50000, by = 1), seq(1, 50000, by = 1)))
```

Trying to extract the upper/lower triangle from the correlation matrix and then reshape it:

```R
df[lower.tri(df, diag = T),] = NA
df = reshape2::melt(df, na.rm = T)
```

crashes with:

```
Error in df[lower.tri(bla, diag = T), ] = NA :
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:522
```

It crashes with the same error if you do only `df = df[lower.tri(df, diag = T),]`.

(I did read through https://stackoverflow.com/questions/24335692/large-matrices-in-r-long-vectors-not-supported-yet but I didn't find it helpful for my situation.)

I also tried:

```R
df = subset(as.data.frame(as.table(df)),
            match(Var1, names(annotation_table)) > match(Var2, names(annotation_table)))
```

to use only base-R packages, but it eventually ran out of memory after ~1 day. The most RAM-intensive part is `as.data.frame(as.table(df))`, so I also tried replacing it with `reshape2::melt(df)`, but that ran out of RAM as well.

I am running the code on an Ubuntu machine with 128GB RAM. I do have larger machines, but I would have expected this amount of RAM to suffice.

Any help would be highly appreciated. Thank you.
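For scale, the ">2B combinations" above can be made concrete with a quick back-of-the-envelope check (illustrative only, not part of the original question): the full 50,000 x 50,000 matrix already has more cells than `.Machine$integer.max`, which is exactly the territory where R switches to long vectors.

```R
# Back-of-the-envelope sizes for a 50,000 x 50,000 correlation matrix
n <- 50000
n^2                          # 2.5e9 cells in the full matrix
n * (n - 1) / 2              # ~1.25e9 unique off-diagonal pairs
n^2 > .Machine$integer.max   # TRUE: indexing the matrix needs long vectors
```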
Answer 1 (score: 6)
If you have as much RAM as you say, then this really should work without issue for `n` much greater than 6.

```R
set.seed(0)
n <- 6L
x <- provideDimnames(cor(matrix(rnorm(as.double(n) * n), n, n)))
x
            A           B            C           D           E            F
A  1.00000000  0.42679900  0.113100027 -0.03952030 -0.02406114 -0.693427730
B  0.42679900  1.00000000  0.519377903  0.06136646 -0.51713799 -0.331961466
C  0.11310003  0.51937790  1.000000000 -0.43996491 -0.32225557 -0.006199606
D -0.03952030  0.06136646 -0.439964909  1.00000000 -0.42053358  0.537301520
E -0.02406114 -0.51713799 -0.322255571 -0.42053358  1.00000000 -0.367595700
F -0.69342773 -0.33196147 -0.006199606  0.53730152 -0.36759570  1.000000000

s <- seq_len(n) - 1L
nms <- dimnames(x)
dat <- data.frame(val = x[sequence(s, seq.int(1L, length(x), n))],
                  row = gl(n, 1L, labels = nms[[1L]])[sequence(s, 1L)],
                  col = rep.int(gl(n, 1L, labels = nms[[2L]]), s))
dat
            val row col
1   0.426798998   A   B
2   0.113100027   A   C
3   0.519377903   B   C
4  -0.039520302   A   D
5   0.061366463   B   D
6  -0.439964909   C   D
7  -0.024061141   A   E
8  -0.517137993   B   E
9  -0.322255571   C   E
10 -0.420533577   D   E
11 -0.693427730   A   F
12 -0.331961466   B   F
13 -0.006199606   C   F
14  0.537301520   D   F
15 -0.367595700   E   F
```

If you are using a version of R older than 4.0.0, where `sequence` is defined differently, then you'll want something like:

```R
sequence <- function(nvec, from = 1L, by = 1L)
    unlist(.mapply(seq.int, list(from = from, by = by, length.out = nvec), NULL),
           recursive = FALSE, use.names = FALSE)
```

Note that `seq.int` accepts both `integer` and `double` arguments and conveniently returns a `double` rather than an `integer` if it determines that the result would overflow `integer`, which happens here when `n * n > .Machine$integer.max`.

```R
M <- .Machine$integer.max
typeof(seq.int(from = 1L, to = 1 + M, by = M))
[1] "double"
```
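To see why the index construction in this answer works, here is a small sanity check (illustrative, not from the original answer): `sequence(s, seq.int(1L, length(x), n))` produces exactly the column-major linear indices of the strictly upper triangle, the same set `which(upper.tri(...))` would return, but without materializing an n-by-n logical matrix.

```R
# Verify the linear-index trick against which(upper.tri(...)) on a tiny matrix
n <- 4L
s <- seq_len(n) - 1L                      # 0, 1, 2, 3 elements per column
idx <- sequence(s, seq.int(1L, n * n, n)) # 5, 9, 10, 13, 14, 15
all(idx == which(upper.tri(matrix(0, n, n))))  # TRUE
```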
Answer 2 (score: 2)
Okay, after digging a bit more and trying other things, I found a solution in a previous [post](https://stackoverflow.com/a/18128052/1116518) that eventually worked:

```R
upper_ind = which(upper.tri(df, diag = F), arr.ind = T)
gc()  # clean up and free some RAM
df = data.frame(first = dimnames(df)[[2]][upper_ind[, 2]],
                second = dimnames(df)[[1]][upper_ind[, 1]],
                correlation = df[upper_ind])
```

For reference, testing it on my real data (a 49,105 x 49,105 correlation matrix):

* retrieving only the indices of the elements within the upper triangle of the correlation matrix (first command) peaked at ~90GB RAM
* putting the data frame together (second command) peaked at ~100GB RAM

If you have datasets this large, I really suggest calling the garbage collector between the two commands, as it actually helped. Time-wise, it took less than 10 minutes. It is not ideal, but given my setup and time constraints, it is a solution.

Thank you @Robert Hacken for spotting the erroneous `,` in `df[lower.tri(df, diag = T),] = NA` (i.e., the comma before the closing bracket should be removed).

I think that what @Mikael Jagan has proposed might be more memory efficient.

**Update**

With the updates from @Mikael to his answer and a bit of tweaking, his approach also works on very large datasets. The tweaking mainly consists of using `double` instead of `integer`, because the data I work with exceeds the range of an integer (e.g., the size of my data is `length(x) = 2,411,301,025`, while the integer limit is `.Machine$integer.max = 2,147,483,647`). Of course, his intention was to speed things up as much as possible, hence integers rather than doubles, but for my use case that is not possible.

So, modifying @Mikael's code to:

```R
generate_sequence = function(nvec, from = 1, by = 1)
    unlist(.mapply(seq, list(from = from, by = by, length.out = nvec), NULL),
           recursive = FALSE, use.names = FALSE)
```

and the rest to:

```R
set.seed(0)
n <- 6
x <- provideDimnames(cor(matrix(rnorm(n * n), n, n)))
s <- seq_len(n) - 1
nms <- dimnames(x)
dat <- data.frame(val = x[sequence(s, seq(1, length(x), n))],
                  row = gl(n, 1, labels = nms[[1]])[sequence(s, 1)],
                  col = rep(gl(n, 1, labels = nms[[2]]), s))
```

gets the job done on my data in ~2 minutes, peaking at ~70GB RAM. Thanks again for your answer, @Mikael.

If you run into trouble with the sequence generation, please read through the comments as well.
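For readers who want to try this answer's `which(upper.tri(...))` approach before committing 100GB of RAM, here is a tiny end-to-end illustration (a sketch on a 4x4 matrix, not the author's 49,105 x 49,105 run):

```R
# Small-scale demo of the upper-triangle extraction from this answer
set.seed(0)
m <- provideDimnames(cor(matrix(rnorm(16), 4, 4)))
upper_ind <- which(upper.tri(m, diag = FALSE), arr.ind = TRUE)
pairs <- data.frame(first  = dimnames(m)[[2]][upper_ind[, 2]],
                    second = dimnames(m)[[1]][upper_ind[, 1]],
                    correlation = m[upper_ind])
nrow(pairs)  # 6, i.e. 4 * 3 / 2 unique pairs
```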