Array memory allocation in Node.js

Question


I am dealing with a huge array.

It contains ~200,000 elements. Basically it's an array of strings, each ~50 characters long. After looking around, I found that each character takes 2 bytes, i.e. 100 bytes per element.

Therefore, the total memory allocation should add up to 200,000 * 100 bytes = ~20 MB.

object-sizeof, js-sizeof, and sizeof all seem to implement the same logic.

But consider this snippet,

console.log(process.memoryUsage());
const paths = getAllFilePaths();
console.log(process.memoryUsage());

Output before getting the array:

external:25080
heapTotal:31178752
heapUsed:10427896 //10 MB
rss:51761152

Output after getting the array:

external:16888
heapTotal:173539328
heapUsed:134720896 //134 MB
rss:204070912

This is a ~124 MB increase in heapUsed.

Implementation of getAllFilePaths():

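const fs = require('fs');
// Assumption: joinPath is path.join; the original post does not show its definition.
const joinPath = require('path').join;

// Recursively collects the paths of all non-directory entries under _path.
const getAllFilePaths = function (_path, paths = []) {
    fs.readdirSync(_path).forEach(name => {
        const fullPath = joinPath(_path, name); // build the path once and reuse it
        if (fs.lstatSync(fullPath).isDirectory()) {
            getAllFilePaths(fullPath, paths);
            return;
        }

        paths.push(fullPath);
    });

    return paths;
};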

Why is so much memory being used? Is this the expected behaviour, or could getAllFilePaths() somehow be leaking memory?

Answer 1

Score: 3


V8 developer here. Two points come to mind to explain the discrepancy between your expectations and measurements:

(1) An array of strings needs more memory than just the strings' characters. In memory, a string object has a header that takes 16 bytes on a 64-bit system (a pointer-sized [1] "shape" pointer plus two 32-bit fields for hash and length). Depending on how exactly the strings are constructed, they might also use different representations internally; header + characters is the simplest form. Additionally, the array itself has a pointer-sized entry for each element, adding at least another 200,000 * 8 bytes = 1.6 MB. It can be more than that: dynamically grown arrays over-allocate when they have to grow so that they don't have to grow for every addition, which wastes space if the array happens to stop growing right after having over-allocated.
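To put rough numbers on point (1), here is a minimal sketch of the arithmetic. The header and array-slot sizes come from the description above; the 8-byte alignment and the function names (alignUp, estimateStringArrayBytes) are assumptions made for illustration, and nothing here is queried from the engine itself.

```javascript
// Rough per-element estimate for an array of flat strings on 64-bit V8
// without pointer compression. All constants are assumptions taken from
// the answer's description, not values read from the engine.
const STRING_HEADER_BYTES = 16; // shape pointer (8) + hash (4) + length (4)
const ARRAY_SLOT_BYTES = 8;     // one pointer-sized entry per array element
const ALIGNMENT_BYTES = 8;      // assumed heap object alignment

// Rounds bytes up to the next multiple of alignment.
function alignUp(bytes, alignment) {
  return Math.ceil(bytes / alignment) * alignment;
}

// Estimates total bytes for `count` strings of `charsPerString` characters,
// ignoring array over-allocation and any non-flat string representations.
function estimateStringArrayBytes(count, charsPerString, bytesPerChar) {
  const perString = alignUp(
    STRING_HEADER_BYTES + charsPerString * bytesPerChar,
    ALIGNMENT_BYTES
  );
  return count * (perString + ARRAY_SLOT_BYTES);
}

const oneByteEstimate = estimateStringArrayBytes(200000, 50, 1);
const twoByteEstimate = estimateStringArrayBytes(200000, 50, 2);
console.log(oneByteEstimate, twoByteEstimate); // 16000000 25600000
```

Under these assumptions the array should cost somewhere between ~16 MB and ~25.6 MB, so per-object overhead alone does not account for a ~124 MB increase.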

(2) AFAIK process.memoryUsage() simply returns the current heap usage statistics, which can contain garbage left behind by previous operations. To determine the memory consumption of something, it is advisable to explicitly trigger a full GC cycle before every measurement. Specifically: start Node with --expose-gc and call global.gc() before every process.memoryUsage().
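The measurement procedure from point (2) could look roughly like the sketch below. The helper names (collectGarbage, measureHeapDelta) and the synthetic workload are hypothetical; the only real APIs used are process.memoryUsage() and global.gc(), the latter being defined only when Node is started with --expose-gc.

```javascript
// Forces a full GC when available; returns whether it was possible.
function collectGarbage() {
  if (typeof global.gc === 'function') {
    global.gc(); // only defined when Node is started with --expose-gc
    return true;
  }
  return false;
}

// Measures the change in heapUsed caused by running `build`.
// The result is returned so the caller keeps it alive during measurement.
function measureHeapDelta(build) {
  const gcAvailable = collectGarbage();
  const before = process.memoryUsage().heapUsed;
  const result = build();
  collectGarbage(); // drop temporary garbage created while building
  const after = process.memoryUsage().heapUsed;
  return { result, gcAvailable, deltaBytes: after - before };
}

// Example workload (an assumption standing in for getAllFilePaths()):
// 200,000 distinct strings of 50 characters each.
const measurement = measureHeapDelta(() =>
  Array.from({ length: 200000 }, (_, i) => String(i).padStart(50, 'x'))
);
console.log(measurement.gcAvailable, measurement.deltaBytes);
```

Run it as node --expose-gc followed by the script name; without the flag, gcAvailable is false and the delta may still include uncollected garbage.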

For completeness, I'll mention: strings can take 1 or 2 bytes per character depending on their contents. Per individual string, each character takes the same amount, so a single non-ASCII character forces the entire string to be two-byte. For embedder-provided strings (like file names), the embedder also has to play along to support the one-byte optimization; I don't know whether Node's file API does this.
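As an illustration of the "one wide character widens the whole string" rule, the sketch below guesses the per-character width of a string. It follows the rule of thumb stated above (any non-ASCII character means two bytes per character); the exact cutoff the engine uses may differ, and the function name and sample paths are hypothetical.

```javascript
// Assumption: following the rule of thumb above, only ASCII code units
// qualify for the one-byte representation. The engine's real cutoff may differ.
const ONE_BYTE_MAX_CODE_UNIT = 0x7f;

// Returns 1 if every UTF-16 code unit fits the assumed one-byte range, else 2.
function estimatedBytesPerChar(str) {
  for (let i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > ONE_BYTE_MAX_CODE_UNIT) {
      return 2; // a single wide character makes the whole string two-byte
    }
  }
  return 1;
}

console.log(estimatedBytesPerChar('/home/user/file.txt')); // 1
console.log(estimatedBytesPerChar('/home/user/文件.txt')); // 2
```

This only predicts what the string contents would allow; whether strings returned by Node's file API actually use the one-byte form is left open, as noted above.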

[1] "pointer-sized" means 64 bits = 8 bytes nowadays; with "pointer compression" becoming available in V8 8.0, this shrinks to 4 bytes (if you choose to deploy a pointer-compressed build).

Answer 2

Score: 1


I did a small test here: Memory Leak Test

It seems to show that an array of 200,000 hardcoded items of 50 characters each produces the following:

{ 
rss: 58232832,
heapTotal: 40378368,
heapUsed: 25490136, // ~25 MB
external: 8272 
}

huangapple
  • Published on January 6, 2020 at 14:59:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59607913.html