Rust serde: get the runtime heap size of Vec<serde_json::Value>

Question

I'm making a Rust tool that migrates data using the REST API of an internal service. Essentially, it makes a GET request, deserializes the returned array of JSON objects into a struct field of type serde_json::Value, gets a mutable array from it (as_array_mut) for a bit of processing, and POSTs the result to another REST API.

This is done in batches of, say, 10000 records per request; however, the data can change unpredictably in size. Usually a batch is around 10 MiB, but sometimes it can jump to over 400 MiB, which can easily crash the internal service.

Because of this, I want a way to control how many records are fetched per request based on the size of the response data, in other words the heap size of Vec<serde_json::Value> at runtime. I've tried std::mem::size_of_val and the heapsize crate, but neither worked. One workaround would be converting the data to a string and taking its length (the size doesn't have to be 100% accurate; a rough estimate is fine too), but that would mean keeping two copies of the JSON data in memory. That's my last resort, so I wanted to know whether there's a more efficient way to get the heap size.

Update (response to @Caesar):
I was temporarily using this while waiting for a better approach:
let size = serde_json::to_vec(docs)?.len();

Thanks to @Caesar I did some benchmarking, and here's what I got. size is from the snippet above; size_new and size_new_for are from Caesar's answer, the difference being that the first uses .map(|v| { sizeof_val(v) }).sum() and the second is a plain for-in loop that adds each result to a variable.

rows: 1000
size raw: 1360727, fmt: 1.30 MiB, took: 4.980794ms
size_new raw: 3834194, fmt: 3.66 MiB, took: 716.486μs
size_new_for raw: 3834194, fmt: 3.66 MiB, took: 672.523μs

rows: 10000
size raw: 17778816, fmt: 16.96 MiB, took: 62.151661ms
size_new raw: 43805986, fmt: 41.78 MiB, took: 8.775323ms
size_new_for raw: 43805986, fmt: 41.78 MiB, took: 8.158837ms

rows: 50000
size raw: 84354219, fmt: 80.45 MiB, took: 199.82163ms
size_new raw: 175919470, fmt: 167.77 MiB, took: 26.010926ms
size_new_for raw: 175919470, fmt: 167.77 MiB, took: 27.084353ms

Ignoring the timings, there's a huge difference between these sizes and the length of the serialized byte vector (serde_json::to_string takes over twice as long as serde_json::to_vec but gives the same result). I'm confused about which one is the over-estimate here: isn't serializing the whole thing to a string/byte array supposed to be an over-estimate, or have I been using a grossly under-estimated approximation this whole time?

Here's the complete code:

let size = serde_json::to_vec(docs)?.len() as u64;
let size_new: usize = docs.iter().map(|v| {
    sizeof_val(v)
}).sum();
let mut size_new_for = 0;
for v in docs.iter() {
    size_new_for += sizeof_val(v);
}

Answer 1

Score: 1


Calculating the exact memory size of a serde_json::Value is somewhat tricky for several reasons:

  • You can't access the underlying Map class and ask what capacity its backing allocation has
  • Allocators have overhead, so even if you know the allocated size, that doesn't translate directly into how much memory you'll need.

In any case, the following function might provide a workable approximation.

fn sizeof_val(v: &serde_json::Value) -> usize {
    std::mem::size_of::<serde_json::Value>()
        + match v {
            serde_json::Value::Null => 0,
            serde_json::Value::Bool(_) => 0,
            serde_json::Value::Number(_) => 0, // Incorrect if arbitrary_precision is enabled. oh well
            serde_json::Value::String(s) => s.capacity(),
            serde_json::Value::Array(a) => a.iter().map(sizeof_val).sum(),
            serde_json::Value::Object(o) => o
                .iter()
                .map(|(k, v)| {
                    std::mem::size_of::<String>()
                        + k.capacity()
                        + sizeof_val(v)
                        + std::mem::size_of::<usize>() * 3 // As a crude approximation, I pretend each map entry has 3 words of overhead
                })
                .sum(),
        }
}

A few thoughts (mostly linux-centric):

  • If you need precise memory sizes, you might be better off by directly measuring your process's memory size via procfs::process::Process::myself().unwrap().status().unwrap().vmrss.unwrap() * 1024. The caveat here is that allocators tend to not give memory back to the OS that quickly, so you might over-estimate.
  • If you're using a custom allocator, you might be able to directly ask it for memory usage statistics.
  • Instead of worrying about controlling the size, you could let the OS warn you about impending memory overuse by registering an eventfd on memory.oom_control (but I think you may have to implement that yourself; I don't see a convenient crate for it).
  • (The loupe crate also implements allocation size measuring, but I don't think it supports serde_json.)
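The custom-allocator route from the second bullet can be sketched with just the standard library: wrap the system allocator and keep a running count of live heap bytes. This is an illustrative sketch, not a production design (the names are made up, and Relaxed counters only give an approximate figure under concurrency):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Running total of currently live heap bytes, maintained by the
// counting wrapper below.
static LIVE_BYTES: AtomicUsize = AtomicUsize::new(0);

// Thin wrapper around the system allocator that counts allocations.
// The default `realloc`/`alloc_zeroed` impls route through these two
// methods, so the count stays consistent.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let p = unsafe { System.alloc(layout) };
        if !p.is_null() {
            LIVE_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        }
        p
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        LIVE_BYTES.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static ALLOC: CountingAlloc = CountingAlloc;

fn main() {
    let before = LIVE_BYTES.load(Ordering::Relaxed);
    let v: Vec<u64> = (0..1000).collect(); // allocates 8000 bytes of data
    let after = LIVE_BYTES.load(Ordering::Relaxed);
    println!("vec of 1000 u64s accounted for ~{} bytes", after - before);
    drop(v);
}
```

The same counter could be snapshotted before and after deserializing a response to get a direct measure of what the parsed batch actually costs.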

huangapple
  • Published 2023-06-12 15:02:18
  • Please keep this link when reposting: https://go.coder-hub.com/76454260.html