Rust serde get runtime heap size of Vec<serde_json::Value>

Question

I'm making a Rust tool that migrates data using the REST API of an internal service. Essentially, it makes a GET request; the returned data is an array of JSON objects, which is deserialized into a struct field of type serde_json::Value. The tool then gets a mutable array from it (as_array_mut) for a bit of processing and POSTs the result to another REST API.

This is done in batches of, say, 10000 records per request; however, the size of the data can change unpredictably. Usually it's around 10 MiB, but sometimes it jumps to over 400 MiB, which can easily crash the internal service.

Because of this, I want a way to control how many records are fetched per request based on the size of the response data, in other words, the runtime heap size of the Vec<serde_json::Value>. I've tried std::mem::size_of_val and the heapsize crate, but neither worked. One workaround would be to convert the data to a string and take its length (the size doesn't have to be 100% accurate; a rough estimate is fine), but that would mean holding two copies of the JSON data. That is my last resort, so I wanted to know whether there's a more efficient way to get the heap size.
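
For context, one likely reason std::mem::size_of_val falls short here is that it measures only the value itself, not the heap memory the value points to. A minimal sketch, using a plain Vec<u8> instead of serde_json::Value so that it needs no external crates (the helper names shallow_size and vec_u8_total_size are made up for illustration):

```rust
use std::mem::size_of_val;

/// Shallow size: only the bytes of the value itself.
/// For a Vec this is just the pointer/length/capacity header.
pub fn shallow_size<T>(value: &T) -> usize {
    size_of_val(value)
}

/// Shallow size plus the heap buffer that a Vec<u8> has reserved.
/// (Hypothetical helper: a real estimate for nested data has to recurse.)
pub fn vec_u8_total_size(v: &Vec<u8>) -> usize {
    size_of_val(v) + v.capacity()
}
```

An empty Vec<u8> and one holding a million bytes report the same shallow size, which is why a recursive estimate over the contents is needed for nested JSON values.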

Update (response to @Caesar):
I was temporarily using this while waiting for better approaches:
let size = serde_json::to_vec(docs)?.len();

Thanks to @Caesar, I did some benchmarking and here's what I got. size is the measurement mentioned above; size_new and size_new_for both use sizeof_val from Caesar's answer, the difference being that the first uses .map(|v| { sizeof_val(v) }).sum() and the second is a simple for-in loop that adds each result to a variable.

rows: 1000
size raw: 1360727, fmt: 1.30 MiB, took: 4.980794ms
size_new raw: 3834194, fmt: 3.66 MiB, took: 716.486μs
size_new_for raw: 3834194, fmt: 3.66 MiB, took: 672.523μs

rows: 10000
size raw: 17778816, fmt: 16.96 MiB, took: 62.151661ms
size_new raw: 43805986, fmt: 41.78 MiB, took: 8.775323ms
size_new_for raw: 43805986, fmt: 41.78 MiB, took: 8.158837ms

rows: 50000
size raw: 84354219, fmt: 80.45 MiB, took: 199.82163ms
size_new raw: 175919470, fmt: 167.77 MiB, took: 26.010926ms
size_new_for raw: 175919470, fmt: 167.77 MiB, took: 27.084353ms

Ignoring the timings, there is a huge difference between these sizes and the size obtained by turning the entire thing into a vector of bytes (serde_json::to_string takes over 2 times longer than serde_json::to_vec but gives the same result). I'm confused about which one is the over-estimate here: isn't turning the entire thing into a string/byte array supposed to be an over-estimate, or have I been using a grossly under-estimated approximation this whole time?

Here's the complete code:

let size = serde_json::to_vec(docs)?.len() as u64;
let size_new: usize = docs.iter().map(|v| {
    sizeof_val(v)
}).sum();
let mut size_new_for = 0;
for v in docs.iter() {
    size_new_for += sizeof_val(v);
}

Answer 1

Score: 1

Calculating the exact memory size of a serde_json::Value is somewhat tricky for several reasons:

  • You can't access the storage behind the underlying Map type and ask what capacity its backing allocation has.
  • Allocators have overhead, so even if you know the allocated size, that doesn't translate directly into how much memory you'll need.

In any case, the following function might provide a workable approximation.

fn sizeof_val(v: &serde_json::Value) -> usize {
    std::mem::size_of::<serde_json::Value>()
        + match v {
            serde_json::Value::Null => 0,
            serde_json::Value::Bool(_) => 0,
            serde_json::Value::Number(_) => 0, // Incorrect if arbitrary_precision is enabled. oh well
            serde_json::Value::String(s) => s.capacity(),
            serde_json::Value::Array(a) => a.iter().map(sizeof_val).sum(),
            serde_json::Value::Object(o) => o
                .iter()
                .map(|(k, v)| {
                    std::mem::size_of::<String>()
                        + k.capacity()
                        + sizeof_val(v)
                        + std::mem::size_of::<usize>() * 3 // As a crude approximation, I pretend each map entry has 3 words of overhead
                })
                .sum(),
        }
}

A few thoughts (mostly Linux-centric):

  • If you need precise memory sizes, you might be better off by directly measuring your process's memory size via procfs::process::Process::myself().unwrap().status().unwrap().vmrss.unwrap() * 1024. The caveat here is that allocators tend to not give memory back to the OS that quickly, so you might over-estimate.
  • If you're using a custom allocator, you might be able to directly ask it for memory usage statistics.
  • Instead of worrying about controlling the size, you could let the OS warn you about impending memory overuse by registering an eventfd on memory.oom_control (but I think you may have to implement that yourself, I don't see a convenient crate for it).
  • (The loupe crate also implements allocation size measuring, but I don't think it supports serde_json.)
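
The resident-set measurement from the first bullet can also be sketched without the procfs crate by reading /proc/self/status directly. This is a Linux-only sketch under the assumption that the file contains a `VmRSS:` line reported in kB; the function names parse_vm_rss_kib and current_rss_bytes are made up for illustration.

```rust
use std::fs;

/// Extracts the `VmRSS` value (reported in kB) from the text of a
/// /proc/<pid>/status file. Returns None if the line is missing or malformed.
pub fn parse_vm_rss_kib(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmRSS:"))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|num| num.parse().ok())
}

/// Resident set size of the current process in bytes.
/// Returns None when /proc is unavailable (e.g. not on Linux).
pub fn current_rss_bytes() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    parse_vm_rss_kib(&status).map(|kib| kib * 1024)
}
```

Comparing current_rss_bytes() before and after deserializing a batch gives a rough delta, subject to the same caveat as above: the allocator may hold on to freed memory, so the figure can over-estimate.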

huangapple
  • Posted on June 12, 2023, 15:02:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76454260.html