Rust: a String-like container which may contain arbitrary binary data

Question

Background Information

This question relates to serde and the output produced by Serializing and Deserializing Rust structs.

  • My aim is to write a nested structure which serializes as a String, which is human readable.
  • The sub-component of the main structure should itself be serialized in a variable way.
  • If it is serialized using a human readable format, which serializes as a String, then the whole structure will be human readable.
  • If the sub-component of the main structure is serialized with a binary format, then only the parts of the main structure will be human readable. The sub-components itself will not.

Strings

Rust Strings must be valid UTF-8. (The rest of the standard library assumes that they are, so although from_utf8_unchecked can create a String which isn't valid UTF-8, doing so is probably not a good idea.)
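To make the constraint concrete, here is a minimal sketch using only the standard library. The helper name try_as_text and the sample bytes are made up for illustration. It shows that the checked conversion String::from_utf8 accepts JSON text but rejects arbitrary binary, and that the rejected bytes are handed back unchanged.

```rust
/// Hypothetical helper: returns the text if the bytes are valid UTF-8,
/// otherwise gives the original bytes back untouched.
fn try_as_text(bytes: Vec<u8>) -> Result<String, Vec<u8>> {
    // String::from_utf8 validates the bytes; on failure the error
    // still owns the input, so nothing is lost.
    String::from_utf8(bytes).map_err(|e| e.into_bytes())
}

fn main() {
    // The bytes of a JSON document are valid UTF-8.
    let json_bytes = b"{\"my_data\":10}".to_vec();
    match try_as_text(json_bytes) {
        Ok(text) => println!("text: {text}"),
        Err(bytes) => println!("binary: {bytes:?}"),
    }

    // Made-up binary payload: the byte 0xFF never occurs in valid UTF-8.
    let binary = vec![0x08, 0x0a, 0xff, 0xfe];
    match try_as_text(binary) {
        Ok(text) => println!("text: {text}"),
        Err(bytes) => {
            println!("binary: {bytes:?}");
            // A lossy conversion is possible, but it replaces invalid
            // sequences with U+FFFD and cannot be reversed.
            println!("lossy: {}", String::from_utf8_lossy(&bytes));
        }
    }
}
```

The Err branch is where a Vec<u8> (or a text encoding of it) is needed; a String cannot hold that payload safely.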

Serializing as Json, or other formats

I am currently serializing and deserializing Rust structs in JSON format. This produces valid UTF-8 formatted strings.

However, JSON is a temporary design choice, and it is likely that in the future this will be changed, most probably to Protobuf, or some other binary format such as Avro.

In this case, the output from serializing will no longer be valid UTF-8. It is not a UTF-8 string, but an arbitrary binary string.

What structure can be used to hold an arbitrary binary String? The most obvious choice would seem to be Vec<u8>, however there are issues with this which I will describe below.

Another possible alternative might be OsString. However I am not sure if this is really the correct container to use. The name OsString doesn't quite imply the same level of generality as a name like BinaryString would, hence I am not totally clear about the intended purpose of this container, particularly given that Vec<u8> exists.

Results of Serializing and Deserializing with Serde

Serde provides two choices for serializing and deserializing. These are:

  • to_string and from_str
  • to_vec and from_slice

The following demonstration code shows the output produced by both of these:

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct MessageBody {
    pub encoded_message_body: String,
    pub encoded_message_body_vec: Vec<u8>,
}

#[derive(Serialize, Deserialize)]
struct Message {
    pub message_body: MessageBody,
}

#[derive(Serialize, Deserialize)]
struct MyType {
    pub my_data: i32,
}

fn main() {
    let my_type = MyType { my_data: 10 };

    // Encode the payload twice with serde_json: once as a String, once as bytes
    let encoded_my_type = serde_json::to_string(&my_type).unwrap();
    let encoded_my_type_vec = serde_json::to_vec(&my_type).unwrap();

    let message_body = MessageBody {
        encoded_message_body: encoded_my_type,
        encoded_message_body_vec: encoded_my_type_vec,
    };

    let message = Message { message_body };

    //let encoded_message = serde_json::to_vec(&message).unwrap();
    let encoded_message = serde_json::to_string(&message).unwrap();

    println!("{:?}", encoded_message);
}

The output is shown below.

"{\"message_body\":{\"encoded_message_body\":\"{\\\"my_data\\\":10}\",\"encoded_message_body_vec\":[123,34,109,121,95,100,97,116,97,34,58,49,48,125]}}"

One output is human readable, the other is not. encoded_message_body is human readable. encoded_message_body_vec has been serialized as a JSON array of numbers, one number per byte. This is not human readable, despite the fact that the binary data in memory is a block of ASCII (UTF-8) characters which is itself formatted as JSON.

Please note: I have chosen to "display" the encoded message using println!. This data is actually being directed to Kafka. However, I could not write a MWE if Kafka is included. Exactly the same text is shown from a Kafka console consumer.

In line with the above information, I am aiming to produce something which looks more like this:

"{\"message_body\":{\"encoded_message_body\":\"{\\\"my_data\\\":10}\",\"encoded_message_body_vec\":{\"my_data\":10}}}"

Summary

It appears that I am faced with a choice:

  • arbitrary serialization format (JSON, Proto, or other) by storing the result in a Vec<u8>, but this is not human readable
  • human readable output by storing the result in String, but this only supports encodings such as JSON which serialize to valid UTF-8

Is it possible to have both human readable output, and arbitrary serialization format? Of course, for binary formats, the output will be a garbled string, and will not be human readable, but for formats such as JSON, it will maintain the human readability benefits.

Hopefully the question is clear? If not please ask further questions and I will try to clarify...

Further Reflections

Someone else (offline) raised the point that we can't dump raw binary into a JSON formatted thing, because that binary will likely contain data (characters) which have special meaning to JSON decoders. (eg: {...)

This is a slightly different way of thinking about the problem, and it is a very good point to have raised.

  • This means that if we want to have the outer message encoded with JSON (which we do) then the binary payload part must be encoded as something like Base64, Base85, etc.
  • Someone else pointed this out in one of the answers too.

This pushes me towards thinking that we might not actually gain anything by changing the "payload" to a binary encoding, even if the payload is large, because this would inevitably require encoding and decoding through an additional layer.

[struct] <-> BinarySerializedData <-> BaseXXEncoded (String, `encoded_message_body`)
    <-> JSON Encoded String <-> [Wire]

To understand the performance impact, some tests would need to be run. It might make sense to do this, or it might not, possibly depending on the average size of the binary payload.
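The layering described above (binary payload, then a BaseXX text encoding, then the JSON envelope) can be illustrated with a small sketch that uses only the standard library. The helpers base64_encode, base64_decode and wrap_in_json are hypothetical and written here purely for illustration. The JSON envelope is built by hand to keep the example free of dependencies, and the payload bytes are made up. In real code a maintained Base64 crate together with serde would be the more sensible choice.

```rust
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard Base64 with '=' padding.
fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-padded on the right.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let group = (b0 << 16) | (b1 << 8) | b2;
        out.push(ALPHABET[(group >> 18) as usize & 63] as char);
        out.push(ALPHABET[(group >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { ALPHABET[(group >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { ALPHABET[group as usize & 63] as char } else { '=' });
    }
    out
}

/// Inverse of base64_encode; returns None on malformed input.
fn base64_decode(text: &str) -> Option<Vec<u8>> {
    let bytes = text.as_bytes();
    if bytes.len() % 4 != 0 {
        return None;
    }
    let mut out = Vec::with_capacity(bytes.len() / 4 * 3);
    let mut finished = false; // set once a padded group has been seen
    for quad in bytes.chunks(4) {
        if finished {
            return None; // data after a padded group
        }
        let mut group = 0u32;
        let mut pad = 0;
        for &c in quad {
            let value = if c == b'=' {
                pad += 1;
                0
            } else {
                if pad > 0 {
                    return None; // data after padding inside a group
                }
                ALPHABET.iter().position(|&a| a == c)? as u32
            };
            group = (group << 6) | value;
        }
        if pad > 2 {
            return None;
        }
        finished = pad > 0;
        out.push((group >> 16) as u8);
        if pad < 2 {
            out.push((group >> 8) as u8);
        }
        if pad < 1 {
            out.push(group as u8);
        }
    }
    Some(out)
}

/// Hand-built JSON envelope. This is safe only because the Base64
/// alphabet contains no characters that need escaping in a JSON string.
fn wrap_in_json(payload: &[u8]) -> String {
    format!(
        "{{\"message_body\":{{\"encoded_message_body\":\"{}\"}}}}",
        base64_encode(payload)
    )
}

fn main() {
    // Made-up binary payload that is not valid UTF-8.
    let payload = [0x08u8, 0x0a, 0xff, 0xfe];
    println!("{}", wrap_in_json(&payload));
    assert_eq!(base64_decode(&base64_encode(&payload)), Some(payload.to_vec()));
}
```

The cost of the extra layer is visible in the encoder: every 3 payload bytes become 4 characters, so the encoded payload is roughly a third larger before the JSON envelope is added.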

Answer 1

Score: 2

> from_utf8_unchecked can create a String which isn't valid utf-8, this is probably not a good idea

It's straight-up immediate UB (undefined behavior).

> What structure can be used to hold an arbitrary binary String? The most obvious choice would seem to be Vec<u8>, however there are issues with this which I will describe below.

There is bstr.

> Another possible alternative might be OsString. However I am not sure if this is really the correct container to use.

Absolutely not.

> Is it possible to have both human readable output, and arbitrary serialization format? Of course, for binary formats, the output will be a garbled string, and will not be human readable, but for formats such as JSON, it will maintain the human readability benefits.

There are basically two ways:

  • an ASCII-binary format, which outputs the ASCII data directly and handles the non-ASCII bytes separately, e.g. bstr (I think) or a bespoke printer for Vec<u8>. However, this may not be the best idea, as the binary data may contain sequences which are interpreted by e.g. the terminal, leading to odd effects unless the printer takes that into account
  • alternatively, smuggle the binary data into a normal string using a binary-to-text encoding, e.g. base32, base64, base85, uuencode, ... This generally requires some framing, but it is a common method dating back to the origins of computing
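As a rough illustration of the first option, the sketch below uses the standard library's escape_ascii method on byte slices as a bespoke printer. The function name printable is made up for this example. Printable ASCII passes through unchanged, while control bytes (including the ESC byte that starts terminal escape sequences) and non-ASCII bytes are rendered as \xNN. Note that quotes and backslashes are escaped as well, so embedded JSON stays readable but is not reproduced byte for byte.

```rust
/// Hypothetical printer: renders bytes as printable ASCII, escaping
/// everything that a terminal might otherwise interpret.
fn printable(bytes: &[u8]) -> String {
    // <[u8]>::escape_ascii applies the same rules as std::ascii::escape_default.
    bytes.escape_ascii().to_string()
}

fn main() {
    // JSON bytes remain readable (quotes gain a backslash).
    println!("{}", printable(b"{\"my_data\":10}"));

    // A made-up payload containing a terminal colour sequence and an invalid
    // UTF-8 byte: both are shown as escapes instead of reaching the terminal.
    println!("{}", printable(b"\x1b[31mred\xff"));
}
```

Unlike a binary-to-text encoding, this form is intended for display; a matching parser would be needed to read the bytes back.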

Answer 2

Score: 1

I could find nothing that supported that use-case in serde or other popular crates, but we can build our own!

A possible strategy is to use a wrapper around Vec<u8>, and serialize it as an array of strings and numbers, one element for each Unicode and non-Unicode chunk in the data. I will use the bstr crate for that, because it is suited for handling partially-UTF-8 data.

use bstr::{BString, ByteSlice};
use serde::ser::{Serialize, SerializeSeq, Serializer};

pub struct HumanReadableBStr(BString);

impl Serialize for HumanReadableBStr {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        let mut seq = serializer.serialize_seq(None)?;
        for chunk in self.0.utf8_chunks() {
            // Valid UTF-8 runs are emitted as strings
            if !chunk.valid().is_empty() {
                seq.serialize_element(chunk.valid())?;
            }
            // Invalid bytes are emitted as a sequence of numbers
            if !chunk.invalid().is_empty() {
                seq.serialize_element(chunk.invalid())?;
            }
        }
        seq.end()
    }
}
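To show what kind of chunks such a serializer walks over, here is a dependency-free sketch of the same splitting idea built on std::str::from_utf8. The Chunk enum and split_chunks function are hypothetical names for illustration only; they are not part of bstr or serde, and the exact chunk boundaries bstr produces for invalid bytes may differ from this sketch.

```rust
/// Hypothetical chunk type: either a run of valid UTF-8 or a run of raw bytes.
#[derive(Debug, PartialEq)]
enum Chunk<'a> {
    Text(&'a str),
    Bytes(&'a [u8]),
}

/// Splits the input into alternating valid-UTF-8 and invalid-byte chunks.
fn split_chunks(mut input: &[u8]) -> Vec<Chunk<'_>> {
    let mut chunks = Vec::new();
    while !input.is_empty() {
        match std::str::from_utf8(input) {
            Ok(text) => {
                // Everything that is left is valid text.
                chunks.push(Chunk::Text(text));
                break;
            }
            Err(err) => {
                let (valid, rest) = input.split_at(err.valid_up_to());
                if !valid.is_empty() {
                    // valid_up_to() guarantees this prefix is valid UTF-8.
                    chunks.push(Chunk::Text(std::str::from_utf8(valid).unwrap()));
                }
                // error_len() is None when the input ends in the middle of a
                // character; in that case the whole remainder is invalid.
                let bad_len = err.error_len().unwrap_or(rest.len());
                let (bad, tail) = rest.split_at(bad_len);
                chunks.push(Chunk::Bytes(bad));
                input = tail;
            }
        }
    }
    chunks
}

fn main() {
    // Made-up payload: readable text with one invalid byte in the middle.
    for chunk in split_chunks(b"abc\xffdef") {
        println!("{chunk:?}");
    }
}
```

With this shape, a payload that is entirely JSON text comes out as a single Text chunk, so it stays readable inside the outer message, while binary payloads fall back to numeric bytes.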

huangapple
  • Published on May 17, 2023, 21:14:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76272521.html