Protobuf: parsing of optional fields works if the incoming data contains multiple values for that field?
Question
I have a gRPC server whose incoming messages contain fields marked as optional. Because the client that sends data to the server runs in an untrusted environment, invalid or modified incoming messages are possible.
In a test I noticed that if the client sends multiple values for such a field (as if it were defined as repeated), protobuf silently ignores the earlier values and uses the last one received, instead of throwing an exception.
Is it possible to switch to something like a "strict mode" so that protobuf throws an error in such a case? Or at least detect such defective incoming messages?
Tested with com.google.protobuf:protobuf-java:3.23.3:
message Msg1 {
  repeated string field = 1;
}
message Msg2 {
  optional string field = 1;
}
Msg1 msg1 = Msg1.newBuilder().addField("value1").addField("value2").addField("value3").build();
byte[] data = msg1.toByteArray(); // serialization on the client side
Msg2 msg2 = Msg2.parseFrom(data); // deserialization on the server side; how can this be made to fail when the format differs?
System.out.println(msg2.getField()); // outputs "value3"
Answer 1
Score: 1
> Or at least detect such defective incoming messages?
One way to detect any message that differs from the schema is to re-encode the decoded message. That forces it into the canonical format, and if the message exactly matches the schema, the two binary representations will be byte-for-byte identical.
However, I do not think you should do this.
If the message is untrusted, what good does it do you to detect one particular kind of modification? The sender fully controls all of the message data anyway. If you want to enforce message integrity, you'll need to add a cryptographic signature or an HMAC.
The behavior when the same field is present multiple times is "last one wins". Sometimes that is useful, sometimes not. But there doesn't seem to be any compelling reason to throw an error for it, and I'm not aware of any protobuf library that supports doing so.
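As a rough illustration of the re-encode check, the sketch below hand-rolls the wire format for a message shaped like Msg2 (a single string field with number 1) using only the JDK. The class and method names (ReencodeCheck, decodeLast, roundTrips) are made up for this example and are not part of protobuf-java. With the generated classes, the analogous comparison would be between Msg2.parseFrom(data).toByteArray() and the received bytes. Note that a legitimate message produced by a different encoder may also fail a byte-for-byte comparison, so treat this as a heuristic.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stand-in for Msg2 { optional string field = 1; }, for illustration only. */
final class ReencodeCheck {
    // Tag for field number 1 with wire type 2 (length-delimited): (1 << 3) | 2
    static final int FIELD1_TAG = 0x0A;

    static int readVarint32(ByteBuffer in) {
        int result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            if (!in.hasRemaining()) throw new IllegalArgumentException("truncated varint");
            byte b = in.get();
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
        }
        throw new IllegalArgumentException("varint too long");
    }

    static void writeVarint32(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Decodes field 1 with "last one wins" semantics; returns null if the field is absent. */
    static byte[] decodeLast(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data);
        byte[] last = null;
        while (in.hasRemaining()) {
            int tag = readVarint32(in);
            // This sketch only understands field 1; anything else is rejected.
            if (tag != FIELD1_TAG) throw new IllegalArgumentException("unsupported tag: " + tag);
            int len = readVarint32(in);
            if (len < 0 || len > in.remaining()) throw new IllegalArgumentException("bad length");
            last = new byte[len];
            in.get(last);
        }
        return last;
    }

    /** Encodes a single occurrence of field 1, or nothing if the value is null. */
    static byte[] encode(byte[] value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value != null) {
            writeVarint32(out, FIELD1_TAG);
            writeVarint32(out, value.length);
            out.write(value, 0, value.length);
        }
        return out.toByteArray();
    }

    /** Simulates a sender that writes field 1 once per value, as a repeated field would. */
    static byte[] encodeRepeated(String... values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String v : values) {
            byte[] one = encode(v.getBytes(StandardCharsets.UTF_8));
            out.write(one, 0, one.length);
        }
        return out.toByteArray();
    }

    /** True if decoding and re-encoding reproduces the input bytes exactly. */
    static boolean roundTrips(byte[] data) {
        return Arrays.equals(encode(decodeLast(data)), data);
    }

    public static void main(String[] args) {
        byte[] single = encodeRepeated("value3");
        byte[] dup = encodeRepeated("value1", "value2", "value3");
        System.out.println(new String(decodeLast(dup), StandardCharsets.UTF_8)); // value3
        System.out.println(roundTrips(single)); // true
        System.out.println(roundTrips(dup));    // false
    }
}
```

The duplicated input still decodes to "value3", but it no longer round-trips, which is the signal this check relies on.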
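A minimal sketch of the HMAC idea, using only the JDK's javax.crypto classes. It assumes the serialized message bytes are authenticated as an opaque payload and that the key is held only by parties you trust; the class name MessageAuth, the key, and the payload are placeholders invented for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical helper: HMAC-SHA256 over the serialized message bytes. */
final class MessageAuth {
    private static final String ALGORITHM = "HmacSHA256";

    /** Computes the authentication tag for the given payload. */
    static byte[] sign(byte[] key, byte[] payload) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(new SecretKeySpec(key, ALGORITHM));
            return mac.doFinal(payload);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC-SHA256 unavailable or key invalid", e);
        }
    }

    /** Recomputes the tag and compares it with the received one. */
    static boolean verify(byte[] key, byte[] payload, byte[] receivedTag) {
        // MessageDigest.isEqual avoids leaking the mismatch position through timing.
        return MessageDigest.isEqual(sign(key, payload), receivedTag);
    }

    public static void main(String[] args) {
        byte[] key = "demo-key-not-for-production".getBytes(StandardCharsets.UTF_8);
        byte[] payload = {0x0A, 0x06, 'v', 'a', 'l', 'u', 'e', '3'}; // stands in for msg.toByteArray()
        byte[] tag = sign(key, payload);

        System.out.println(verify(key, payload, tag)); // true

        payload[7] = '4'; // simulate tampering in transit
        System.out.println(verify(key, payload, tag)); // false
    }
}
```

This detects any change to the bytes after signing, not just duplicated fields. It does not help if the party holding the key is itself the one producing the odd messages.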
Answer 2
Score: 0
Apparently not.
This happens with oneofs too. If a sender contrives to send more than one of a oneof's fields, the parser simply keeps the last one received.
It is pretty bizarre, especially that it is so silent about it. The whole idea of a schema-based serialisation system like GPB is that you can easily have an agreed contract / ICD / message set. They've managed to make it very hard to detect in code that the contract isn't actually met...
I suspect the reason they've done this goes back to why Google invented GPB in the first place: to store data within Google's own systems. They probably wanted something that could do pretty well even if old data was a long way out of spec compared to the current version of a schema. For most of Google's stuff, it doesn't matter if things occasionally go wrong; analytics, search, etc. aren't exact anyway, so who cares if an old bit of data parses weirdly in some server somewhere?
Being more tidy-minded than Google appears to be, I much prefer to use ASN.1; it's a much more capable serialiser in the first place, and it makes it very easy to enforce an ICD / contract.
When Google first announced GPB at a dev conference, along with the reasons why they'd done it, apparently someone in the crowd asked, "Why didn't you just use ASN.1?" The answer was that Google had never heard of it, Google's developers not having used Google to search the Internet for "binary serialisation standards". Says it all, really.
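If you do want to notice this on the receiving side, one possible approach is to scan the raw bytes yourself before handing them to the generated parser. The sketch below walks the top-level wire-format tags with plain JDK code and counts how often each field number occurs. The class WireScanner and its methods are hypothetical names for this example, not protobuf-java API, and the sketch only looks at the top level and rejects the deprecated group wire types.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical pre-parse check that counts top-level field occurrences in protobuf wire data. */
final class WireScanner {

    static long readVarint(ByteBuffer in) {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            if (!in.hasRemaining()) throw new IllegalArgumentException("truncated varint");
            byte b = in.get();
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
        }
        throw new IllegalArgumentException("varint too long");
    }

    static void skip(ByteBuffer in, long n) {
        if (n < 0 || n > in.remaining()) throw new IllegalArgumentException("truncated field");
        in.position(in.position() + (int) n);
    }

    /** Returns a map from field number to the number of times it appears at the top level. */
    static Map<Integer, Integer> countTopLevelFields(byte[] data) {
        Map<Integer, Integer> counts = new HashMap<>();
        ByteBuffer in = ByteBuffer.wrap(data);
        while (in.hasRemaining()) {
            long tag = readVarint(in);
            int fieldNumber = (int) (tag >>> 3);
            int wireType = (int) (tag & 7);
            switch (wireType) {
                case 0: readVarint(in); break;           // varint
                case 1: skip(in, 8); break;              // 64-bit fixed
                case 2: skip(in, readVarint(in)); break; // length-delimited
                case 5: skip(in, 4); break;              // 32-bit fixed
                default: throw new IllegalArgumentException("unsupported wire type " + wireType);
            }
            counts.merge(fieldNumber, 1, Integer::sum);
        }
        return counts;
    }

    /** True if the given field numbers occur at most once in total (a singular field or one oneof group). */
    static boolean atMostOneOf(Map<Integer, Integer> counts, int... fieldNumbers) {
        int total = 0;
        for (int f : fieldNumbers) total += counts.getOrDefault(f, 0);
        return total <= 1;
    }

    public static void main(String[] args) {
        // Field 1 (string) sent twice, as if it were repeated.
        byte[] dup = {0x0A, 1, 'a', 0x0A, 1, 'b'};
        System.out.println(atMostOneOf(countTopLevelFields(dup), 1)); // false

        // Fields 2 and 3 (varints) both present, as if two members of one oneof were set.
        byte[] twoMembers = {0x10, 5, 0x18, 7};
        System.out.println(atMostOneOf(countTopLevelFields(twoMembers), 2, 3)); // false
    }
}
```

The caller has to supply which field numbers are singular or belong to the same oneof, since genuinely repeated fields may legitimately appear many times.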