2023-02-08 18:42:49

Is it ok to have 100s of fields in a protobuf message?

Question

We are developing a set of C++ applications that exchange data through protobuf messages. One of the messages we want to exchange contains a list of type-value pairs. The type is just an integer; the value can be one of a number of different data types, both basic ones like integers or strings and more complex ones like IP addresses or prefixes. For every specific type, however, only one data type is allowed for the value.

type	value data type
1	string
2	integer
3	list<ip_addr>
4	integer
5	struct
6	string
...	...

Note: one of the communicating apps will ultimately encode this list of type-value pairs into a byte array in a network packet according to a fixed protocol format.

There are a few ways to encode this into a protobuf message, but we're currently leaning towards creating a separate protobuf message for each type number:

message Type1
{
    string value = 1;
}
message Type2
{
    int32 value = 1;
}
message Type3
{
    repeated IpAddr value = 1;
}
...
message TVPair
{
    oneof type
    {
        Type1 type_1 = 1;
        Type2 type_2 = 2;
        Type3 type_3 = 3;
        ...
    }
}
message Foo
{
    repeated TVPair tv_pairs = 1;
}

This is clear and easy to use for all applications and it hides the details of the network protocol encoding in the only app that actually needs to take care of it.
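To illustrate how a consuming application might dispatch on the pair type without knowing anything about the wire encoding, here is a minimal sketch. It does not use the classes generated by protoc: `std::variant` is only a stand-in for the `oneof`, and the struct layouts, the IPv4-only `IpAddr`, and the helper names `type_number` and `describe` are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the generated Type1/Type2/Type3 messages.
struct IpAddr { std::uint32_t v4 = 0; };  // assumed IPv4-only for brevity
struct Type1 { std::string value; };
struct Type2 { std::int32_t value = 0; };
struct Type3 { std::vector<IpAddr> value; };

// Stand-in for the TVPair oneof: exactly one alternative is held at a time.
using TVPair = std::variant<Type1, Type2, Type3>;

// Maps the held alternative back to its 1-based protocol type number.
inline int type_number(const TVPair& p) {
    return static_cast<int>(p.index()) + 1;
}

// Example consumer: summarises a pair by inspecting only its typed value.
inline std::string describe(const TVPair& p) {
    return std::visit([](const auto& t) -> std::string {
        using T = std::decay_t<decltype(t)>;
        if constexpr (std::is_same_v<T, Type1>) {
            return "string:" + t.value;
        } else if constexpr (std::is_same_v<T, Type2>) {
            return "int:" + std::to_string(t.value);
        } else {
            return "ips:" + std::to_string(t.value.size());
        }
    }, p);
}
```

With this shape, `describe(TVPair{Type2{42}})` yields `int:42`. Only the application that writes the network packet would need the mapping from type number to byte layout.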

The only worry I have is that the list of type numbers is on the order of a few hundred items. This means a few hundred protobuf messages need to be defined, and the oneof in the TVPair message will contain that many members. I know that field numbers in protobuf messages can go much higher (up to 536,870,911, i.e. 2^29 - 1), so that in itself is not an issue. But are there any downsides to having hundreds of fields in a single protobuf message?
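One small, measurable effect of large field numbers is the size of the field tag on the wire. The tag is the varint encoding of `(field_number << 3) | wire_type`, so field numbers 1 to 15 fit in one byte and 16 to 2047 need two. The sketch below computes this; the helper names `varint_size` and `tag_size` are made up for illustration and are not part of the protobuf API.

```cpp
#include <cstdint>

// Largest field number protobuf allows: 2^29 - 1.
constexpr std::uint32_t kMaxFieldNumber = (1u << 29) - 1;

// Number of bytes a value occupies when encoded as a base-128 varint.
constexpr int varint_size(std::uint64_t v) {
    int n = 1;
    while (v >= 0x80) {
        v >>= 7;
        ++n;
    }
    return n;
}

// Size in bytes of a field's tag. The wire type lives in the low 3 bits,
// so it never changes how many bytes the tag needs.
constexpr int tag_size(std::uint32_t field_number) {
    return varint_size(static_cast<std::uint64_t>(field_number) << 3);
}

static_assert(kMaxFieldNumber == 536870911, "2^29 - 1");
static_assert(tag_size(15) == 1, "fields 1..15 use a one-byte tag");
static_assert(tag_size(16) == 2, "fields 16..2047 use a two-byte tag");
static_assert(tag_size(300) == 2, "a few hundred members still fit in two bytes");
```

For a oneof with a few hundred members, every member numbered above 15 costs one extra tag byte per encoded pair.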

Answer 1

Score: 0

The comment from @DazWilkin pointed me towards some best practices on the Protocol Buffers documentation website:

Don’t Make a Message with Lots of Fields

> Don’t make a message with “lots” (think: hundreds) of fields. In C++ every field adds roughly 65 bits to the in-memory object size whether it’s populated or not (8 bytes for the pointer and, if the field is declared as optional, another bit in a bitfield that keeps track of whether the field is set). When your proto grows too large, the generated code may not even compile (for example, in Java there is a hard limit on the size of a method).

Large Data Sets

> Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.
>
> That said, Protocol Buffers are great for handling individual messages within a large data set. Usually, large data sets are a collection of small pieces, where each small piece is structured data. Even though Protocol Buffers cannot handle the entire set at once, using Protocol Buffers to encode each piece greatly simplifies your problem: now all you need is to handle a set of byte strings rather than a set of structures.
>
> Protocol Buffers do not include any built-in support for large data sets because different situations call for different solutions. Sometimes a simple list of records will do while other times you want something more like a database. Each solution should be developed as a separate library, so that only those who need it need pay the costs.

So although it might be technically possible, it is not advised to create big messages with lots of fields.
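To put the quoted "roughly 65 bits per field" figure in perspective, here is a back-of-the-envelope sketch. It only restates the arithmetic from the quoted guidance (8 bytes per field plus one presence bit for optional fields); the function name `estimated_overhead_bytes` is made up, and the actual layout of generated code may differ between protobuf versions.

```cpp
#include <cstddef>

// Figure taken from the quoted guidance: 8 bytes for the pointer per field.
constexpr std::size_t kPointerBytes = 8;

// Rough per-object overhead in bytes for `fields` optional fields,
// whether or not any of them is populated.
constexpr std::size_t estimated_overhead_bytes(std::size_t fields) {
    // One presence bit per field, rounded up to whole bytes.
    const std::size_t presence_bytes = (fields + 7) / 8;
    return fields * kPointerBytes + presence_bytes;
}

static_assert(estimated_overhead_bytes(100) == 813, "800 + 13 bytes");
static_assert(estimated_overhead_bytes(300) == 2438, "2400 + 38 bytes");
```

By this estimate, a message with 300 such fields carries about 2.4 KB per in-memory object. The quoted text describes regular fields; it does not say whether the same cost applies to members of a oneof, so that is worth measuring for the TVPair case.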
