The length of a compressed Java String is not equal to the content-length when it is sent as a WebSocket message

Question

I am trying to reduce bandwidth consumption by compressing the JSON String that my Spring Boot application sends through the WebSocket to the browser client (this is on top of the permessage-deflate WebSocket extension). This scenario uses the following JSON String, which has a length of 383 characters:

{"headers":{},"body":{"message":{"errors":{"password":"Password length must be at least 8 characters.","retype":"Retype Password cannot be null.","username":"Username length must be between 6 to 64 characters."},"links":[],"success":false,"target":{"password":"","retype":"","username":""}},"target":"/user/session/sign-up"},"statusCode":"UNPROCESSABLE_ENTITY","statusCodeValue":422}

To benchmark, I send both the uncompressed and the compressed String from the server like so:

Object response = …,

SimpMessageHeaderAccessor simpHeaderAccessor =
    SimpMessageHeaderAccessor.create(SimpMessageType.MESSAGE);
simpHeaderAccessor.setSessionId(sessionId);
simpHeaderAccessor.setContentType(new MimeType("application", "json",
    StandardCharsets.UTF_8));
simpHeaderAccessor.setLeaveMutable(true);
// Sends the uncompressed message.
messagingTemplate.convertAndSendToUser(sessionId, uri, response,
    simpHeaderAccessor.getMessageHeaders());

ObjectMapper mapper = new ObjectMapper();
String jsonString;

try {
    jsonString = mapper.writeValueAsString(response);
}
catch(JsonProcessingException e) {
    jsonString = response.toString();
}

log.info("The payload is application/json.");
log.info("uncompressed payload (" + jsonString.length() + " character):");
log.info(jsonString);

String lzStringCompressed = LZString.compress(jsonString);
simpHeaderAccessor = SimpMessageHeaderAccessor.create(SimpMessageType.MESSAGE);
simpHeaderAccessor.setSessionId(sessionId);
simpHeaderAccessor.setContentType(new MimeType("text", "plain",
    StandardCharsets.UTF_8));
simpHeaderAccessor.setLeaveMutable(true);
// Sends the compressed message.
messagingTemplate.convertAndSendToUser(sessionId, uri, lzStringCompressed,
    simpHeaderAccessor.getMessageHeaders());

log.info("The payload is text/plain.");
log.info("compressed payload (" + lzStringCompressed.length() + " character):");
log.info(lzStringCompressed);

This logs the following lines in the Java console:

The payload is application/json.
uncompressed payload (383 character):
{"headers":{},"body":{"message":{"errors":{"password":"Password length must be at least 8 characters.","retype":"Retype Password cannot be null.","username":"Username length must be between 6 to 64 characters."},"links":[],"success":false,"target":{"password":"","retype":"","username":""}},"target":"/user/session/sign-up"},"statusCode":"UNPROCESSABLE_ENTITY","statusCodeValue":422}
The payload is text/plain.
compressed payload (157 character):
??????????¼??????????????p??!-??7??????????????????????????????????u??????????????????????·}???????????????????????????????????????/?┬R??b,??????m??????????

The browser then receives the two messages sent by the server, which are captured by this JavaScript:

stompClient.connect({}, function(frame) {
    stompClient.subscribe(stompClientUri, function(payload) {
        try {
            JSON.parse(payload.body);
            console.log("The payload is application/json.");
            console.log("uncompressed payload (" + payload.body.length + " character):");
            console.log(payload.body);

            payload = JSON.parse(payload.body);
        } catch (e) {
            try {
                payload = payload.body;
                console.log("The payload is text/plain.");
                console.log("compressed payload (" + payload.length + " character):");
                console.log(payload);

                var decompressPayload = LZString.decompress(payload);
                console.log("decompressed payload (" + decompressPayload.length + " character):");
                console.log(decompressPayload);

                payload = JSON.parse(decompressPayload);
            } catch (e) {
            } finally {
            }
        } finally {
        }
    });
});

This displays the following lines in the browser's debug console:

The payload is application/json.
uncompressed payload (383 character):
{"headers":{},"body":{"message":{"errors":{"password":"Password length must be at least 8 characters.","retype":"Retype Password cannot be null.","username":"Username length must be between 6 to 64 characters."},"links":[],"success":false,"target":{"password":"","retype":"","username":""}},"target":"/user/session/sign-up"},"statusCode":"UNPROCESSABLE_ENTITY","statusCodeValue":422}
The payload is text/plain.
compressed payload (157 character):
ᯡࠥ䅬ࢀጨᎡ乀ஸ̘͢¬ߑ䁇啰˸⑱ᐣ䱁ሢ礒⽠݉ᐮ皆⩀p瑭漦!-䈠ᷕ7ᡑ刡⺨狤灣મ啃嵠ܸ䂃ᡈ硱䜄ቀρۯĮニᴴဠ䫯⻖֑点⇅劘畭ᣔ奢⅏㛥⡃Ⓛ撜u≂㥋╋ၲ⫋䋕᪒丨ಸ䀭䙇Ꮴ吠塬昶⬻㶶Т㚰ͻၰú}㙂᥸沁⠈ƹ⁄᧸㦓ⴼ䶨≋愐㢡ᱼ溜涤簲╋㺮橿䃍砡瑧ᮬ敇⼺ℙ滆䠢榵ⱀ盕ີ‣Ш眨રą籯/ሤÂR儰Ȩb,帰Ћ愰䀥․䰂m㛠ளǀ䀭❖⧼㪠Ө柀䀠 
decompressed payload (383 character):
{"headers":{},"body":{"message":{"errors":{"password":"Password length must be at least 8 characters.","retype":"Retype Password cannot be null.","username":"Username length must be between 6 to 64 characters."},"links":[],"success":false,"target":{"password":"","retype":"","username":""}},"target":"/user/session/sign-up"},"statusCode":"UNPROCESSABLE_ENTITY","statusCodeValue":422}

At this point I can verify that whatever String value my Spring Boot application compresses, the browser is able to decompress it and get the original String back. There is a problem, though. When I checked in the browser debugger whether the size of the transferred message was actually reduced, it showed that it was not.

Here is the raw uncompressed message (598B):

a["MESSAGE destination:/user/session/broadcast
content-type:application/json;charset=UTF-8
subscription:sub-0
message-id:5lrv4kl1-1
content-length:383

{"headers":{},"body":{"message":{"errors":{"password":"Password length must be at least 8 characters.","retype":"Retype Password cannot be null.","username":"Username length must be between 6 to 64 characters."},"links":[],"success":false,"target":{"password":"","retype":"","username":""}},"target":"/user/session/sign-up"},"statusCode":"UNPROCESSABLE_ENTITY","statusCodeValue":422}

While this is the raw compressed message (589B):

a["MESSAGE destination:/user/session/broadcast
content-type:text/plain;charset=UTF-8
subscription:sub-0
message-id:5lrv4kl1-2
content-length:425

á¯¡à ¥ä…¬à¢€áŒ¨áŽ¡ä¹€à®¸Ì˜Í¢Â¬ß‘ä‡å•°Ë¸â‘±á£ä±áˆ¢ç¤’â½Ý‰á®çš†â©€p瑭漦!-äˆ á·•7ᡑ刡⺨狤灣મ啃åµÜ¸ä‚ƒá¡ˆç¡±äœ„ቀρۯĮニᴴá€ä«¯â»–֑点⇅劘畭ᣔ奢⅏㛥⡃Ⓛ撜u≂㥋╋ၲ⫋䋕᪒丨ಸ䀭䙇Ꮴåå¡¬æ˜¶â¬»ã¶¶Ð¢\u2029㚰ͻၰú}㙂᥸沁âˆÆ¹â„᧸㦓ⴼ䶨≋愐㢡ᱼ溜涤簲╋㺮橿䃍ç¡ç‘§á®¬æ•‡â¼ºâ„™æ»†ä¢æ¦µâ±€ç›•àºµâ€£Ð¨çœ¨àª°Ä…籯/ሤÂR儰Ȩb,帰Ћ愰䀥․䰂mã›à®³Ç€ä€­â–â§¼ãª Ó¨æŸ€ä€  \u0000"]

The debug console indicates that the uncompressed message was transferred with a size of 598B, with a message payload of 383 characters (as indicated by the content-length header). The compressed message, on the other hand, was transferred with a total size of 589B, 9B smaller than the uncompressed one, with a message payload of 425 characters. I have several questions:

  1. Is the content-length of the STOMP message indicated in bytes, or in characters?
  2. Why is the content-length of the uncompressed message, which is 383, smaller than that of the compressed message, which is 425?
  3. Does this mean that reducing the character length does not always mean reducing the size?
  4. Why is the content-length of the compressed message, which is 425, not the same as the value returned in the Java console (using lzStringCompressed.length()), which is 157, considering that the uncompressed message was transferred with a content-length of 383, which is the same as its length in the Java console? Both are transferred with charset=UTF-8 encoding.
  5. Why is the content-length of the compressed message 425, when both lzStringCompressed.length() in the Java console and payload.length in the JavaScript code return 157?
  6. If it really gets bloated during the transfer, why does the application/json message remain unaffected while only the text/plain one gets bloated?

While the 9B difference is still a difference, I am reconsidering whether the overhead of compressing and decompressing the message is worth keeping. I have to test other String values for that.

Answer 1

Score: 5

All the questions are closely related.

> 1. Is the content-length of the STOMP message indicated in bytes, or in characters?

As you can see in the STOMP specification:

> All frames MAY include a content-length header. This header is an octet count for the length of the message body....

From a STOMP perspective the body is a byte array, and the content-type and content-length headers determine what the body contains and how it should be interpreted.
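As a rough illustration of that octet count, the sketch below assembles a minimal MESSAGE frame by hand using only the JDK. StompFrameSketch, buildFrame and contentLength are hypothetical names made up for this example; Spring's own frame encoder is not shown here and may differ in its details.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class StompFrameSketch {

    // Number of octets the body occupies once encoded as UTF-8.
    static int contentLength(String body) {
        return body.getBytes(StandardCharsets.UTF_8).length;
    }

    // Hypothetical helper: builds a bare-bones MESSAGE frame for illustration only.
    static byte[] buildFrame(String destination, String contentType, String body) {
        byte[] bodyBytes = body.getBytes(StandardCharsets.UTF_8);
        String head = "MESSAGE\n"
                + "destination:" + destination + "\n"
                + "content-type:" + contentType + "\n"
                // The header counts encoded bytes, not String.length().
                + "content-length:" + bodyBytes.length + "\n"
                + "\n";
        byte[] headBytes = head.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(headBytes, 0, headBytes.length);
        out.write(bodyBytes, 0, bodyBytes.length);
        out.write(0); // a STOMP frame is terminated by a NUL octet
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 11 ASCII chars -> 11 bytes
        System.out.println(contentLength("{\"ok\":true}"));
        // 3 chars above U+07FF -> 3 bytes each, 9 bytes in total
        System.out.println(contentLength("\u1BE1\u0825\u416C"));
    }
}
```

With an ASCII-only body the header value equals String.length(); as soon as the body contains non-ASCII chars the two numbers diverge.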

> 2. Why is the content-length of the uncompressed message, which is 383, smaller than that of the compressed message, which is 425?

Because of the conversion to UTF-8 that your STOMP server performs when it sends the message to the client.

You have a message, a String, and this message is composed of a series of characters.

Without going into great detail (the answers linked from the original post give further information), internally every char in Java is a 16-bit Unicode (UTF-16) code unit.

To represent these code units in a particular character encoding, UTF-8 in your case, a variable number of bytes is required: one to three bytes for a single char, or four bytes for a surrogate pair (two chars).

In the case of the uncompressed message, you have 383 chars of pure ASCII, which UTF-8 encodes with one byte per char. This is why you get the same value in the content-length header.

That is not the case for the compressed message. Compressing gives you 157 chars, Unicode code units, each carrying an arbitrary 16-bit value: roughly 314 bytes of raw data, which is less than the original message. But that String is then encoded as UTF-8. A few of these 157 chars are ASCII and take one byte, as in the original message, but because the compressed data is essentially arbitrary, most of them fall outside the ASCII range and need two or three bytes each. That is why you end up with more bytes (425) than for the uncompressed message (383).
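The effect is easy to reproduce with only the JDK. In the sketch below, Utf8WidthSketch and utf8Bytes are illustrative names (they are not part of LZString or Spring), and the three sample chars were picked only because they need one, two and three bytes respectively. For the numbers in the question, 425 bytes over 157 chars works out to about 2.7 bytes per char, which fits a payload made mostly of three-byte chars.

```java
import java.nio.charset.StandardCharsets;

class Utf8WidthSketch {

    // Bytes needed to encode the whole string as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "p",       // U+0070: ASCII, 1 byte
            "\u00AC",  // U+00AC: between U+0080 and U+07FF, 2 bytes
            "\u4E00",  // U+4E00: between U+0800 and U+FFFF, 3 bytes
        };
        for (String s : samples) {
            System.out.printf("U+%04X chars=%d bytes=%d%n",
                    (int) s.charAt(0), s.length(), utf8Bytes(s));
        }
    }
}
```

Every sample has a length() of 1, yet the encoded size ranges from one to three bytes.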

> 3. Does this mean that reducing the character length does not always mean reducing the size?

In general, compressing data does give you fewer bytes of information.

If the payload is large enough to make compression worthwhile, and you are able to send the compressed data as raw binary (similar to a server sending a response with Content-Encoding: gzip or deflate), it can bring a great benefit.

But if the client library can only handle text messages and not binary ones, as is the case with SockJS for instance, the encoding step may, as you have seen, actually make the result larger.

To mitigate the problem, you can encode the compressed bytes with an intermediate text encoding such as Base64, which yields roughly 1.33 times (4/3) the number of compressed bytes, all of them ASCII: if that value is less than the number of bytes without compression, compressing the message may be worth it.
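A minimal sketch of that idea follows. It swaps LZString for the JDK's java.util.zip.Deflater purely so the example is self-contained; Base64CompressionSketch and its methods are hypothetical names, and the browser side would need a matching Base64-decode-and-inflate step that is not shown.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class Base64CompressionSketch {

    // Compresses raw bytes with DEFLATE (zlib wrapper).
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverses deflate(); rejects truncated or corrupt input.
    static byte[] inflate(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Text -> UTF-8 bytes -> DEFLATE -> Base64 (ASCII only, one byte per char on the wire).
    static String compressToBase64(String text) {
        return Base64.getEncoder().encodeToString(deflate(text.getBytes(StandardCharsets.UTF_8)));
    }

    static String decompressFromBase64(String encoded) {
        return new String(inflate(Base64.getDecoder().decode(encoded)), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String json = "{\"errors\":{\"password\":\"Password length must be at least 8 characters.\","
                + "\"username\":\"Username length must be between 6 to 64 characters.\"}}";
        int plain = json.getBytes(StandardCharsets.UTF_8).length;
        int raw = deflate(json.getBytes(StandardCharsets.UTF_8)).length;
        int base64 = compressToBase64(json).length();
        // Compare the three sizes before deciding whether compression pays off.
        System.out.println("plain=" + plain + " deflated=" + raw + " base64=" + base64);
    }
}
```

Because Base64 output is pure ASCII, its String length equals its UTF-8 byte count, so the comparison against the uncompressed size can be done up front on the server. For short payloads the overhead may well cancel out the gain.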

In any case, as indicated in the specification, STOMP is text based but also allows for the transmission of binary messages. Also, it indicates that the default encoding for STOMP is UTF-8, but it supports the specification of alternative encodings for message bodies.

If you are using stomp-js, as your code suggests, it seems possible to process binary messages as well, according to its documentation. Please be aware that I have not used this library.

Basically, your server must send the raw bytes with a content-type header whose value is application/octet-stream.

This information can then be processed on the client side by the library with something similar to this:

    // within message callback
    if (message.headers['content-type'] === 'application/octet-stream') {
      // message is binary
      // call message.binaryBody 
    } else {
      // message is text
      // call message.body
    }

If this works and you can send the compressed data this way then, as indicated previously, compression could bring you a great benefit.

> 4. Why is the content-length of the compressed message, which is 425, not the same as the value returned in the Java console (using lzStringCompressed.length()), which is 157, considering that the uncompressed message was transferred with a content-length of 383, which is the same as its length in the Java console? Both are transferred with charset=UTF-8 encoding.

Consider the Javadoc of the length method of the String class:

> Returns the length of this string. The length is equal to the number of Unicode code units in the string.

As you can see, the length method gives you the number of Unicode code units in the String, whereas the content-length header gives you the number of bytes required to represent them in UTF-8, as indicated previously.

In fact, calculating the length of the string could be a tricky task.
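To give an idea of why it is tricky, the sketch below reports three different "lengths" for the same String: UTF-16 code units, Unicode code points, and UTF-8 bytes. StringLengthSketch and its method names are made up for this illustration; the JDK calls themselves (length, codePointCount, getBytes) are standard.

```java
import java.nio.charset.StandardCharsets;

class StringLengthSketch {

    // What String.length() reports: UTF-16 code units.
    static int codeUnits(String s) {
        return s.length();
    }

    // Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // What a STOMP content-length header would report for a UTF-8 body.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "abc";            // 3 code units, 3 code points, 3 bytes
        String accented = "\u00E9";      // 1 code unit, 1 code point, 2 bytes
        String emoji = "\uD83D\uDE00";   // U+1F600: 2 code units, 1 code point, 4 bytes
        for (String s : new String[] {ascii, accented, emoji}) {
            System.out.println(codeUnits(s) + " " + codePoints(s) + " " + utf8Bytes(s));
        }
    }
}
```

Only for pure ASCII text do all three numbers agree, which is exactly the situation of the uncompressed JSON in the question.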

> 5. Why is the content-length of the compressed message 425, when both lzStringCompressed.length() in the Java console and payload.length in the JavaScript code return 157?

Because, as you can see in the documentation, length in JavaScript also gives the length of the String in UTF-16 code units, not in bytes:

> The length property of a String object contains the length of the string, in UTF-16 code units. length is a read-only data property of string instances.

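> 6. If it really gets bloated during the transfer, why does the application/json message remain unaffected while only the text/plain one gets bloated?

As mentioned above, it has nothing to do with the Content-Type but with the encoding of the information: the application/json body is pure ASCII and takes one byte per char in UTF-8, while the compressed text/plain body consists mostly of non-ASCII chars.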

huangapple
  • Published on September 18, 2020 at 16:32:29
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63952094.html