How to shorten a UTF-8 string

Question

How can I remove the smallest possible suffix from a UTF-8 string so that its size shrinks by at least a given number of bytes? Put differently, I would like to fit a string into a buffer by discarding the smallest possible suffix. How can I implement the following function?

    std::string shorted_uft8_string(std::string str, size_t max_length) {
        std::string result;
        // ... do something
        assert(result.length() <= max_length);
        return result;
    }

The core difficulty is that a UTF-8 character occupies a variable number of bytes (one to four), so the cut must not land in the middle of one.
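To make the difficulty concrete, here is a minimal sketch of how one might detect that a byte-level cut lands inside a character. The helper names `is_continuation` and `cut_splits_code_point` are made up for this illustration, and the check assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// True for bytes of the form 10xxxxxx, which continue a multi-byte code point.
bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0u) == 0x80u;
}

// Hypothetical helper: would keeping only the first n bytes of s
// split a code point in two? Assumes s is valid UTF-8.
bool cut_splits_code_point(const std::string& s, std::size_t n) {
    // s[n] is the first byte that would be dropped. If it is a
    // continuation byte, its lead byte stays behind in the kept prefix.
    return n < s.size() && is_continuation(s[n]);
}
```

For the string "h\xC3\xA9llo" (an "é" encoded as two bytes), a cut at 2 bytes splits the "é", while cuts at 1 or 3 bytes fall on character boundaries.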

Answer 1

Score: 2

Boost provides an iterator, boost::u8_to_u32_iterator, that steps through a UTF-8 string one Unicode code point at a time.
With it you can scan for the last code point that still fits entirely inside the buffer:

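    #include <boost/regex/pending/unicode_iterator.hpp>
    #include <cstddef>
    #include <iterator>
    #include <string_view>

    std::string_view clamp_utf8(std::string_view s, std::size_t maxBytes)
    {
        using Iter = boost::u8_to_u32_iterator<decltype(std::begin(s))>;
        auto i = Iter{s.begin()};
        while (i.base() != s.end()) {
            // Advance a copy by one code point and check whether it still fits.
            auto next = i;
            ++next;
            if (static_cast<std::size_t>(std::distance(s.begin(), next.base())) > maxBytes) {
                break;
            }
            i = next;
        }
        // i.base() now points just past the last code point that fits.
        return s.substr(0, static_cast<std::size_t>(std::distance(s.begin(), i.base())));
    }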

https://godbolt.org/z/szW4n3vqa
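If Boost is not available, the same forward scan can be sketched with the standard library alone. The names `utf8_sequence_length` and `clamp_utf8_no_boost` are hypothetical, not part of Boost or of the answer above, and the sketch assumes the input is valid UTF-8. Unexpected bytes are simply treated as one-byte units, and an incomplete trailing sequence is dropped.

```cpp
#include <cstddef>
#include <string_view>

// Number of bytes in the UTF-8 sequence introduced by lead byte c.
// Assumption: c is a valid lead byte; anything else counts as 1 byte.
constexpr std::size_t utf8_sequence_length(char c) {
    const auto b = static_cast<unsigned char>(c);
    if ((b & 0x80u) == 0x00u) return 1;  // 0xxxxxxx (ASCII)
    if ((b & 0xE0u) == 0xC0u) return 2;  // 110xxxxx
    if ((b & 0xF0u) == 0xE0u) return 3;  // 1110xxxx
    if ((b & 0xF8u) == 0xF0u) return 4;  // 11110xxx
    return 1;                            // stray continuation or invalid byte
}

// Hypothetical Boost-free counterpart of clamp_utf8.
std::string_view clamp_utf8_no_boost(std::string_view s, std::size_t maxBytes) {
    std::size_t kept = 0;  // invariant: kept <= maxBytes and kept <= s.size()
    while (kept < s.size()) {
        const std::size_t len = utf8_sequence_length(s[kept]);
        // Stop if the next code point would exceed the byte budget
        // or run past the end of the input.
        if (len > maxBytes - kept || len > s.size() - kept) {
            break;
        }
        kept += len;
    }
    return s.substr(0, kept);
}
```

Like the Boost version, this walks the string from the front, so its cost grows with the number of bytes kept.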

Answer 2

Score: 2

Assuming your input is a valid UTF-8 string, every code point is encoded as one non-continuation (lead) byte followed by zero to three continuation bytes.

So truncate the string to one byte more than the limit, then cut right before the last non-continuation byte:

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <string_view>

    // Continuation bytes have the bit pattern 10xxxxxx.
    constexpr bool is_continuation_byte(char c) {
        return (static_cast<unsigned char>(c) & 0b11000000u) == 0b10000000u;
    }

    std::string_view shorted_uft8_string(std::string_view str, std::size_t max_byte_size) {
        // Keep one byte past the limit so we can tell whether the limit splits a code point.
        std::string_view result(str.data(), std::min(str.size(), max_byte_size + 1u));
        if (result.size() > max_byte_size) {
            // This loop will run a maximum of 3 times on a valid utf8 string
            while (!result.empty() && is_continuation_byte(result.back())) {
                result.remove_suffix(1);
            }
            if (!result.empty()) {
                // result.back() is the start of a new code point
                result.remove_suffix(1);
                // result.back() is now the end of the last code point
            }
        }
        return result;
    }
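The function above returns a view into the caller's buffer, whereas the question asked for a function that takes and returns std::string. Below is a minimal sketch of an owning variant built on the same backward scan. The name `shortened_utf8_copy` is hypothetical, and the code assumes valid UTF-8 input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Continuation bytes have the bit pattern 10xxxxxx.
constexpr bool is_continuation_byte(char c) {
    return (static_cast<unsigned char>(c) & 0xC0u) == 0x80u;
}

// Hypothetical owning variant matching the signature in the question.
// Assumption: str holds valid UTF-8.
std::string shortened_utf8_copy(std::string str, std::size_t max_length) {
    if (str.size() > max_length) {
        std::size_t cut = max_length;
        // str[cut] is the first byte that would be dropped. While it is a
        // continuation byte, the cut lies inside a code point, so back up
        // until the cut sits on a lead byte (or at the start of the string).
        while (cut > 0 && is_continuation_byte(str[cut])) {
            --cut;
        }
        str.resize(cut);
    }
    assert(str.size() <= max_length);
    return str;
}
```

For example, shortening "h\xC3\xA9llo" to at most 2 bytes yields "h", because the two-byte "é" does not fit in the remaining byte. With a limit of 3 bytes the result is "h\xC3\xA9".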

huangapple
  • Posted on May 29, 2023 at 23:34:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76358606.html