How to shorten a UTF-8 string

Question


How can I remove the smallest possible suffix from a UTF-8 string so that its size shrinks by at least a given number of bytes? Put differently, I want to fit a string into a buffer by discarding the smallest possible suffix. How would I implement the following function?

std::string shorted_uft8_string(std::string str, size_t max_length) {
    std::string result;
    // ... do something

    assert(result.length() <= max_length);
    return result;
}

Basically, the problem is how to deal with variable-length (multi-byte) characters.

Answer 1

Score: 2


Boost provides an iterator adaptor, boost::u8_to_u32_iterator, that steps through a UTF-8 string one Unicode code point at a time.
With it you can scan forward to find the last code point that still fits entirely inside the buffer:

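#include <boost/regex/pending/unicode_iterator.hpp>
#include <cstddef>
#include <iterator>
#include <string_view>

// Assumes s holds valid UTF-8.
std::string_view clamp_utf8(std::string_view s, std::size_t maxBytes)
{
    using Iter = boost::u8_to_u32_iterator<std::string_view::const_iterator>;

    // i always sits on a code point boundary that fits within maxBytes
    auto i = Iter{s.begin()};
    while (i.base() != s.end()) {
        auto next = i;
        ++next;  // advance by one whole code point
        if (static_cast<std::size_t>(std::distance(s.begin(), next.base())) > maxBytes)
        {
            break;  // the next code point would cross the byte limit
        }
        i = next;
    }
    // substr avoids relying on the iterator type being a raw pointer
    return s.substr(0, static_cast<std::size_t>(std::distance(s.begin(), i.base())));
}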

https://godbolt.org/z/szW4n3vqa
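
If Boost is not available, the same forward scan can be sketched with the standard library alone by reading the sequence length from each lead byte. This is only a rough sketch, not part of the answer above: the names utf8_sequence_length and clamp_utf8_scan are made up for illustration, and the code assumes the input is valid UTF-8 (it does not validate continuation bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Length in bytes of the UTF-8 sequence introduced by lead byte b.
// Returns 1 for ASCII and also for unexpected bytes, so the scan always advances.
constexpr std::size_t utf8_sequence_length(unsigned char b) {
    if ((b & 0xE0u) == 0xC0u) return 2;  // 110xxxxx
    if ((b & 0xF0u) == 0xE0u) return 3;  // 1110xxxx
    if ((b & 0xF8u) == 0xF0u) return 4;  // 11110xxx
    return 1;                            // 0xxxxxxx, or a stray byte
}

// Hypothetical Boost-free counterpart of clamp_utf8; assumes valid UTF-8.
std::string_view clamp_utf8_scan(std::string_view s, std::size_t maxBytes) {
    std::size_t end = 0;  // bytes accepted so far, always on a code point boundary
    while (end < s.size()) {
        const std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(s[end]));
        // Stop if the next code point is truncated or would exceed the budget.
        if (len > s.size() - end || len > maxBytes - end) {
            break;
        }
        end += len;
    }
    assert(end <= maxBytes && end <= s.size());
    return s.substr(0, end);
}
```

For a string made of a 1-byte, a 2-byte and a 3-byte character, a limit of 5 bytes keeps only the first two characters (3 bytes), because the 3-byte character would not fit completely.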

Answer 2

Score: 2


Assuming your input is valid UTF-8, every code point is encoded as one non-continuation (lead) byte followed by 0 to 3 continuation bytes.

So truncate the string to one byte more than the limit, then cut right before the last non-continuation byte:

#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>

// Continuation bytes have the bit pattern 10xxxxxx
constexpr bool is_continuation_byte(char c) {
    return (static_cast<unsigned char>(c) & 0b11000000u) == 0b10000000u;
}

std::string_view shorted_uft8_string(std::string_view str, std::size_t max_byte_size) {
    // Keep one byte past the limit so we can tell whether the cut splits a code point
    std::string_view result(str.data(), std::min(str.size(), max_byte_size+1u));
    if (result.size() > max_byte_size) {
        // This loop will run a maximum of 3 times on a valid utf8 string
        while (!result.empty() && is_continuation_byte(result.back())) {
            result.remove_suffix(1);
        }
        if (!result.empty()) {
            // result.back() is the start of a new code point
            result.remove_suffix(1);
            // result.back() is now the end of the last code point
        }
    }
    return result;
}
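
The question asks for a function that takes and returns a std::string, so here is a minimal sketch of the same idea in that shape. It is an illustration under assumptions, not the answerer's code: the names is_utf8_continuation and shortened_utf8_string are made up here, and the input is assumed to be valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
constexpr bool is_utf8_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0u) == 0x80u;
}

// Hypothetical owning variant with the signature from the question.
// Assumes str holds valid UTF-8.
std::string shortened_utf8_string(std::string str, std::size_t max_length) {
    if (str.size() > max_length) {
        std::size_t cut = max_length;
        // str[cut] is the first byte that would be dropped. If it is a
        // continuation byte, the cut lies inside a code point, so back up
        // until cut points at that code point's lead byte.
        while (cut > 0 && is_utf8_continuation(str[cut])) {
            --cut;
        }
        str.resize(cut);  // drops the partial code point and everything after it
    }
    assert(str.size() <= max_length);
    return str;
}
```

Because the string is taken by value and shrunk in place, a caller that passes an rvalue (for example via std::move) pays for no extra copy. If the caller only needs to look at the prefix, the std::string_view version above avoids allocation altogether.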

Posted by huangapple on May 29, 2023 at 23:34:50
Original link: https://go.coder-hub.com/76358606.html