May 29, 2023 23:34:50

How to shorten a UTF-8 string

Question

How can I remove the smallest suffix from a UTF-8 string so that its size is reduced by at least a given number of bytes? Put another way, I would like to fit a string into a buffer by discarding the smallest possible suffix. How can I implement the following function?

std::string shorted_utf8_string(std::string str, size_t max_length) {
    std::string result;
    // ... do something
    assert(result.length() <= max_length);
    return result;
}

Basically, the problem is how to deal with variable-length (multi-byte) characters.
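
To make the problem concrete, here is a minimal sketch of why a plain byte cut is not enough. The helper name `cut_splits_character` is hypothetical, and the input is assumed to be valid UTF-8. It reports whether cutting after `n` bytes would land inside a multi-byte character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if keeping only the first `n` bytes of `str`
// (assumed to be valid UTF-8) would split a multi-byte character, i.e. the
// first dropped byte is a continuation byte of the form 10xxxxxx.
bool cut_splits_character(const std::string& str, std::size_t n) {
    if (n >= str.size()) {
        return false;  // nothing is cut off
    }
    return (static_cast<unsigned char>(str[n]) & 0xC0u) == 0x80u;
}
```

For example, in `"h\xC3\xA9llo"` ("héllo") the character 'é' occupies the two bytes C3 A9, so `str.substr(0, 2)` keeps C3 but drops A9 and is no longer valid UTF-8; `cut_splits_character` returns true for `n == 2` and false for `n == 1` or `n == 3`.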

Answer 1

Score: 2

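Boost provides an iterator that can help you walk a UTF-8 string one Unicode code point at a time.
This way you can scan for the last code point that still fits entirely inside the buffer:

#include <boost/regex/pending/unicode_iterator.hpp>
#include <cstddef>
#include <iterator>
#include <string_view>

// Assumes s is valid UTF-8; the iterator is constructed without bounds checking.
std::string_view clamp_utf8(std::string_view s, std::size_t maxBytes)
{
    using Iter = boost::u8_to_u32_iterator<decltype(std::begin(s))>;
    auto i = Iter{s.begin()};
    while (i.base() != s.end()) {
        auto next = i;
        ++next;  // advance by one whole code point
        // Cast to avoid a signed/unsigned comparison
        if (static_cast<std::size_t>(std::distance(s.begin(), next.base())) > maxBytes)
        {
            break;
        }
        i = next;
    }
    // substr avoids relying on the iterator type being a raw pointer
    return s.substr(0, static_cast<std::size_t>(std::distance(s.begin(), i.base())));
}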

https://godbolt.org/z/szW4n3vqa
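
If Boost is not available, the same forward scan can be sketched with the standard library alone by reading each code point's length from its lead byte. This is an illustrative variant, not the answer's code: the names `utf8_sequence_length` and `clamp_utf8_no_boost` are hypothetical, and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: number of bytes in the code point that starts with
// `lead`, assuming `lead` is a valid UTF-8 lead byte.
constexpr std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80u) return 1;             // 0xxxxxxx: ASCII
    if ((lead & 0xE0u) == 0xC0u) return 2;  // 110xxxxx
    if ((lead & 0xF0u) == 0xE0u) return 3;  // 1110xxxx
    if ((lead & 0xF8u) == 0xF0u) return 4;  // 11110xxx
    return 1;  // not a lead byte: step over it so the scan always advances
}

// Keep whole code points from the front of `s` while they fit in `maxBytes`.
std::string_view clamp_utf8_no_boost(std::string_view s, std::size_t maxBytes) {
    std::size_t kept = 0;  // invariant: kept <= maxBytes
    while (kept < s.size()) {
        const std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(s[kept]));
        // Stop if the next code point is truncated or would exceed the limit.
        if (len > s.size() - kept || len > maxBytes - kept) {
            break;
        }
        kept += len;
    }
    return s.substr(0, kept);
}
```

With the input `"h\xC3\xA9llo"` ("héllo"), a limit of 2 bytes keeps only `"h"`, while a limit of 3 keeps `"h\xC3\xA9"`. Note that a forward scan like this is linear in the number of bytes kept.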

Answer 2

Score: 2

Assuming your input is a valid UTF-8 string, every code point is encoded as one non-continuation byte followed by 0 to 3 continuation bytes.

Just truncate the string to max_byte_size + 1 bytes, then cut right before the last non-continuation byte:

#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>

constexpr bool is_continuation_byte(char c) {
    // Continuation bytes have the bit pattern 10xxxxxx
    return (static_cast<unsigned char>(c) & 0b11000000u) == 0b10000000u;
}

std::string_view shorted_utf8_string(std::string_view str, std::size_t max_byte_size) {
    // Keep one byte more than allowed so the first excluded byte can be inspected
    std::string_view result(str.data(), std::min(str.size(), max_byte_size+1u));
    if (result.size() > max_byte_size) {
        // This loop will run a maximum of 3 times on a valid UTF-8 string
        while (!result.empty() && is_continuation_byte(result.back())) {
            result.remove_suffix(1);
        }
        if (!result.empty()) {
            // result.back() is the start of a new code point
            result.remove_suffix(1);
            // result.back() is now the end of the last code point
        }
    }
    return result;
}
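
To match the signature asked for in the question, which takes and returns `std::string`, the same backward walk can be sketched directly on the string. The name `shortened_utf8_copy` is hypothetical and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical variant with the question's signature. Assumes valid UTF-8.
std::string shortened_utf8_copy(std::string str, std::size_t max_length) {
    if (str.size() <= max_length) {
        return str;  // already fits, nothing to drop
    }
    // str[cut] is the first byte that would be dropped. While it is a
    // continuation byte (10xxxxxx), the cut lies inside a character, so move
    // it back until it sits on that character's lead byte.
    std::size_t cut = max_length;
    while (cut > 0 && (static_cast<unsigned char>(str[cut]) & 0xC0u) == 0x80u) {
        --cut;
    }
    str.resize(cut);
    return str;
}
```

The result never exceeds `max_length` bytes, so the `assert` from the question holds. For instance, `shortened_utf8_copy("h\xC3\xA9llo", 2)` drops the whole two-byte 'é' and returns `"h"`.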
