How to shorten a UTF-8 string
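Question
How can I remove the smallest possible suffix from a UTF-8 string so that its size shrinks by at least a given number of bytes? Put differently, I want to fit a string into a fixed-size buffer by discarding as little of the end as possible, without splitting a character. How would I implement the following function?
std::string shorted_utf8_string(std::string str, size_t max_length) {
    std::string result;
    // ... do something
    assert(result.length() <= max_length);
    return result;
}
Basically, the problem is how to deal with variable-length (multi-byte) characters.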
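Answer 1
Score: 2
Boost provides an iterator adaptor, boost::u8_to_u32_iterator, that steps through a UTF-8 string one Unicode code point at a time.
With it you can scan forward for the last code point that still lies entirely within the buffer:
#include <boost/regex/pending/unicode_iterator.hpp>
#include <cstddef>
#include <iterator>
#include <string_view>

std::string_view clamp_utf8(std::string_view s, std::size_t maxBytes)
{
    using Iter = boost::u8_to_u32_iterator<decltype(std::begin(s))>;
    auto i = Iter{s.begin()};
    while (i.base() != s.end()) {
        auto next = i;
        ++next;  // advance by one whole code point
        if (static_cast<std::size_t>(std::distance(s.begin(), next.base())) > maxBytes) {
            break;  // the next code point would not fit
        }
        i = next;
    }
    // i.base() points just past the last code point that fits
    return s.substr(0, static_cast<std::size_t>(std::distance(s.begin(), i.base())));
}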
https://godbolt.org/z/szW4n3vqa
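If Boost is not available, the same forward scan can be sketched with the standard library alone by reading each sequence's length from its lead byte. This is only an illustration under the assumption that the input is valid UTF-8; the names utf8_sequence_length and clamp_utf8_scan are made up for this example and are not part of Boost or of the answer above.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: length in bytes of the UTF-8 sequence introduced by
// lead byte `c`. Assumes valid input; unexpected bytes count as 1 byte.
constexpr std::size_t utf8_sequence_length(unsigned char c) {
    if ((c & 0x80u) == 0x00u) return 1;  // 0xxxxxxx: ASCII
    if ((c & 0xE0u) == 0xC0u) return 2;  // 110xxxxx
    if ((c & 0xF0u) == 0xE0u) return 3;  // 1110xxxx
    if ((c & 0xF8u) == 0xF0u) return 4;  // 11110xxx
    return 1;                            // continuation or invalid byte
}

// Hypothetical name: forward scan that keeps whole code points only.
std::string_view clamp_utf8_scan(std::string_view s, std::size_t maxBytes) {
    std::size_t pos = 0;
    while (pos < s.size()) {
        const std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(s[pos]));
        if (len > s.size() - pos || pos + len > maxBytes) {
            break;  // sequence is cut off in the input, or would not fit
        }
        pos += len;
    }
    return s.substr(0, pos);
}
```

For example, with the 6-byte input "h\xC3\xA9llo" ("héllo") and maxBytes = 2, the scan keeps only "h", because the 2-byte "é" would end at byte 3. Like the Boost version, this walks the whole prefix, so it costs time proportional to maxBytes.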
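Answer 2
Score: 2
Assuming your input is a valid UTF-8 string, every code point is encoded as one non-continuation (lead) byte followed by 0 to 3 continuation bytes.
So truncate the string to one byte more than the limit, then cut right before the last non-continuation byte:
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>

constexpr bool is_continuation_byte(char c) {
    // Continuation bytes have the bit pattern 10xxxxxx
    return (static_cast<unsigned char>(c) & 0b11000000u) == 0b10000000u;
}

std::string_view shorted_utf8_string(std::string_view str, std::size_t max_byte_size) {
    // Keep one extra byte so we can tell whether the cut would split a code point
    std::string_view result(str.data(), std::min(str.size(), max_byte_size+1u));
    if (result.size() > max_byte_size) {
        // This loop will run a maximum of 3 times on a valid UTF-8 string
        while (!result.empty() && is_continuation_byte(result.back())) {
            result.remove_suffix(1);
        }
        if (!result.empty()) {
            // result.back() is the start of a new code point
            result.remove_suffix(1);
            // result.back() is now the end of the last code point
        }
    }
    return result;
}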
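To match the signature asked for in the question, the same continuation-byte idea can be sketched on an owning std::string. This is a variant for illustration, not the answerer's code: the names is_utf8_continuation and shortened_utf8_copy are hypothetical, the input is assumed to be valid UTF-8, and the cut position is found by index instead of by taking one extra byte.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true for bytes of the form 10xxxxxx.
constexpr bool is_utf8_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0u) == 0x80u;
}

// Hypothetical name; assumes `str` holds valid UTF-8.
std::string shortened_utf8_copy(std::string str, std::size_t max_length) {
    if (str.size() <= max_length) {
        return str;  // already fits, nothing to drop
    }
    // str[cut] is the first byte that would be discarded. If it is a
    // continuation byte, the cut splits a code point, so back up until
    // str[cut] is that code point's lead byte.
    std::size_t cut = max_length;
    while (cut > 0 && is_utf8_continuation(str[cut])) {
        --cut;
    }
    str.resize(cut);  // drops the lead byte at `cut` and everything after it
    assert(str.length() <= max_length);
    return str;
}
```

With the input "h\xC3\xA9llo" ("héllo"), a max_length of 2 gives "h" and a max_length of 3 gives "h\xC3\xA9". Only the bytes around the cut are inspected, so the work does not grow with the length of the string.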