How to correctly use codecvt_byname (C++17) to encode latin1, and then UTF-8 for use in JSON
Question
I am (desperately) trying to prepare a byte array (copied from a PLC, where they construct the "string" as a byte array, locale/encoding is German, French, etc) for use in nlohmann::json, while preserving the source encoding (latin1).
Using this toy example, the compiler complains about `~codecvt()` and `~codecvt_byname()` being protected:
```
/usr/bin/g++ -O3 -DNDEBUG -std=c++17 -MD -MT CMakeFiles/encod.dir/src/encod.cpp.o -MF CMakeFiles/encod.dir/src/encod.cpp.o.d -o CMakeFiles/encod.dir/src/encod.cpp.o -c /src/encod.cpp
In file included from /usr/include/c++/12/locale:43,
                 from /src/encod.cpp:1:
/usr/include/c++/12/bits/locale_conv.h: In instantiation of ‘std::__detail::_Scoped_ptr<_Tp>::~_Scoped_ptr() [with _Tp = std::codecvt<wchar_t, char, __mbstate_t>]’:
/usr/include/c++/12/bits/locale_conv.h:309:7: required from here
/usr/include/c++/12/bits/locale_conv.h:241:26: error: ‘virtual std::codecvt<wchar_t, char, __mbstate_t>::~codecvt()’ is protected within this context
  241 | ~_Scoped_ptr() { delete _M_ptr; }
      | ^~~~~~~~~~~~~
In file included from /usr/include/c++/12/bits/locale_facets_nonio.h:2067,
                 from /usr/include/c++/12/locale:41:
/usr/include/c++/12/bits/codecvt.h:429:7: note: declared protected here
  429 | ~codecvt();
      | ^
In file included from /usr/include/c++/12/memory:76,
                 from /src/encod.cpp:6:
/usr/include/c++/12/bits/unique_ptr.h: In instantiation of ‘void std::default_delete<_Tp>::operator()(_Tp*) const [with _Tp = std::codecvt_byname<wchar_t, char, __mbstate_t>]’:
/usr/include/c++/12/bits/unique_ptr.h:396:17: required from ‘std::unique_ptr<_Tp, _Dp>::~unique_ptr() [with _Tp = std::codecvt_byname<wchar_t, char, __mbstate_t>; _Dp = std::default_delete<std::codecvt_byname<wchar_t, char, __mbstate_t> >]’
/src/encod.cpp:18:152: required from here
/usr/include/c++/12/bits/unique_ptr.h:95:9: error: ‘std::codecvt_byname<_InternT, _ExternT, _StateT>::~codecvt_byname() [with _InternT = wchar_t; _ExternT = char; _StateT = __mbstate_t]’ is protected within this context
   95 | delete __ptr;
      | ^~~~~~~~~~~~
/usr/include/c++/12/bits/codecvt.h:722:7: note: declared protected here
  722 | ~codecvt_byname() { }
      | ^
```
```c++
#include <locale>
#include <codecvt>
#include <vector>
#include <string>
#include <iostream>
#include <memory>
int main() {
    std::vector<uint8_t> v = {0x68, 0xe4, 0x6c, 0x6c, 0x6f}; // hällo
    std::string my_string(v.begin(), v.end());
    // Convert to wide string
    std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8_conv;
    std::wstring wide_str = utf8_conv.from_bytes(my_string);
    // Convert wide string to Latin1 string
    std::unique_ptr<std::codecvt_byname<wchar_t, char, std::mbstate_t>>
        latin1_cvt(new std::codecvt_byname<wchar_t, char, std::mbstate_t>("iso-8859-1"));
    std::wstring_convert<std::codecvt<wchar_t, char, std::mbstate_t>> latin1_conv(latin1_cvt.get());
    std::string latin1_str = latin1_conv.to_bytes(wide_str);
    std::cout << latin1_str << std::endl;
    return 0;
}
```
How can I make this work? Would I be better off using ICU for this scenario, i.e. am I using it wrong?
Answer 1
Score: 1
Note that the `std::codecvt_utf...` facets from `<codecvt>`, as well as `std::wstring_convert`, have been deprecated since C++17, so you should avoid them in new code. However, they still work in existing implementations.

That said, you are simply using `std::codecvt_byname` incorrectly, which is why you are getting the compiler error.

Unlike the `std::codecvt_utf...` classes, which are meant to be usable by themselves and thus have `public` destructors, `std::codecvt_byname` is a locale-managed facet and so it has a `protected` destructor, which means you cannot destroy a `std::codecvt_byname` object directly. Locale-managed facets are owned by `std::locale`, which destroys any facet that is assigned to it. This is mentioned in the `~codecvt` documentation on cppreference.com:
https://en.cppreference.com/w/cpp/locale/codecvt/%7Ecodecvt
> Destructs a `std::codecvt` facet. This destructor is protected and virtual (due to base class destructor being virtual). An object of type `std::codecvt`, like most facets, can only be destroyed when the last `std::locale` object that implements this facet goes out of scope or if a user-defined class is derived from `std::codecvt` and implements a public destructor.
This means you can't use `std::codecvt_byname` as the direct type held by a `std::unique_ptr`. But, as mentioned above, you can derive a new class from `std::codecvt_byname` and give it a public destructor. This is even demonstrated in the `std::wstring_convert` documentation on cppreference.com:
https://en.cppreference.com/w/cpp/locale/wstring_convert/wstring_convert
> ```c++
> #include <locale>
> #include <utility>
> #include <codecvt>
>
> // utility wrapper to adapt locale-bound facets for wstring/wbuffer convert
> template<class Facet>
> struct deletable_facet : Facet
> {
>     using Facet::Facet; // inherit constructors
>     ~deletable_facet() {}
> };
>
> int main()
> {
>     // UTF-16le / UCS4 conversion
>     std::wstring_convert<
>         std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>
>     > u16to32;
>
>     // UTF-8 / wide string conversion with custom messages
>     std::wstring_convert<std::codecvt_utf8<wchar_t>> u8towide("Error!", L"Error!");
>
>     // GB18030 / wide string conversion facet
>     typedef deletable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>> F;
>     std::wstring_convert<F> gbtowide(new F("zh_CN.gb18030"));
> }
> ```
https://en.cppreference.com/w/cpp/locale/wstring_convert/%7Ewstring_convert
> ```c++
> #include <locale>
> #include <utility>
> #include <codecvt>
>
> // utility wrapper to adapt locale-bound facets for wstring/wbuffer convert
> template<class Facet>
> struct deletable_facet : Facet
> {
>     template<class ...Args>
>     deletable_facet(Args&& ...args) : Facet(std::forward<Args>(args)...) {}
>     ~deletable_facet() {}
> };
>
> int main()
> {
>     // GB18030 / UCS4 conversion, using locale-based facet directly
>     // typedef std::codecvt_byname<char32_t, char, std::mbstate_t> gbfacet_t;
>     // Compiler error: "calling a protected destructor of codecvt_byname<> in ~wstring_convert"
>     // std::wstring_convert<gbfacet_t> gbto32(new gbfacet_t("zh_CN.gb18030"));
>
>     // GB18030 / UCS4 conversion facet using a facet with public destructor
>     typedef deletable_facet<std::codecvt_byname<char32_t, char, std::mbstate_t>> gbfacet_t;
>     std::wstring_convert<gbfacet_t> gbto32(new gbfacet_t("zh_CN.gb18030"));
> } // destructor called here
> ```
Note the use of `deletable_facet<std::codecvt_byname<...>>` in both examples.
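
The access rule behind both examples can be checked at compile time. The following is a minimal sketch that uses only the standard library; `byname_facet` is an alias introduced here purely for illustration, and `deletable_facet` repeats the wrapper from the cppreference examples.

```cpp
#include <cwchar>
#include <locale>
#include <type_traits>

// Same wrapper idea as in the cppreference examples: inherit the facet's
// constructors and expose a public destructor.
template <class Facet>
struct deletable_facet : Facet {
    using Facet::Facet;
    ~deletable_facet() {}
};

// Illustrative alias (not part of the original question or answer).
using byname_facet = std::codecvt_byname<wchar_t, char, std::mbstate_t>;

// The raw facet has a protected destructor, so outside code such as
// std::unique_ptr or std::wstring_convert cannot delete it.
static_assert(!std::is_destructible<byname_facet>::value,
              "codecvt_byname is not destructible from outside the class");

// The wrapper's destructor is public, so it can be deleted normally.
static_assert(std::is_destructible<deletable_facet<byname_facet>>::value,
              "the wrapper is destructible");
```

Nothing is constructed here, so the check does not depend on which named locales are installed on the system.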
Also, note that `std::wstring_convert` takes ownership of the conversion facet that you give it and deletes it in its own destructor, so you must not also manage the facet's lifetime with `std::unique_ptr`; otherwise it would be deleted twice.
Thus, in your example, use this instead:

```c++
// Convert wide string to Latin1 string
using latin1_cvt = deletable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>>;
std::wstring_convert<latin1_cvt> latin1_conv(new latin1_cvt("iso-8859-1"));
std::string latin1_str = latin1_conv.to_bytes(wide_str);
```
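
If the goal is UTF-8 text for JSON, as the question title suggests, a locale-free alternative may be worth considering. The sketch below assumes the PLC bytes really are ISO-8859-1, where every byte value equals its Unicode code point, so each byte can be re-encoded as one or two UTF-8 bytes by hand. `latin1_to_utf8` is a hypothetical helper name, not a standard or nlohmann::json function. Note also that the bytes in the question (`0x68 0xe4 0x6c ...`) are not valid UTF-8, so decoding them with `std::codecvt_utf8` would be expected to fail at run time even once the code compiles.

```cpp
#include <string>

// Hypothetical helper: re-encodes ISO-8859-1 (Latin-1) bytes as UTF-8.
// Assumes the input is Latin-1; no locale or facet is involved.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string utf8;
    utf8.reserve(latin1.size() * 2);  // worst case: two bytes per input byte
    for (unsigned char byte : latin1) {
        if (byte < 0x80) {
            // ASCII range is identical in Latin-1 and UTF-8.
            utf8.push_back(static_cast<char>(byte));
        } else {
            // U+0080..U+00FF encode as two bytes: 110xxxxx 10xxxxxx.
            utf8.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            utf8.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return utf8;
}
```

For the bytes from the question, `latin1_to_utf8(std::string(v.begin(), v.end()))` would turn `0xe4` into the two bytes `0xc3 0xa4`, and the resulting string could then be stored in a JSON value.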