2023-03-31 21:54:19

How to correctly use codecvt_byname (C++17) to encode latin1, and then UTF-8 for use in JSON

Question

I am (desperately) trying to prepare a byte array (copied from a PLC, where the "string" is constructed as a byte array; the locale/encoding is German, French, etc.) for use in nlohmann::json, while preserving the source encoding (latin1).

Using this toy example, the compiler complains about ~codecvt() and ~codecvt_byname() being protected:

/usr/bin/g++   -O3 -DNDEBUG -std=c++17 -MD -MT CMakeFiles/encod.dir/src/encod.cpp.o -MF CMakeFiles/encod.dir/src/encod.cpp.o.d -o CMakeFiles/encod.dir/src/encod.cpp.o -c /src/encod.cpp
In file included from /usr/include/c++/12/locale:43,
                 from /src/encod.cpp:1:
/usr/include/c++/12/bits/locale_conv.h: In instantiation of ‘std::__detail::_Scoped_ptr<_Tp>::~_Scoped_ptr() [with _Tp = std::codecvt<wchar_t, char, __mbstate_t>]’:
/usr/include/c++/12/bits/locale_conv.h:309:7:   required from here
/usr/include/c++/12/bits/locale_conv.h:241:26: error: ‘virtual std::codecvt<wchar_t, char, __mbstate_t>::~codecvt()’ is protected within this context
  241 |         ~_Scoped_ptr() { delete _M_ptr; }
      |                          ^~~~~~~~~~~~~
In file included from /usr/include/c++/12/bits/locale_facets_nonio.h:2067,
                 from /usr/include/c++/12/locale:41:
/usr/include/c++/12/bits/codecvt.h:429:7: note: declared protected here
  429 |       ~codecvt();
      |       ^
In file included from /usr/include/c++/12/memory:76,
                 from /src/encod.cpp:6:
/usr/include/c++/12/bits/unique_ptr.h: In instantiation of ‘void std::default_delete<_Tp>::operator()(_Tp*) const [with _Tp = std::codecvt_byname<wchar_t, char, __mbstate_t>]’:
/usr/include/c++/12/bits/unique_ptr.h:396:17:   required from ‘std::unique_ptr<_Tp, _Dp>::~unique_ptr() [with _Tp = std::codecvt_byname<wchar_t, char, __mbstate_t>; _Dp = std::default_delete<std::codecvt_byname<wchar_t, char, __mbstate_t> >]’
/src/encod.cpp:18:152:   required from here
/usr/include/c++/12/bits/unique_ptr.h:95:9: error: ‘std::codecvt_byname<_InternT, _ExternT, _StateT>::~codecvt_byname() [with _InternT = wchar_t; _ExternT = char; _StateT = __mbstate_t]’ is protected within this context
   95 |         delete __ptr;
      |         ^~~~~~~~~~~~
/usr/include/c++/12/bits/codecvt.h:722:7: note: declared protected here
  722 |       ~codecvt_byname() { }
      |       ^

#include <locale>
#include <codecvt>
#include <vector>
#include <string>
#include <iostream>
#include <memory>

int main() {
    std::vector<uint8_t> v = {0x68, 0xe4, 0x6c, 0x6c, 0x6f}; // hällo

    std::string my_string(v.begin(), v.end());

    // Convert to wide string
    std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8_conv;
    std::wstring wide_str = utf8_conv.from_bytes(my_string);

    // Convert wide string to Latin1 string
    std::unique_ptr<std::codecvt_byname<wchar_t, char, std::mbstate_t>> 
            latin1_cvt(new std::codecvt_byname<wchar_t, char, std::mbstate_t>("iso-8859-1"));
    std::wstring_convert<std::codecvt<wchar_t, char, std::mbstate_t>> latin1_conv(latin1_cvt.get());
    std::string latin1_str = latin1_conv.to_bytes(wide_str);


    std::cout << latin1_str << std::endl;

    return 0;
}

How can I make this work? Would I be better off using ICU for this scenario, i.e. am I holding (using) it wrong?

Answer 1

Score: 1

Note that most of the std::codecvt_... types, along with std::wstring_convert itself, have been deprecated since C++17, so you should avoid them in new code. However, existing implementations still provide them and they do still work.

That said, you are simply using std::codecvt_byname wrong, which is why you are getting the compiler error.

Unlike the std::codecvt_utf... classes, which are meant to be usable by themselves and thus have public destructors, std::codecvt_byname is a locale-managed facet and so it has a protected destructor, which means you cannot destroy a std::codecvt_byname object directly. Locale-managed facets are owned by std::locale, and it will destroy any facet that is assigned to it. This is mentioned in the ~codecvt documentation on cppreference.com:

https://en.cppreference.com/w/cpp/locale/codecvt/%7Ecodecvt

> Destructs a std::codecvt facet. This destructor is protected and virtual (due to base class destructor being virtual). An object of type std::codecvt, like most facets, can only be destroyed when the last std::locale object that implements this facet goes out of scope or if a user-defined class is derived from std::codecvt and implements a public destructor.

This means you can't use std::codecvt_byname as the direct type held by a std::unique_ptr. But, as mentioned above, you can derive a new class from std::codecvt_byname and give it a public destructor. This is even demonstrated in the std::wstring_convert documentation on cppreference.com:

https://en.cppreference.com/w/cpp/locale/wstring_convert/wstring_convert

> ```c++
> #include <locale>
> #include <utility>
> #include <codecvt>
>
> // utility wrapper to adapt locale-bound facets for wstring/wbuffer convert
> template<class Facet>
> struct deletable_facet : Facet
> {
>     using Facet::Facet; // inherit constructors
>     ~deletable_facet() {}
> };
>
> int main()
> {
>     // UTF-16le / UCS4 conversion
>     std::wstring_convert<
>         std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>
>     > u16to32;
>
>     // UTF-8 / wide string conversion with custom messages
>     std::wstring_convert<std::codecvt_utf8<wchar_t>> u8towide("Error!", L"Error!");
>
>     // GB18030 / wide string conversion facet
>     typedef deletable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>> F;
>     std::wstring_convert<F> gbtowide(new F("zh_CN.gb18030"));
> }
> ```

https://en.cppreference.com/w/cpp/locale/wstring_convert/%7Ewstring_convert

> ```c++
> #include <locale>
> #include <utility>
> #include <codecvt>
>
> // utility wrapper to adapt locale-bound facets for wstring/wbuffer convert
> template<class Facet>
> struct deletable_facet : Facet
> {
>     template<class ...Args>
>     deletable_facet(Args&& ...args) : Facet(std::forward<Args>(args)...) {}
>     ~deletable_facet() {}
> };
>
> int main()
> {
>     // GB18030 / UCS4 conversion, using locale-based facet directly
>     // typedef std::codecvt_byname<char32_t, char, std::mbstate_t> gbfacet_t;
>     // Compiler error: "calling a protected destructor of codecvt_byname<> in ~wstring_convert"
>     // std::wstring_convert<gbfacet_t> gbto32(new gbfacet_t("zh_CN.gb18030"));
>
>     // GB18030 / UCS4 conversion facet using a facet with public destructor
>     typedef deletable_facet<std::codecvt_byname<char32_t, char, std::mbstate_t>> gbfacet_t;
>     std::wstring_convert<gbfacet_t> gbto32(new gbfacet_t("zh_CN.gb18030"));
> } // destructor called here
> ```

Note the use of deletable_facet<std::codecvt_byname<...>> in both examples.

Also, note that std::wstring_convert takes ownership of the conversion facet that you give it, so you cannot use std::unique_ptr to manage its lifetime.
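
The ownership rule can be checked with a small experiment. This is only a sketch under assumptions, not code from the cppreference pages: `counting_facet` and `facets_destroyed_by_converter` are hypothetical names, and the facet derives from plain std::codecvt so that no named locale has to be installed on the machine.

```cpp
#include <cwchar>   // std::mbstate_t
#include <locale>   // std::codecvt, std::wstring_convert

// Hypothetical facet for this sketch: counts how many instances were destroyed.
struct counting_facet : std::codecvt<wchar_t, char, std::mbstate_t> {
    static int destroyed;
    // Public destructor, so std::wstring_convert is allowed to delete the facet.
    ~counting_facet() override { ++destroyed; }
};
int counting_facet::destroyed = 0;

// Returns how many facets were destroyed over the lifetime of one converter.
int facets_destroyed_by_converter() {
    counting_facet::destroyed = 0;
    {
        // The converter takes ownership of the raw pointer;
        // wrapping the same pointer in std::unique_ptr would delete it twice.
        std::wstring_convert<counting_facet> conv(new counting_facet);
    }   // ~wstring_convert deletes the facet here
    return counting_facet::destroyed;
}
```

Calling `facets_destroyed_by_converter()` should report exactly one destruction, performed by the converter itself, which is why a second owner such as std::unique_ptr must not be involved.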

Thus, in your example, use this instead:

// Convert wide string to Latin1 string
using latin1_cvt = deletable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>>;
std::wstring_convert<latin1_cvt> latin1_conv(new latin1_cvt("iso-8859-1"));
std::string latin1_str = latin1_conv.to_bytes(wide_str);
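
For the end goal in the question (getting the PLC bytes into nlohmann::json, which expects UTF-8 strings), a locale facet may not be needed at all. The following is a minimal sketch, not part of the original answer, and it assumes the bytes really are ISO-8859-1. In that encoding every byte value equals its Unicode code point, so each byte maps to one or two UTF-8 bytes. The name `latin1_to_utf8` is hypothetical. This approach also sidesteps the question of whether "iso-8859-1" is accepted as a locale name, which can vary by platform.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: converts ISO-8859-1 bytes to a UTF-8 encoded std::string.
// Assumes true Latin-1 input, where byte value == Unicode code point (U+0000..U+00FF).
std::string latin1_to_utf8(const std::vector<std::uint8_t>& bytes) {
    std::string out;
    out.reserve(bytes.size() * 2);  // worst case: every byte becomes two UTF-8 bytes
    for (std::uint8_t b : bytes) {
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));                  // ASCII stays one byte
        } else {
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));    // lead byte: 110000xx
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));  // continuation: 10xxxxxx
        }
    }
    return out;
}
```

With the bytes from the question, `latin1_to_utf8({0x68, 0xe4, 0x6c, 0x6c, 0x6f})` yields the six bytes `68 C3 A4 6C 6C 6F`, which is "hällo" in UTF-8 and can be assigned to a JSON value. If the PLC actually emits Windows-1252 rather than true Latin-1, the bytes 0x80 to 0x9F would need a lookup table instead.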
