How to Store Emojis in char8_t and Print Them Out in C++20?

Question

I just now heard about the existence of char8_t, char16_t and char32_t and I am testing it out. When I try to compile the code below, g++ throws the following error:

error: use of deleted function ‘std::basic_ostream<char, _Traits>& std::operator<<(basic_ostream<char, _Traits>&, char32_t) [with _Traits = char_traits<char>]’
    6 |         std::cout << U'😋' << std::endl;
      |                      ^~~~~

#include <iostream>

int main() {
  char32_t c = U'😋';

  std::cout << c << std::endl;

  return 0;
}

Additionally, why can't I put the emoji into a char8_t or char16_t? For example, the following lines of code don't work:

char16_t c1 = u'😋';
char8_t c2 = u8'😋';
auto c3 = u'😋';
auto c4 = u8'😋';

From my understanding, emojis are UTF-8 characters and should therefore fit into a char8_t.

Answer 1

Score: 5

> emojis are UTF-8 characters

There is no such thing as a "UTF-8 character".

There are Unicode codepoints. These can be represented in the UTF-8 encoding, such that each codepoint maps to a sequence of one or more UTF-8 code units: char8_ts. But that means that most codepoints map to multiple char8_ts, that is, a string. And emojis are not among the 128 codepoints (U+0000 through U+007F) that map to a single UTF-8 code unit.

Emoji in particular can be built out of multiple codepoints, so even using UTF-32, you cannot guarantee that any emoji could be stored in a single char32_t codepoint.

It's best to treat these things as strings, not characters, at all times. Forget that "characters" even exist.

Answer 2

Score: 3

Tested with Visual C++ in Windows Terminal.

GCC: [https://godbolt.org/z/cMbeoGf9a][3]
Clang: [https://godbolt.org/z/EhfdaM61x][4]

[1]: https://learn.microsoft.com/en-us/windows/terminal/install
[2]: https://github.com/JomaStackOverflowAnswers/EmojiCpp20
[3]: https://godbolt.org/z/cMbeoGf9a
[4]: https://godbolt.org/z/EhfdaM61x
[5]: https://godbolt.org/
[6]: https://i.stack.imgur.com/fo3Mn.png
[7]: https://i.stack.imgur.com/MTClD.png
[8]: https://i.stack.imgur.com/bcLdT.png

Code

#include <cstddef>
#include <iostream>
#include <string>

#ifdef _WIN32
#include <Windows.h>
// Set console input and output to UTF-8 (Visual C++ on Windows).
#define SET_CONSOLE_UTF8 SetConsoleCP(CP_UTF8); SetConsoleOutputCP(CP_UTF8);
#endif // _WIN32

#if defined(__cpp_char8_t) || defined(__cpp_lib_char8_t)

// Operator << for std::u8string: write the UTF-8 bytes unchanged.
std::ostream& operator<<(std::ostream& os, const std::u8string& str)
{
	os.write(reinterpret_cast<const char*>(str.data()), static_cast<std::streamsize>(str.size()));
	return os;
}

// Convert u8string to string.
std::string ToString(const std::u8string& s) {
	return std::string(s.begin(), s.end());
}

// Convert string to u8string.
std::u8string Tou8String(const std::string& s) {
	return std::u8string(s.begin(), s.end());
}

// Operator ""_s: convert a UTF-8 literal (const char8_t*) to std::string.
// A fresh string is returned on every call; a static local here would
// keep returning the value of the first literal ever converted.
inline std::string operator""_s(const char8_t* value, std::size_t size) {
	return std::string(reinterpret_cast<const char*>(value), size);
}

#endif

using namespace std::string_literals; // operator ""s

int main() {
#ifdef _WIN32
	SET_CONSOLE_UTF8
#endif

	std::u8string u8String = u8"😋😋😋😋"s; // u8string literal.
	std::string str = u8"😋😋😋😋"_s; // Operator ""_s: UTF-8 literal (const char8_t*) to std::string.

	std::cout << "string              " << str << std::endl; // Using operator << for std::string.
	std::cout << "u8string -> string  " << ToString(u8String) << std::endl; // Using function ToString(u8string) -> string.
	std::cout << "u8string            " << u8String << std::endl; // Using operator << for std::u8string.
	std::cout << "string -> u8string  " << Tou8String(str) << std::endl; // Using function Tou8String(string) -> u8string.

	std::cin.get();
	return 0;
}

Output in Windows Terminal and on https://godbolt.org/ (Clang and GCC)

string              😋😋😋😋
u8string -> string  😋😋😋😋
u8string            😋😋😋😋
string -> u8string  😋😋😋😋

Answer 3

Score: 2

> When I try to compile the code below, g++ throws the following error:

The encoding expected by the narrow and wide standard streams is implementation-dependent and may also depend on what the terminal you are ultimately printing to expects. You need to convert your character to the correct encoding as either char or wchar_t type if you want to print to std::cout or std::wcout respectively.

> Additionally, why can't I put the emoji into a char8_t or char16_t? For example, the following lines of code don't work:

The emoji is Unicode code point U+1F60B, which requires multiple code units in both UTF-8 and UTF-16. But you are trying to form a character literal, which holds only one code unit.

> From my understanding, emojis are UTF-8 characters [...]

That doesn't make sense. UTF-8 is an encoding for Unicode code points. It doesn't make sense to say that a character "is UTF-8". This shows that you might have fundamental misunderstandings about how Unicode and character/string encodings in general work. I would suggest you read some introduction on the topic.

Answer 4

Score: 2

This works:

#include <iostream>

int main() {
  const char* c = "😋";

  std::cout << c << std::endl;

  return 0;
}

Explanation:

  1. 😋 is a multibyte sequence and does not fit in a single char, so const char* should be used.
  2. The default source file encoding is UTF-8, so Unicode characters can be used only in UTF-8 encoding. For char32_t, it should be written as U'\x1F60B'.
  3. operator<<(std::basic_ostream) is deleted for char8_t, char16_t and char32_t.

huangapple
  • Posted on 2023-02-27 05:52:17
  • Source: https://go.coder-hub.com/75575229.html