How to Store Emojis in char8_t and Print Them Out in C++20?
Question
I just learned that char8_t, char16_t and char32_t exist, and I am testing them out. When I try to compile the code below, g++ reports the following error:
error: use of deleted function ‘std::basic_ostream<char, _Traits>& std::operator<<(basic_ostream<char, _Traits>&, char32_t) [with _Traits = char_traits<char>]’
6 | std::cout << U'😋' << std::endl;
| ^~~~~
#include <iostream>
int main() {
    char32_t c = U'😋';
    std::cout << c << std::endl;
    return 0;
}
Additionally, why can't I put the emoji into a char8_t or char16_t? For example, the following lines of code don't work:
char16_t c1 = u'😋';
char8_t c2 = u8'😋';
auto c3 = u'😋';
auto c4 = u8'😋';
From my understanding, emojis are UTF-8 characters and should therefore fit into a char8_t.
Answer 1
Score: 5
> emojis are UTF-8 characters
There is no such thing as a "UTF-8 character".
There are Unicode code points. These can be represented in the UTF-8 encoding, such that each code point maps to a sequence of one or more UTF-8 code units: char8_ts. But that means most code points map to multiple char8_ts, in other words a string. And emojis are not among the 128 code points (U+0000 through U+007F) that map to a single UTF-8 code unit.
Emoji in particular can be built out of multiple code points, so even using UTF-32, you cannot guarantee that any given emoji can be stored in a single char32_t.
It's best to treat these things as strings, not characters, at all times. Forget that "characters" even exist.
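To make the code unit counts concrete, here is a minimal sketch that is not part of the original answer. It measures how many code units string literals need in each encoding. The helper name code_units is hypothetical, and the emoji are written as universal character names so the result does not depend on how the source file is saved.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+1F60B (the emoji from the question) in the three encodings.
constexpr std::size_t utf8_units  = code_units(u8"\U0001F60B");
constexpr std::size_t utf16_units = code_units(u"\U0001F60B");
constexpr std::size_t utf32_units = code_units(U"\U0001F60B");

// U+1F44D (thumbs up) followed by U+1F3FD (a skin tone modifier):
// one visible emoji, but two code points even in UTF-32.
constexpr std::size_t modified_utf8_units  = code_units(u8"\U0001F44D\U0001F3FD");
constexpr std::size_t modified_utf32_units = code_units(U"\U0001F44D\U0001F3FD");

static_assert(utf8_units == 4, "U+1F60B needs four UTF-8 code units");
static_assert(utf16_units == 2, "U+1F60B needs a UTF-16 surrogate pair");
static_assert(utf32_units == 1, "U+1F60B is a single UTF-32 code unit");
static_assert(modified_utf8_units == 8, "two code points, four bytes each");
static_assert(modified_utf32_units == 2, "two code points remain two units");
```

Because a character literal holds exactly one code unit, only the plain U+1F60B case in UTF-32 could be stored in a single character object; everything else here has to be a string.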
Answer 2
Score: 3
Code
- Tested in Visual C++ using Windows Terminal: https://github.com/JomaStackOverflowAnswers/EmojiCpp20
- GCC: https://godbolt.org/z/cMbeoGf9a
- Clang: https://godbolt.org/z/EhfdaM61x
#include <cstddef>
#include <iostream>
#include <string>
#ifdef _WIN32
#include <Windows.h>
// Set the console input and output code pages to UTF-8 (Windows only).
#define SET_CONSOLE_UTF8 SetConsoleCP(CP_UTF8); SetConsoleOutputCP(CP_UTF8);
#endif // _WIN32
#if defined(__cpp_char8_t) || defined(__cpp_lib_char8_t)
// operator<< for std::u8string: write the bytes unchanged, using the stored length
// so that embedded null characters do not truncate the output.
std::ostream& operator<<(std::ostream& os, const std::u8string& str) {
    return os.write(reinterpret_cast<const char*>(str.data()), static_cast<std::streamsize>(str.size()));
}
// Convert u8string to string.
std::string ToString(const std::u8string& s) {
    return std::string(s.begin(), s.end());
}
// Convert string to u8string.
std::u8string Tou8String(const std::string& s) {
    return std::u8string(s.begin(), s.end());
}
// Operator ""_s: convert a UTF-8 literal (const char8_t*) to std::string.
// A fresh string is returned on every call; a static local here would keep returning the first literal.
static inline std::string operator""_s(const char8_t* value, std::size_t size) {
    return std::string(reinterpret_cast<const char*>(value), size);
}
#endif
using namespace std::string_literals; // operator ""s
int main() {
#ifdef _WIN32
    SET_CONSOLE_UTF8
#endif
    std::u8string u8String = u8"😋😋😋😋"s; // u8string literal.
    std::string str = u8"😋😋😋😋"_s; // Operator ""_s: UTF-8 literal to std::string.
    std::cout << "string " << str << std::endl; // operator<< for std::string
    std::cout << "u8string -> string " << ToString(u8String) << std::endl; // ToString(u8string) -> string
    std::cout << "u8string " << u8String << std::endl; // operator<< for std::u8string
    std::cout << "string -> u8string " << Tou8String(str) << std::endl; // Tou8String(string) -> u8string
    std::cin.get();
    return 0;
}
Output (Windows Terminal, and https://godbolt.org/ with Clang and GCC):
string 😋😋😋😋
u8string -> string 😋😋😋😋
u8string 😋😋😋😋
string -> u8string 😋😋😋😋
Answer 3
Score: 2
> When I try to compile the code below, g++ throws the following error:
The encoding expected by the narrow and wide standard streams is implementation-dependent and may also depend on what the terminal you are ultimately printing to expects. If you want to print to std::cout or std::wcout, you need to convert your character to the correct encoding as char or wchar_t, respectively.
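As one possible illustration of such a conversion, the sketch below encodes a single code point as UTF-8 bytes in a std::string. It assumes the terminal, and therefore std::cout, expects UTF-8, which is common on Linux and macOS but not guaranteed. encode_utf8 is a hypothetical helper name, not a standard library function.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Returns an empty string for surrogates (U+D800..U+DFFF) and for values
// above U+10FFFF, since neither can be encoded.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                      // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;                    // surrogates are not code points to encode
        }
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this helper, the question's program could print its char32_t with std::cout << encode_utf8(c) << std::endl; after including <iostream>. An emoji made of several code points would need each code point encoded and concatenated.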
> Additionally, why can't I put the emoji into a char8_t or char16_t? For example, the following lines of code don't work:
The emoji is Unicode code point U+1F60B, which requires multiple code units in both UTF-8 (four) and UTF-16 (two). But you are trying to form a character literal, which holds only one code unit.
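To show where the two UTF-16 code units come from, here is a small sketch of the surrogate pair computation. encode_utf16 is a hypothetical helper, and it assumes its argument is a valid code point (not a surrogate, and no larger than U+10FFFF).

```cpp
#include <string>

// Hypothetical helper: encode one code point as UTF-16 code units.
// Assumes cp is a valid Unicode scalar value.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Code points in the Basic Multilingual Plane fit in one code unit.
        out += static_cast<char16_t>(cp);
    } else {
        // Everything above needs a surrogate pair: subtract 0x10000 and
        // split the remaining 20 bits into two 10-bit halves.
        const char32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    }
    return out;
}
```

For U+1F60B this yields the pair 0xD83D 0xDE0B, which is why u'😋' cannot be a single char16_t.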
> From my understanding, emojis are UTF-8 characters [...]
That doesn't make sense. UTF-8 is an encoding for Unicode code points. It doesn't make sense to say that a character "is UTF-8". This suggests a fundamental misunderstanding of how Unicode and character/string encodings work in general. I would suggest reading an introduction to the topic.
Answer 4
Score: 2
This works:
#include <iostream>
int main() {
    const char* c = "😋";
    std::cout << c << std::endl;
    return 0;
}
Explanation:
- 😋 is a multibyte sequence and does not fit in a single char. Thus const char* should be used.
- GCC and Clang assume UTF-8 source files by default, so the emoji in the string literal is stored as its UTF-8 bytes. For char32_t, writing the literal as U'\x1F60B' avoids depending on the source file encoding.
- In C++20, operator<<(std::basic_ostream) is deleted for char8_t, char16_t and char32_t.
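As a rough illustration of the first point, the sketch below counts the code points in a UTF-8 encoded std::string by skipping continuation bytes. count_code_points is a hypothetical helper, and it assumes the input is well-formed UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-8 encoded string.
// Assumes well-formed input. Every code point starts with exactly one byte
// that is NOT a continuation byte (continuation bytes look like 10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the four bytes of 😋 (0xF0 0x9F 0x98 0x8B), std::string::size() reports 4 while this helper reports 1, which is why the emoji needs a string rather than a single char.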