
How can I ensure gcc + libc has UTF-8 for multibyte strings and UTF-32 for wchar_t?

Question


I want to know how to force a GCC + GNU libc toolchain into normal Unicode behaviour, where the source code files' encoding is UTF-8, and where the compiled program uses UTF-8 as its multibyte character set and UTF-32LE as its wchar_t, regardless of any locale info.

And I want to be able to know at compile time that it is going to work.

I know the normal answer is to use setlocale(LC_ALL, "en_US.utf8"), but it seems you can only find out at runtime whether that call will succeed, since only the "C" and "POSIX" locales are guaranteed to exist and, unless I'm missing something, you can't compile a locale into your executable.

GCC has these flags -finput-charset=utf-8 -fexec-charset=utf-8 -fwide-exec-charset=utf-32le, but it is unclear how they interact with setlocale(). If I use them, do I still need to call setlocale()? Are they overridden by setlocale()?

It seems like there should be some reliable way to force gcc + libc into normal Unicode behaviour without having to know what locales are preinstalled on the source or target systems.

Answer 1

Score: 2


This is not possible, and you don't want it anyway.

The interfaces defined by locale.h and wchar.h are a decade older than Unicode, and their data model is built around these assumptions:

  1. There are many character sets and encodings, and none of them can necessarily represent all the characters your program might need to be able to handle over its lifetime.
  2. However, any single use of your program will only need to process text from one language, and in one encoding.
  3. Any one installation of the operating system will only need to process text in a small number of languages, knowable at installation time.

All three of these assumptions are invalid nowadays. Instead we have:

  1. There is a single character set (Unicode) whose design goal is to represent all of the world's living written languages (how close we come to achieving that goal depends on who you talk to and how seriously you take Weinreich's Maxim).
  2. There are only a few encodings of all of Unicode to worry about, but data in 8-bit encodings that map to a subset of Unicode is still commonly encountered, and there are dozens of these.
  3. It is normal for a single run of a program to need to process text in multiple languages and in many different encodings. You can usually assume that a single file is all in one encoding, but not that you won't be called upon to merge data from sources in UTF-8, ISO-8859-2, and KOI8-R (for example).
  4. The whole concept of an "installation" (one corporation, one sysadmin, a handful of shared minicomputers, tens or hundreds of lusers) is obsolete, and so is the idea that you won't wake up tomorrow and discover you've received email in a script you'd never even heard of before --- and the computer is still expected to render it correctly and recognize it for machine translation.

Because the data model is no good anymore, the interfaces built on it aren't either. My honest recommendation is that you forget you ever heard of locale.h or any ISO C or POSIX interface that deals in wchar_t. Instead use a third-party library (e.g. ICU) whose data model is a better fit for the modern world.

Types for characters and strings specifically encoded in UTF-n (n=8, 16, 32) have recently been added to the C standard, and in principle they should make this situation better, but I don't have any experience with them, and as far as I can tell the standard library barely takes notice of them.

(For more detail on the failings of the locale.h and/or wchar_t APIs and the present state of efforts to improve the C standard library, see <https://thephd.dev/cuneicode-and-the-future-of-text-in-c>.)

huangapple
  • Posted on August 9, 2023 at 09:57:01
  • Please retain this link when reposting: https://go.coder-hub.com/76864116-2.html