How can I ensure gcc + libc has UTF-8 for multibyte strings and UTF-32 for wchar_t?

Question

I want to know how to force a GCC + GNU libc toolchain into normal Unicode behaviour, where the source files are encoded in UTF-8 and the compiled program uses UTF-8 as its multibyte character set and UTF-32LE as its wchar_t encoding, regardless of any locale information.

And I want to be able to know at compile time that it is going to work.

I know the normal answer is to call setlocale(LC_ALL, "en_US.utf8"), but it seems you can only find out at runtime whether that call will succeed, since only the "C" and "POSIX" locales are guaranteed to exist and, unless I'm missing something, you can't compile a locale into your executable.
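To make the runtime nature of that check concrete, here is a minimal sketch that tries several UTF-8 locale names and verifies the resulting codeset. The helper name select_utf8_ctype and the candidate list are assumptions for illustration only; which names exist depends entirely on the target system.

```c
#include <langinfo.h>  /* nl_langinfo, CODESET (POSIX) */
#include <locale.h>    /* setlocale, LC_CTYPE */
#include <stddef.h>    /* size_t, NULL */
#include <string.h>    /* strcmp */

/* Hypothetical helper: try a few common UTF-8 locale names for LC_CTYPE.
 * Returns the name that was accepted, or NULL if none is installed
 * (in which case LC_CTYPE is reset to "C"). */
static const char *select_utf8_ctype(void)
{
    /* Assumed candidate list; availability varies from system to system. */
    static const char *const candidates[] = {
        "C.UTF-8", "C.utf8", "en_US.UTF-8", "en_US.utf8"
    };

    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        /* setlocale returns NULL when the requested locale is unavailable. */
        if (setlocale(LC_CTYPE, candidates[i]) != NULL &&
            strcmp(nl_langinfo(CODESET), "UTF-8") == 0) {
            return candidates[i];
        }
    }

    setlocale(LC_CTYPE, "C");
    return NULL;
}
```

A program would call select_utf8_ctype() once at startup and treat a NULL result as "no UTF-8 locale available", which is exactly the failure that cannot be ruled out at compile time.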

GCC has the flags -finput-charset=utf-8 -fexec-charset=utf-8 -fwide-exec-charset=utf-32le, but it is unclear how they interact with setlocale(). If I use them, do I still need to call setlocale()? Are they overridden by setlocale()?
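For what it's worth, those flags only control how the compiler encodes string and character literals; the libc conversion functions such as mbstowcs() still follow the LC_CTYPE locale chosen at runtime. The literal encoding itself can be checked at compile time. The following is a sketch under the assumption of a C11 compiler and glibc, which defines __STDC_ISO_10646__; the helper narrow_literal_is_utf8 is a hypothetical name used only for illustration.

```c
#include <assert.h>  /* static_assert (C11) */
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>  /* wchar_t */
#include <string.h>  /* memcmp */

/* glibc defines this macro when wchar_t values are ISO 10646 code points. */
#ifndef __STDC_ISO_10646__
#error "wchar_t is not documented to hold Unicode code points here"
#endif

/* wchar_t must be wide enough to hold any code point in a single unit. */
static_assert(sizeof(wchar_t) * CHAR_BIT >= 32,
              "wchar_t is narrower than 32 bits");

/* In UTF-8, U+00E9 takes 2 bytes and U+20AC takes 3, plus the NUL. */
static_assert(sizeof "\u00e9" == 3, "narrow literals do not look like UTF-8");
static_assert(sizeof "\u20ac" == 4, "narrow literals do not look like UTF-8");

/* Wide character constants should carry the code point value itself. */
static_assert(L'\u00e9' == 0xE9, "wide literals are not code points");
static_assert(L'\U0001F600' == 0x1F600, "wide literals are not code points");

/* Runtime double-check of the actual bytes, e.g. for a test suite. */
static int narrow_literal_is_utf8(void)
{
    static const char euro[] = "\u20ac";
    return memcmp(euro, "\xE2\x82\xAC", sizeof euro) == 0;
}
```

If any of these assertions fails, the build stops, so the literal encoding is known before the program ever runs. None of this says anything about what setlocale() will accept on the target machine.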

It seems like there should be some reliable way to force gcc + libc into normal Unicode behaviour without having to know what locales are preinstalled on the source or target systems.

Answer 1

Score: 2

This is not possible, and you don't want it anyway.

The interfaces defined by locale.h and wchar.h are a decade older than Unicode, and their data model is built around these assumptions:

  1. There are many character sets and encodings, and none of them can necessarily represent all the characters your program might need to be able to handle over its lifetime.
  2. However, any single use of your program will only need to process text from one language, and in one encoding.
  3. Any one installation of the operating system will only need to process text in a small number of languages, knowable at installation time.

All three of these assumptions are invalid nowadays. Instead we have:

  1. There is a single character set (Unicode) whose design goal is to represent all of the world's living written languages (how close we come to achieving that goal depends on who you talk to and how seriously you take Weinreich's Maxim).
  2. There are only a few encodings of all of Unicode to worry about, but data in 8-bit encodings that map to a subset of Unicode is still commonly encountered, and there are dozens of these.
  3. It is normal for a single run of a program to need to process text in multiple languages and in many different encodings. You can usually assume that a single file is all in one encoding, but not that you won't be called upon to merge data from sources in UTF-8, ISO-8859-2, and KOI8-R (for example).
  4. The whole concept of an "installation" (one corporation, one sysadmin, a handful of shared minicomputers, tens or hundreds of lusers) is obsolete, and so is the idea that you won't wake up tomorrow and discover you've received email in a script you'd never even heard of before, which the computer is still expected to render correctly and recognize for machine translation.
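The mixed-encoding case in point 3 can be handled without touching the locale at all by naming each source encoding explicitly. Below is a minimal sketch using the POSIX iconv interface, which glibc provides; the wrapper to_utf8 is a hypothetical helper, and whether a particular encoding name such as "KOI8-R" is available depends on the converters installed on the system.

```c
#include <iconv.h>   /* iconv_open, iconv, iconv_close (POSIX) */
#include <stddef.h>  /* size_t */

/* Hypothetical helper: convert in[0..in_len) from the named encoding to
 * UTF-8, writing at most out_cap bytes to out. Returns the number of
 * bytes written, or (size_t)-1 if the converter is unavailable, the
 * input is invalid, or the output buffer is too small. */
static size_t to_utf8(const char *from_encoding,
                      const char *in, size_t in_len,
                      char *out, size_t out_cap)
{
    iconv_t cd = iconv_open("UTF-8", from_encoding);
    if (cd == (iconv_t)-1) {
        return (size_t)-1;
    }

    /* iconv takes char ** for historical reasons; it does not modify input. */
    char *src = (char *)in;
    char *dst = out;
    size_t in_left = in_len;
    size_t out_left = out_cap;

    size_t rc = iconv(cd, &src, &in_left, &dst, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1) {
        return (size_t)-1;
    }
    return out_cap - out_left;
}
```

For example, to_utf8("ISO-8859-2", data, len, buf, sizeof buf) and to_utf8("KOI8-R", data, len, buf, sizeof buf) can be called in the same run of the program, because the encoding is a per-call argument rather than process-wide state.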

Because the data model is no longer any good, neither are the interfaces. My honest recommendation is that you forget you ever heard of locale.h or any ISO C or POSIX interface that deals in wchar_t. Instead use a third-party library (e.g. ICU) whose data model is a better fit for the modern world.

Types for characters and strings specifically encoded in UTF-n (n=8, 16, 32) have recently been added to the C standard (char16_t and char32_t in C11, char8_t in C23), and in principle they should make this situation better, but I don't have any experience with them, and as far as I can tell the standard library barely takes notice of them.
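As a rough illustration of working with these types without any locale involvement, here is a sketch that encodes a single char32_t code point as UTF-8 by hand. The function utf8_encode is a hypothetical helper written for this example, not part of the standard library.

```c
#include <stddef.h>  /* size_t */
#include <uchar.h>   /* char32_t (C11) */

/* Hypothetical helper: encode one code point as UTF-8 into out.
 * Returns the number of bytes written (1 to 4), or 0 if cp is a
 * surrogate or lies beyond U+10FFFF. Never consults the locale. */
static size_t utf8_encode(char32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;  /* surrogates are not valid scalar values */
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

A call such as utf8_encode(U'\u00e9', buf) works the same way regardless of which locales are installed, because a char32_t literal always holds the code point value when the implementation defines __STDC_UTF_32__, as GCC does.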

(For more detail on the failings of the locale.h and/or wchar_t APIs and the present state of efforts to improve the C standard library, see <https://thephd.dev/cuneicode-and-the-future-of-text-in-c>.)

huangapple
  • Posted on August 9, 2023, 09:57:01
  • When reposting, please keep the link to this post: https://go.coder-hub.com/76864116-2.html