How can I ensure gcc + libc has UTF-8 for multibyte strings and UTF-32 for wchar_t?

Question

I want to know how to force a GCC + GNU libc toolchain into normal Unicode behaviour, where the source files are encoded in UTF-8 and the compiled program uses UTF-8 as its multibyte character set and UTF-32LE as its wchar_t encoding, regardless of any locale info.

And I want to be able to know at compile time that it is going to work.

I know the normal answer is to use setlocale(LC_ALL, "en_US.utf8"), but it seems you can only know at runtime whether setlocale(LC_ALL, "en_US.utf8") is going to work, since only the "C" and "POSIX" locales are guaranteed to exist and, unless I'm missing something, you can't compile a locale into your executable.

GCC has these flags: -finput-charset=utf-8 -fexec-charset=utf-8 -fwide-exec-charset=utf-32le, but it is unclear how they interact with setlocale(). If I use them, do I still need to call setlocale()? Are they overridden by setlocale()?

It seems like there should be some reliable way to force gcc + libc into normal Unicode behaviour without having to know what locales are preinstalled on the source or target systems.

Answer 1

Score: 2

This is not possible, and you don't want it anyway.

The interfaces defined by locale.h and wchar.h are a decade older than Unicode, and their data model is built around these assumptions:

  1. There are many character sets and encodings, and none of them can necessarily represent all the characters your program might need to be able to handle over its lifetime.
  2. However, any single use of your program will only need to process text from one language, and in one encoding.
  3. Any one installation of the operating system will only need to process text in a small number of languages, knowable at installation time.

All three of these assumptions are invalid nowadays. Instead we have:

  1. There is a single character set (Unicode) whose design goal is to represent all of the world's living written languages (how close we come to achieving that goal depends on who you talk to and how seriously you take Weinreich's Maxim).
  2. There are only a few encodings of all of Unicode to worry about, but data in 8-bit encodings that map to a subset of Unicode is still commonly encountered, and there are dozens of these.
  3. It is normal for a single run of a program to need to process text in multiple languages and in many different encodings. You can usually assume that a single file is all in one encoding, but not that you won't be called upon to merge data from sources in UTF-8, ISO-8859-2, and KOI8-R (for example).
  4. The whole concept of an "installation" (one corporation, one sysadmin, a handful of shared minicomputers, tens or hundreds of lusers) is obsolete, and so is the idea that you won't wake up tomorrow and discover you've received email in a script you'd never even heard of before --- and the computer is still expected to render it correctly and recognize it for machine translation.

Because the data model is no good anymore, so too are the interfaces. My honest recommendation is that you forget you ever heard of locale.h or any ISO C or POSIX interface that deals in wchar_t. Instead use a third-party library (e.g. ICU) whose data model is a better fit for the modern world.

Types for characters and strings specifically encoded in UTF-n (n=8, 16, 32) have recently been added to the C standard, and in principle they should make this situation better, but I don't have any experience with them, and as far as I can tell the standard library barely takes notice of them.

(For more detail on the failings of the locale.h and/or wchar_t APIs and the present state of efforts to improve the C standard library, see <https://thephd.dev/cuneicode-and-the-future-of-text-in-c>.)

huangapple
  • Posted on August 9, 2023, 09:57:01
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76864116.html