UTF-8 does not print characters to the console

# Question

I have the following code

```java
import java.util.Arrays;

public class MainDefault {
    public static void main(String[] args) {
        // Print the string, then its bytes in the JVM's default charset
        System.out.println("²³");
        System.out.println(Arrays.toString("²³".getBytes()));
    }
}
```

but I can't seem to print the special characters to the console.

When I compile and run it as follows, I get this result:

```shell
$ javac MainDefault.java
$ java MainDefault
```

[![MainDefaultPrinting][1]][1]

On the other hand, when I compile it with an explicit source encoding and run it like this:

```shell
$ javac -encoding UTF8 MainDefault.java
$ java MainDefault
```

[![MainDefaultUTF8CompilationOnly][2]][2]

And when I run it using the file encoding UTF8 flag, I get the following:

```shell
$ java -Dfile.encoding=UTF8 MainDefault
```

[![MainDefaultUTF8CompilationAndRun][3]][3]

It doesn't seem to be a problem with the console (Git Bash on Windows 10), as it prints the characters normally:

[![Echo][4]][4]

Thanks for your help

[1]: https://i.stack.imgur.com/YqKEd.png
[2]: https://i.stack.imgur.com/wrvzm.png
[3]: https://i.stack.imgur.com/PSafY.png
[4]: https://i.stack.imgur.com/URPsy.png
# Answer 1
**Score**: 12
Your code is not printing the right characters in the console because your Java program and the console are using different character sets (different encodings).
If you want to obtain the same characters, you first need to determine which character sets are in use.
This process depends on the "console" to which you are writing your results.
If you are working with Windows and ```cmd```, as @RickJames suggested, you can use the ```chcp``` command to determine the active code page.
Oracle documents the encodings that Java fully supports, and their correspondence with other aliases (code pages, in this case), on [this page](https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html).
[This stackoverflow answer](https://stackoverflow.com/questions/46047993-is-there-any-mapping-table-between-windows-code-and-java-charset) also provides some guidance about the mapping between Windows code pages and Java charsets.
As you can see in the provided links, the code page for ```UTF-8``` is ```65001```.
If you are using Git Bash (MinTTY), you can follow @kriegaex's instructions to verify or configure ```UTF-8``` as the terminal emulator encoding.
Linux and UNIX, or UNIX-derived systems like Mac OS, do not use code page identifiers, but locales. The locale information can vary between systems, but you can either use the ```locale``` command or inspect the ```LC_*``` system variables to find the required information.
This is the output of the ```locale``` command on my system:
```shell
LANG="es_ES.UTF-8"
LC_COLLATE="es_ES.UTF-8"
LC_CTYPE="es_ES.UTF-8"
LC_MESSAGES="es_ES.UTF-8"
LC_MONETARY="es_ES.UTF-8"
LC_NUMERIC="es_ES.UTF-8"
LC_TIME="es_ES.UTF-8"
LC_ALL=
```
Once you know this information, you need to run your Java program with the ```file.encoding``` VM option corresponding to the right charset:

```shell
java -Dfile.encoding=UTF8 MainDefault
```
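
To check what the JVM actually picked up, it can help to print the relevant settings from inside the program. The following is a minimal sketch; the class name `CharsetReport` and its `describe` helper are made up for illustration, and the values you see will depend on your JDK version and platform.

```java
import java.nio.charset.Charset;

// Hypothetical helper: reports the charset settings the JVM started with.
class CharsetReport {
    // Builds the report as a string so it can be printed or logged.
    static String describe() {
        String nl = System.lineSeparator();
        return "Charset.defaultCharset() = " + Charset.defaultCharset() + nl
             + "file.encoding            = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once without and once with `-Dfile.encoding=UTF8` should show whether the option changes the default charset on your setup.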
Some classes, like ```PrintStream``` or ```PrintWriter```, allow you to specify the ```Charset``` in which the information will be output.
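
As an illustration of that point, here is a small sketch that sends a string through a `PrintStream` created with an explicit charset name and captures the resulting bytes. The class `ExplicitCharsetPrint` and the method `encodeWith` are hypothetical names used only for this example; the in-memory sink stands in for the console.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

class ExplicitCharsetPrint {
    // Writes text through a PrintStream that encodes with the named charset
    // and returns the bytes that would have been sent to the output.
    static byte[] encodeWith(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // "\u00B2\u00B3" is the string "²³", written with escapes so that
        // the source file encoding does not matter here.
        System.out.println(Arrays.toString(encodeWith("\u00B2\u00B3", "UTF-8")));
        // prints [-62, -78, -62, -77]
    }
}
```

The four bytes are the UTF-8 sequence C2 B2 C2 B3, independent of the JVM's default charset.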
The ```-encoding``` option of ```javac``` only allows you to specify the character encoding used by the source files.
If you are using Windows with Git Bash, consider also reading @rmunge's answer: it describes a possible bug in the tool that may be the cause of the problem and that prevents the terminal from working correctly out of the box, without manual encoding adjustments.
# Answer 2
**Score**: 5
I am also using Git Bash on Windows 10, and it works totally fine for me.
Here's how it prints:
The terminal version is ```mintty 3.0.2 (x86_64-pc-msys)``` and my text properties were:
So I tried to reproduce your output by changing the character set.
By setting the character set to ```CP437 (OEM codepage)``` (note that this automatically changed the locale to ```C``` too), I was able to get the same output as you.
Then, after I changed it back to ```UTF-8 (Unicode)```, I got the expected output!
Therefore, it is clear that the problem is with your console's character set.
# Answer 3
**Score**: 5
The short version:
The unexpected behavior is reproducible with the following setup:
- Windows 10 with English, German or French language, or any other language that leads to ANSI and OEM code pages that encode ² and ³ differently
- Git for Windows 2.27.0 (installed with default settings, i.e. configured to use MinTTY and with experimental support for pseudo consoles disabled)
- Source code stored in UTF-8 encoding

To get correct behavior:
- Either re-install Git for Windows 2.27.0 and enable experimental support for pseudo consoles on the last page of the installer, or upgrade to the latest 2.28 version
- Compile your code with <code>javac -encoding UTF8</code>
- Call <code>java</code> without overriding _file.encoding_
The medium version:
Git for Windows 2.27.0 uses a version of MSYS2 that does not set the code page for MinTTY by calling SetConsoleCP when support for pseudo consoles is disabled. The Java runtime determines the code page for <code>System.out</code> by calling GetConsoleCP. Since no code page is set when Java is executed within the MinTTY terminal, the call fails and Java falls back to the charset returned by <code>Charset.defaultCharset()</code>. But in a Windows installation as described above, <code>Charset.defaultCharset()</code> returns Cp-1252 while the default charset for consoles is Cp-850. The two code pages are not fully compatible, which leads to the strange output.
The long version:
Windows has two types of code pages: ANSI and OEM code pages. The first type is intended for UI applications that do not support Unicode, and the latter is used for console applications. Both types encode a single character in 1 byte, but they are not fully compatible.
Therefore, on Windows, Java has to deal with two charsets instead of one:
- <code>Charset.defaultCharset()</code> returns the ANSI code page (usually cp-1252). This charset is specified by the _file.encoding_ system property. If it is not specified as a VM argument, the java executable determines the ANSI code page and adds the system property during initialization. <code>String.getBytes()</code> uses the charset returned by <code>Charset.defaultCharset()</code>.
- <code>System.out</code> uses the OEM code page for consoles (usually cp-850). The java executable gets this code page by calling the GetConsoleCP function and sets it as the value of the internal system properties _sun.stdout.encoding_ and _sun.stderr.encoding_. When the call to GetConsoleCP fails, the charset returned by <code>Charset.defaultCharset()</code> is used. This only happens when the console in which java.exe is executed has not set the OEM code page beforehand by calling SetConsoleCP.
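
One way to look at the second charset is to read those internal properties at runtime. This is only a sketch: the class name `StdoutEncodingReport` is made up, and because the `sun.*` properties are internal, they may be `null` or behave differently depending on the JDK version and on whether output goes to a console.

```java
// Hypothetical helper: shows the internal properties that describe the
// encoding chosen for standard output and standard error.
class StdoutEncodingReport {
    static String describe() {
        String nl = System.lineSeparator();
        return "sun.stdout.encoding = " + System.getProperty("sun.stdout.encoding") + nl
             + "sun.stderr.encoding = " + System.getProperty("sun.stderr.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the reported value differs from the code page the terminal really uses, the output of <code>System.out</code> will be garbled in the way described here.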
So what happens now in the setup mentioned above?

```shell
$ javac MainDefault.java
$ java MainDefault
```

The native call to GetConsoleCP fails due to the bug in MSYS2. Therefore <code>System.out</code> falls back to the charset returned by <code>Charset.defaultCharset()</code>, which is cp-1252. But the OEM code page of the console is cp-850. Therefore <code>System.out.println("²³")</code> produces unexpected output.
The source code is stored in UTF-8. Encoding "²³" in UTF-8 requires 4 bytes. But because the _-encoding_ parameter is missing, javac assumes the default encoding, which uses one byte per character. Therefore it interprets the 4 bytes as 4 characters. <code>String.getBytes</code> uses the 1-byte ANSI code page cp-1252 and therefore returns 4 bytes.
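
The misreading of the source bytes can be reproduced in isolation. The sketch below (class and method names are illustrative) decodes the UTF-8 bytes of "²³" as windows-1252, which is Java's name for cp-1252, and then encodes the result again the way <code>String.getBytes()</code> would with that default charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SourceMisread {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decodes the UTF-8 bytes of the text as cp1252, the way javac reads a
    // UTF-8 source file when it assumes cp1252 as the source encoding.
    static String misread(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    public static void main(String[] args) {
        String intended = "\u00B2\u00B3";       // "²³", two characters
        String seenByJavac = misread(intended); // four characters
        System.out.println(intended.length() + " -> " + seenByJavac.length());
        // prints 2 -> 4
        System.out.println(Arrays.toString(seenByJavac.getBytes(CP1252)));
        // prints [-62, -78, -62, -77]
    }
}
```

The byte array matches the UTF-8 encoding of "²³" only by accident: it is really four cp-1252 characters encoded with one byte each.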
```shell
$ javac -encoding UTF8 MainDefault.java
$ java MainDefault
```

With the _-encoding UTF8_ parameter, javac interprets the UTF-8 encoded source as UTF-8. So the 4 bytes of "²³" are correctly recognized as two characters. <code>System.out</code> encodes the two characters in cp-1252, which leads to 2 bytes. But since the console still uses cp-850, the output is still corrupted. <code>String.getBytes</code> also encodes the two characters in cp-1252, which leads to 2 bytes.
```shell
$ java -Dfile.encoding=UTF8 MainDefault
```

The system property _file.encoding_ overrides the charset returned by <code>Charset.defaultCharset()</code>, which is also used by <code>String.getBytes()</code>. The two characters that javac first wrongly interpreted as 4 characters in an 8-bit encoding are now correctly encoded in UTF-8, with two bytes per character. This leads to 4 bytes. Since _file.encoding_ has no effect on the charset used by <code>System.out</code>, the 4 characters (4 and not 2, due to the wrong interpretation by javac) are still encoded in cp-1252, the console still uses cp-850, and you still get corrupted output.
Your console can print ²³ because the console's 8-bit OEM code page (cp-850) supports both characters. But it encodes them slightly differently from the ANSI code page cp-1252 that is used by <code>System.out</code>.
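
The difference between the two code pages can be checked directly. The following sketch assumes the running JRE ships the `windows-1252` and `IBM850` charsets, which are Java's names for cp-1252 and cp-850; the second one is an extended charset and may be missing on a reduced runtime, so the code guards for that. `CodePageMismatch` and its helpers are hypothetical names.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class CodePageMismatch {
    // Encodes text with the named charset; throws if this JRE does not ship it.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    // Decodes bytes with the named charset, as a console using that code page would.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String text = "\u00B2\u00B3"; // "²³"
        byte[] ansi = encode(text, "windows-1252");
        System.out.println("cp1252 bytes: " + Arrays.toString(ansi));
        if (Charset.isSupported("IBM850")) {
            System.out.println("cp850 bytes:  " + Arrays.toString(encode(text, "IBM850")));
            // What a cp850 console would make of the cp1252 bytes
            String shown = decode(ansi, "IBM850");
            System.out.println("round trip intact: " + shown.equals(text));
        }
    }
}
```

Both charsets need one byte per character for this string, but the byte values differ, so bytes written for one code page are rendered as other characters by a console using the other.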
# Answer 4
**Score**: 4
The hex codes look okay for UTF-8. Maybe your character set for Git Bash is not UTF-8. For me it looks like this:
The console output then also looks fine:
Update 2020-09-13: Here is proof that ```chcp.com <codepage>``` does not work in Git Bash (mintty). It has no effect whatsoever. You really do have to select the correct code page in the mintty settings dialogue.
Update 2020-09-15: Okay, after I read @rmunge's answer I upgraded to Git 2.28 and could reproduce the OP's problem and also use the ```chcp``` workaround (it did not work as described by @rmunge in my case). Because Git (or MSYS2, respectively) is so buggy in the latest versions and I don't wish to use ```chcp.com``` from inside Git Bash every time I open a new console, I just downgraded to version 2.15.1, which I had used for 3 years without any problems before. Maybe there are later versions without the console bug; I did not try, but just used my old installer from the downloads folder on my computer. I recommend everyone do the same for now to work around this ugly bug. With a non-buggy console version, it just works like I described.
# Answer 5
**Score**: 1
On Windows, it has to do with your code page.
You can use the ```chcp``` command to set the code page you want (e.g. if you want to set it for a specific program launch), or you can specify the charset corresponding to the code page on the ```java``` command line.
If the current code page does not support the characters you are printing, you will see garbage in the console.
Different shells may behave differently because of the code pages/charsets they load by default.
Please check out this SO post for how it is done:
https://stackoverflow.com/questions/14030811/system-out-character-encoding
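
A rough sketch of the second option, done from inside the program rather than on the command line: replace `System.out` with a `PrintStream` that encodes in the charset matching the console's code page. Passing the charset name as the first program argument is an assumption made for this example, as are the class and method names.

```java
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class ConsoleCharsetOverride {
    // Wraps the process's standard output in a PrintStream that encodes
    // with the named charset.
    static PrintStream stdoutWith(String charsetName) {
        try {
            return new PrintStream(new FileOutputStream(FileDescriptor.out), true, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // Assumed convention for this sketch: charset name as first argument,
        // for example "UTF-8" or "IBM850"; defaults to UTF-8.
        String charsetName = args.length > 0 ? args[0] : "UTF-8";
        System.setOut(stdoutWith(charsetName));
        System.out.println("\u00B2\u00B3"); // "²³"
    }
}
```

Whether the characters then display correctly still depends on the console really using the code page that corresponds to the chosen charset.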
# Answer 6
**Score**: 1
I encountered the same problem in Git Bash for Windows. ```java``` and ```javac``` cannot print Chinese characters properly. Setting git-bash's character set to UTF8 does not help. ```chcp``` does not work either. From Git Bash's installation wizard, I knew that programs like ```python``` do not work properly without ```winpty```. I had added ```alias python='winpty python'``` to ```~/.bashrc```. So I tried ```winpty java Foo.java``` and ```winpty javac Foo.java```, and luckily the problem was gone. I added these aliases to ```~/.bashrc``` to fix the problem:

```shell
alias java='winpty java'
alias javac='winpty javac'
```

The recent versions (v2.2x) of Git Bash for Windows include an experimental feature related to ```winpty```, but it still seems to have some problems, so I've kept these aliases so far.
# Answer 7
**Score**: 0
Hex ```C2B2 C2B3```, when interpreted as UTF-8, is ```²³```.
I assume you are using a Windows "cmd terminal"?
The command "chcp" controls the "code page". ```chcp 65001``` provides UTF-8, but it needs a special charset installed, too. To set the font in the console window: right-click on the title of the window → Properties → Font → pick Lucida Console.
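
The first statement is easy to verify in Java. This is a minimal sketch with made-up names (`DecodeHex`, `decodeUtf8`) that decodes the four octets as UTF-8 and compares the result with the expected two characters.

```java
import java.nio.charset.StandardCharsets;

class DecodeHex {
    // Turns a list of octet values (0..255) into a string by decoding them as UTF-8.
    static String decodeUtf8(int... octets) {
        byte[] bytes = new byte[octets.length];
        for (int i = 0; i < octets.length; i++) {
            bytes[i] = (byte) octets[i];
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = decodeUtf8(0xC2, 0xB2, 0xC2, 0xB3);
        System.out.println(decoded.length());                // prints 2
        System.out.println(decoded.equals("\u00B2\u00B3"));  // prints true
    }
}
```

The comparison is printed instead of the string itself so that the result does not depend on the console encoding being discussed.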
# Answer 8
**Score**: 0
Please verify that your Windows 10 installation does not have Unicode UTF-8 support enabled. You can see this option by going to Settings and then: All Settings -> Time & Language -> Language -> "Administrative Language Settings"
This is what it looks like; the feature should be unchecked.
Rationale:
<code>"²³".getBytes()</code> returns the encoding of the string, based on the detected default charset. On a Windows 10 system, the default charset should usually be a 1-byte encoding, regardless of whether you launch java.exe from a Windows console or from Git Bash. But your first screenshot shows a 4-byte encoding that is actually UTF-8. So your JVM seems to detect UTF-8 as the default charset, which is wrong here because it is incompatible with the code page of your console.
Your console can print ²³ because both characters are supported by the code page in use, but that encoding uses one byte per character, while UTF-8 requires 2 bytes for each of these two characters.
I have no simple explanation for your second screenshot, but be aware that Git Bash is based on MSYS2, which in turn uses the mintty terminal emulator. While MSYS2 uses UTF-8, and mintty also seems to support UTF-8, the whole thing is wrapped within a Windows console that is based on an OEM code page that is incompatible with UTF-8. The whole setup then runs on an operating system that internally uses UTF-16. Combined with a beta setting that overrides the entire OEM code page concept at the OS level, this setup provides enough complexity for some incomprehensible behavior.
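
The byte counts mentioned in the rationale can be confirmed with explicit charsets instead of the platform default. In this sketch ISO-8859-1 stands in for a 1-byte code page (an assumption made for portability; it encodes ² and ³ with the same byte values as cp-1252), and `BytesPerCharset` is a hypothetical class name.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BytesPerCharset {
    // Number of bytes needed to encode the text in the given charset.
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "\u00B2\u00B3"; // "²³"
        System.out.println(byteCount(text, StandardCharsets.ISO_8859_1)); // prints 2
        System.out.println(byteCount(text, StandardCharsets.UTF_8));      // prints 4
        System.out.println(byteCount(text, StandardCharsets.UTF_16LE));   // prints 4
    }
}
```

A 4-element array from <code>"²³".getBytes()</code> therefore indicates that the default charset is not a 1-byte code page, or that the string already contains four characters.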