Java incorrectly reading accented characters from System.in

Question

If you are facing the same problem and your characters are covered by an ANSI text encoding (code page 1252 or "ISO 8859-1"), you could use that encoding to work around the UTF-8 problem for now. UTF-8 remains the modern standard, though, because it covers every script and is the better choice for full localisation.

I'm creating an application that has to read user input containing accented characters from the console. From what I've read online, modern consoles can output accented characters and encode input correctly, even though the characters show up as ? before the command is sent.

PS C:\> echo ?
ü
PS C:\>

Note: this behaviour is not reproducible in Command Prompt. Command Prompt, when run in Windows Terminal, seems to display accented characters correctly before sending as well.

However, when running the following test code:

package com.test.outputtest;

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.nio.file.*;

public class OutputTest {

    public static void main(String[] args) {
        // Set I/O to use UTF-8
        System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true, StandardCharsets.UTF_8));

        // Create the response listener
        Scanner input = new Scanner(System.in, StandardCharsets.UTF_8);

        System.out.println(Arrays.toString("èéëê".getBytes(StandardCharsets.UTF_8)));
        String temp = input.nextLine();
        System.out.println(Arrays.toString(temp.getBytes(StandardCharsets.UTF_8)));
    }

}

this is the output (after building the artifact "app.jar"):

PS C:\Users\[name]\Desktop\output_test> chcp 65001
Active code page: 65001
PS C:\Users\[name]\Desktop\output_test> java "-Dfile.encoding=UTF-8" -jar app.jar
[-61, -88, -61, -87, -61, -85, -61, -86]
èéëê
[0, 0, 0, 0]

The first array of bytes comes from the hard-coded string; the second array holds the bytes of the string read from input. Because echo outputs accented characters correctly, I believe this is a compiler error, but I'm not sure how to fix it. I've tried replacing the Scanner with Console, but that gave me the same result.
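One way to narrow this down is to look at the raw bytes that arrive on System.in before any charset decoding takes place. The following is a minimal diagnostic sketch, not part of the original program; the class name RawInputDump and its helper methods are hypothetical. If the hex dump already shows 00 for each accented character, the data was lost before Java decoded it, so neither Scanner nor the compiler would be responsible.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical diagnostic helper: dumps the raw bytes of one input line.
public class RawInputDump {

    // Reads bytes up to (not including) the first '\n' or end of stream.
    public static byte[] readLineBytes(InputStream in) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            int b;
            while ((b = in.read()) != -1 && b != '\n') {
                if (b != '\r') {   // skip carriage returns (CRLF line endings on Windows)
                    buffer.write(b);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Formats bytes as space-separated two-digit hex values, e.g. "C3 BC".
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte value : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", value & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Type a line containing accented characters and press Enter.
        System.out.println(toHex(readLineBytes(System.in)));
    }
}
```

For reference, a correctly delivered UTF-8 "ü" would appear as C3 BC, while the same character in code page 1252 would appear as the single byte FC.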

When the program runs inside IntelliJ, the ü is read correctly when I type it in the terminal. This is another reason why I suspect a problem during compilation.
When I run it from Command Prompt instead of PowerShell, the same error occurs.

Note: I'm using Windows Terminal running PowerShell and using IntelliJ Idea Community Edition 2021.3. I have not edited the .xml files besides the artifact building file path and some other project-specific file paths.

  • OS: Windows 10 build 19045.2728
  • Java version: 17.0.6 (Also in IntelliJ)
  • Default codepage: 850 (OEM)
  • Codepage in which the error occurred: 65001 (UTF-8)

Answer 1

Score: 1


I can reproduce your problem, but I see nothing wrong with your code and I have no easy solution. Incredibly, it seems that even with the most recent versions of Java (18, 19, 20), reading UTF-8 characters from a Windows console remains problematic.

This is formally documented in JDK bug JDK-8295672, "Provide a better alternative to reading System.in", which is open and unresolved. It states (with my emphasis added):

> Reading System.in is problematic as it is an input stream encoded in
> the host's encoding. With the JEP 400, there are cases where the
> default encoding (UTF-8) and host's native encoding differ. To read
> the bytes correctly, users would have to convert the bytes
> native-to-default, which seems to be an obstacle for basic usage.
> Providing a better means to access (w/o considering encoding stuff)
> would be appropriate.

So setting the default charset to UTF-8 does not resolve the issue because the "host's native encoding" is not UTF-8, and there is nothing you can do about that (at least with respect to cmd.exe and PowerShell on Windows).
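To make the mismatch concrete, here is a small sketch (an illustration only, not a confirmed fix for the zero bytes shown in the question). It decodes the same byte with two charsets and asks the JVM which charset it reports for the console. It assumes Java 17 or later, where Console.charset() is available; the class name ConsoleCharsetSketch and its methods are hypothetical.

```java
import java.io.Console;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of a host-encoding vs. UTF-8 mismatch.
public class ConsoleCharsetSketch {

    // Charset the JVM reports for the attached console (Java 17+),
    // or the default charset when System.console() returns null.
    public static Charset consoleCharset() {
        Console console = System.console();
        return console != null ? console.charset() : Charset.defaultCharset();
    }

    // Decodes raw input bytes with an explicitly chosen charset.
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xFC};   // "ü" in ISO-8859-1 / windows-1252
        System.out.println("Console charset: " + consoleCharset());
        System.out.println("as ISO-8859-1:   " + decode(raw, StandardCharsets.ISO_8859_1));
        System.out.println("as UTF-8:        " + decode(raw, StandardCharsets.UTF_8));
    }
}
```

Decoding the byte FC as UTF-8 yields the replacement character U+FFFD because FC is not a valid UTF-8 sequence. A Scanner constructed with the charset returned by consoleCharset(), rather than a hard-coded UTF-8, would at least decode with the encoding the JVM believes the console is using; whether that helps in a given terminal is something to verify rather than assume.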

Notes:

  • One of the goals stated in JEP 400 (UTF-8 by Default) is:

> Standardize on UTF-8 throughout the standard Java APIs, except for
> console I/O.

  • Presumably console I/O was excluded from JEP 400 because of the "host's encoding" issue mentioned above.
  • An obvious question is why your code works when run within IntelliJ. I suspect that is because JetBrains uses JNA to read the input from their console, but that's just a guess.

huangapple
  • Published on April 4, 2023, 18:11:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75928132.html