Java - Unable to read foreign characters

Question

I have successfully used the ISO8859-13 character encoding before, but this time it doesn't seem to be working.

According to https://en.wikipedia.org/wiki/ISO/IEC_8859-13, ä is a valid character in that encoding.

These are the 3 characters stored in the file.
äää

Here is the code being used.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class ReadFile
{
    public static void main(String[] arguments)
    {
        try
        {
            File inFile = new File("C:\\Downloads\\MyFile.txt");
            if (inFile.exists())
            {
                System.out.println("File found");
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(new FileInputStream(inFile), "ISO8859-13"));

                String line = null;

                while ( (line = in.readLine()) != null )
                {
                    System.out.println("Line Read: >" + line + "<");
                }
            }
            else
            {
                System.out.println("File not found");
            }
            }
        }
        catch (IOException e)
        {
        }

    }
}

The output is the same on both Windows and Linux, with or without Eclipse:

 Line Read: >?¤?¤?¤<

This previously worked for a number of other characters, so why not for this one?

Answer 1

Score: 3

There are several possible explanations for what you are observing. Here are the two most likely ones, along with some code you can use to confirm which one is the cause:

Option #1: Terminal issues

Maybe you are writing this output to a terminal that cannot render ä, or there is a terminal transfer issue. Terminals are, in the end, just a bunch of streams and pipes hooked together, and they carry bytes under the hood, so if one part of the chain assumes those bytes are UTF-8 encoded text and another assumes ISO-8859-13, you get problems. Given that you see the exact same output on Windows and on Linux, this is unlikely (it would be particularly likely if you were seeing this only in the 'console' view of an IDE, or if the same code produced different output on different systems). If you want to test it, run this instead: System.out.println("Unicode code point of the first character: " + (int) line.charAt(0)); - this should print 228, which is the Unicode code point for ä. If it prints 228, the file was read correctly and the problem lies in how the terminal displays the output. If it doesn't, then you can be certain the terminal isn't the (only) problem.
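To extend that check from the first character to the whole line, here is a minimal sketch. The class and method names (CodePointDump, codePoints) are made up for illustration and are not part of the question's code. It prints every code point as a decimal number, so the result does not depend on whether the terminal can render ä:

```java
public class CodePointDump {
    // Returns the Unicode code points of the text as space-separated decimal numbers.
    public static String codePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(cp);
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\u00e4" is ä, written as an escape so the encoding of this source file cannot interfere.
        String line = "\u00e4\u00e4\u00e4";
        System.out.println(codePoints(line)); // prints "228 228 228"
    }
}
```

In the question's program you would call codePoints(line) inside the read loop instead of printing the line itself. Only digits and spaces reach the terminal, so any value other than 228 points at the decoding step rather than the display.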

If this is it, the fix is to use another terminal or adjust its settings. I'd ask a separate SO question and give plenty of detail on your setup (which OS, which terminal client, what SET prints, whether the client has encoding options, etcetera).

Option #2: It's not actually ISO-8859-13

This, too, is simple to test: comment out your BufferedReader in = .... line (and the read loop that uses it) and replace it with: System.out.println(new FileInputStream(inFile).read()); - this should print 228. If it prints anything else, your input file is not actually ISO-8859-13.
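If you would rather see every byte than only the first one, a small standalone sketch like the following could help. The class name RawByteDump and the helper unsignedBytes are hypothetical, and the default path is simply the one from the question. It reads the file without any charset decoding and prints each byte as an unsigned value, the same range that InputStream.read() returns:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RawByteDump {
    // Formats bytes as unsigned decimal values (0-255), space-separated.
    public static String unsignedBytes(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(b & 0xFF); // Java bytes are signed; mask to get 0-255
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // The path is taken from the question; pass another path as the first argument if needed.
        String path = args.length > 0 ? args[0] : "C:\\Downloads\\MyFile.txt";
        byte[] raw = Files.readAllBytes(Paths.get(path));
        System.out.println(raw.length + " bytes: " + unsignedBytes(raw));
    }
}
```

For a file holding only äää, the output should show three bytes with value 228 if the file really is ISO-8859-13.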

If this is it, find out what the encoding actually is and use that instead. For example, in UTF-8, ä takes up 2 bytes in a file. That means an input file containing just äää, without even a trailing newline, is 6 bytes long (in ISO-8859-13 it would be 3), and the raw bytes, as you read them with fileInputStream.read(), are, in order: 195 164 195 164 195 164. So, if you run the above code and it prints 195 instead of 228, your input is probably in UTF-8; it's definitely not in ISO-8859-13.
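The byte values above can be reproduced without any file. The sketch below uses made-up names (DecodeComparison, decode) and assumes the JVM ships the ISO-8859-13 charset, which the question's code already relies on. It encodes ä as UTF-8 and then decodes those two bytes both ways:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeComparison {
    // Decodes raw bytes using the named charset.
    public static String decode(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "\u00e4" is ä; in UTF-8 it becomes the two bytes 195 164 (0xC3 0xA4).
        byte[] utf8Bytes = "\u00e4".getBytes(StandardCharsets.UTF_8);

        String asIso = decode(utf8Bytes, "ISO-8859-13");
        String asUtf8 = decode(utf8Bytes, "UTF-8");

        // Print code points rather than characters so the terminal cannot distort the result.
        System.out.println("as ISO-8859-13: " + (int) asIso.charAt(0) + " " + (int) asIso.charAt(1));
        System.out.println("as UTF-8: " + (int) asUtf8.charAt(0));
    }
}
```

Decoded as ISO-8859-13, the two bytes become two characters: U+0106 (Ć) followed by U+00A4 (¤). If the console cannot display Ć and substitutes a question mark, that would be consistent with the ?¤?¤?¤ pattern in the question's output, though that part is an inference rather than something the question confirms. Decoded as UTF-8, the same bytes give the single code point 228.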

huangapple
  • Published on 2020-10-22 05:26:33
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64471954.html