Reading from a file containing unmappable characters

Question

I am attempting to use File and Scanner to read through a .txt file and extract the useful information into a separate file. Some of these files contain Chinese characters, and that is causing my Scanner to throw the following error: "java.nio.charset.UnmappableCharacterException:". The Chinese characters are of no importance, so how do I make the Scanner ignore them and keep searching the rest of the file for useful information?

Here is the code:

    try {
        File source = new File(this.parentDirectory + File.separator + this.fileName.getText());
        StringBuilder str = new StringBuilder();
        // try-with-resources closes the Scanner even if reading fails
        try (Scanner reader = new Scanner(source)) {
            while (reader.hasNextLine()) {
                str.append(reader.nextLine());
                str.append("\n");
            }
            // Scanner swallows IOExceptions; rethrow the last one, if any
            if (reader.ioException() != null) {
                throw reader.ioException();
            }
        }
        this.input.setText(str.toString());
    } catch (FileNotFoundException e1) {
        JOptionPane.showMessageDialog(this, "File not found!");
        return;
    } catch (IOException e1) {
        // TODO Auto-generated catch block
        e1.printStackTrace();
    }

Answer 1

Score: 0

A Scanner implicitly converts an external sequence of bytes into the 16-bit UTF-16 code units used by all Java Strings.

You need to know the actual encoding used for the external data (i.e., the file content). Then you declare your Scanner as

  Scanner reader = new Scanner(file, charset);

Once that is done correctly, there should be no 'unmappable' characters.
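
As a rough sketch of the explicit-charset approach, the snippet below reads a file with the two-argument Scanner constructor. The class name ExplicitCharsetScanner, the helper readAll, and the temporary sample file are made up for illustration, and UTF-8 is only an assumption about the file's real encoding; substitute whatever encoding the files actually use (GBK, for example).

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Scanner;

class ExplicitCharsetScanner {

    // Reads every line of the file, decoding its bytes with the named charset.
    static String readAll(File file, String charsetName) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Scanner reader = new Scanner(file, charsetName)) {
            while (reader.hasNextLine()) {
                text.append(reader.nextLine()).append('\n');
            }
            // Scanner records I/O failures instead of throwing them.
            if (reader.ioException() != null) {
                throw reader.ioException();
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample file, written as UTF-8 for the demonstration.
        File sample = File.createTempFile("sample", ".txt");
        sample.deleteOnExit();
        Files.write(sample.toPath(),
                "id=42\n\u4e2d\u6587\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(readAll(sample, "UTF-8"));
    }
}
```

If the charset name matches the file's real encoding, the Chinese text decodes cleanly and can be filtered out afterwards with ordinary string handling.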

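If you don't specify the charset explicitly, then the platform default is used. That is UTF-8 on Java 18 and later and on most Linux and macOS setups, but on older JVMs under Windows it is often a legacy code page such as windows-1252, and a file saved in a different encoding can then contain bytes with no mapping.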
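
To check which default a particular JVM is actually using, something like the following can help. The class and method names here are made up for the example.

```java
import java.nio.charset.Charset;

class DefaultCharsetCheck {

    // Name of the charset a Scanner uses when none is given explicitly.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
    }
}
```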

Alternatively, it seems that you're not really using the Scanner to any significant degree; you're just using it to collect lines. You could drop down a level and use a FileInputStream to read the file as a sequence of bytes, and use whatever heuristics you think appropriate to determine the 'useful' parts of the file.
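
If the goal really is just to skip whatever cannot be decoded, one possible sketch of the lower-level route is to wrap the FileInputStream in a reader whose CharsetDecoder is told to ignore bad input. The names LenientFileReader and readIgnoringBadBytes are invented for this example, and decoding as US-ASCII is an assumption that the useful information is plain ASCII.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class LenientFileReader {

    // Decodes the file with the given charset, silently dropping any bytes
    // that are malformed or unmappable in that charset.
    static String readIgnoringBadBytes(File file, Charset charset) throws IOException {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), decoder))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample file containing two Chinese characters in UTF-8.
        File sample = File.createTempFile("sample", ".txt");
        sample.deleteOnExit();
        Files.write(sample.toPath(),
                "name=\u4e2d\u6587\nid=42\n".getBytes(StandardCharsets.UTF_8));
        // Decoding as US-ASCII drops every non-ASCII byte.
        System.out.print(readIgnoringBadBytes(sample, StandardCharsets.US_ASCII));
    }
}
```

This works cleanly for UTF-8 files, where every byte of a non-ASCII character has the high bit set. In encodings such as GBK the second byte of a character can fall in the ASCII range, so stray ASCII characters may survive; in that case decoding with the correct charset and filtering afterwards is the safer choice. Using CodingErrorAction.REPLACE instead of IGNORE would leave a visible marker where characters were dropped.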

huangapple
  • Posted on September 1, 2020 at 09:24:49
  • Please retain the link to this article when reposting: https://go.coder-hub.com/63680102.html