Reading from a file containing unmappable characters
Question
I am trying to use File and Scanner to read a .txt file and copy the useful information into a separate file. Some of these files contain Chinese characters, which cause my Scanner to throw "java.nio.charset.UnmappableCharacterException". The Chinese characters are of no importance, so how do I make the Scanner ignore them and keep searching the rest of the file for useful information?
Here is the code:
try {
    File source = new File(this.parentDirectory + File.separator + this.fileName.getText());
    Scanner reader = new Scanner(source);
    StringBuilder str = new StringBuilder();
    while (reader.hasNextLine()) {
        str.append(reader.nextLine());
        str.append("\n");
    }
    if (reader.ioException() != null) {
        throw reader.ioException();
    }
    reader.close();
    this.input.setText(str.toString());
} catch (FileNotFoundException e1) {
    JOptionPane.showMessageDialog(this, "File not found!");
    return;
} catch (IOException e1) {
    // TODO Auto-generated catch block
    e1.printStackTrace();
}
Answer 1
Score: 0
A Scanner implicitly decodes an external sequence of bytes into the 16-bit UTF-16 code units used by all Java Strings.
You need to know the actual encoding used for the external data (i.e., the file content). Then you declare your Scanner as
Scanner reader = new Scanner(file, charset);
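Once the charset matches the file's real encoding, there should be no 'unmappable' characters.
If you don't specify the charset explicitly, the platform default is used. That default is UTF-8 from Java 18 onward; on earlier versions it depends on the operating system and locale, so it may not match the file.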
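As a minimal sketch of the explicit-charset approach, the question's read loop could look like the following. The class name ExplicitCharsetRead and the method readAll are hypothetical names chosen for illustration, and the charset name you pass in is an assumption you have to verify against the actual files.

```java
import java.io.File;
import java.io.IOException;
import java.util.Scanner;

// Hypothetical helper class, not part of the original question or answer.
class ExplicitCharsetRead {
    // Reads every line of the file, decoding its bytes with the named charset
    // (for example "UTF-8" or "GBK") instead of the platform default.
    static String readAll(File source, String charsetName) throws IOException {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the Scanner even if reading fails.
        try (Scanner reader = new Scanner(source, charsetName)) {
            while (reader.hasNextLine()) {
                text.append(reader.nextLine()).append('\n');
            }
            // Scanner records I/O errors instead of throwing them; surface them here.
            if (reader.ioException() != null) {
                throw reader.ioException();
            }
        }
        return text.toString();
    }
}
```

In the question's code this would replace the body of the try block, for example `this.input.setText(ExplicitCharsetRead.readAll(source, "UTF-8"));`, with "UTF-8" swapped for whatever encoding the files really use.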
Alternatively, you are not really using the Scanner for much beyond collecting lines. You could drop down a level and use a FileInputStream to read the file as a sequence of bytes, then apply whatever heuristics you think appropriate to pick out the 'useful' parts of the file.
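One possible variation on that lower-level idea, offered as a sketch and not as the only way to do it: wrap the FileInputStream in a reader whose CharsetDecoder is configured to drop bytes it cannot decode, which is close to "ignore the Chinese characters" as asked. The class name LenientFileReader and the method readIgnoringBadBytes are hypothetical. Note that silently ignoring input can hide a wrong charset choice, so this assumes the undecodable parts really are unimportant.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class, not part of the original question or answer.
class LenientFileReader {
    // Reads the file line by line, silently skipping any byte sequence
    // that the given charset cannot decode instead of throwing.
    static String readIgnoringBadBytes(File source, Charset charset) throws IOException {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(source), decoder))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }
}
```

Using CodingErrorAction.REPLACE instead of IGNORE would substitute a replacement character for each bad sequence, which makes it easier to see where data was dropped.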