September 1, 2020, 09:24:49

Reading from a file containing unmappable characters

Question

I am using File and Scanner to read through a .txt file and copy the useful information in it to a separate file. Some of these files contain Chinese characters, and they cause my Scanner to throw a java.nio.charset.UnmappableCharacterException. The Chinese characters are of no importance, so how do I make the Scanner ignore them and keep searching the rest of the file for useful information?

Here is the code:

try {
    File source = new File(this.parentDirectory + File.separator + this.fileName.getText());
    Scanner reader = new Scanner(source);
    StringBuilder str = new StringBuilder();
    while (reader.hasNextLine()) {
        str.append(reader.nextLine());
        str.append("\n");
    }
    if (reader.ioException() != null) {
        throw reader.ioException();
    }
    reader.close();
    this.input.setText(str.toString());
} catch (FileNotFoundException e1) {
    JOptionPane.showMessageDialog(this, "File not found!");
    return;
} catch (IOException e1) {
    // TODO Auto-generated catch block
    e1.printStackTrace();
}

Answer 1

Score: 0

A Scanner implicitly decodes the external sequence of bytes into the 16-bit UTF-16 code units that make up every Java String.

You need to know the actual encoding used for the external data (i.e., the file content). Then you declare your Scanner as

  Scanner reader = new Scanner(file, charset);

If the charset matches the file's real encoding, there should be no 'unmappable' characters.
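As a minimal sketch of that declaration in context, the following reads a file line by line with an explicitly named charset. The class name ScannerCharsetSketch, the method readAll, and the choice of "UTF-8" are assumptions for illustration; the right charset name depends on how the file was actually written.

```java
import java.io.File;
import java.io.IOException;
import java.util.Scanner;

class ScannerCharsetSketch {
    // Reads the file line by line, decoding its bytes with the named charset.
    static String readAll(File file, String charsetName) throws IOException {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the Scanner even if reading fails
        try (Scanner reader = new Scanner(file, charsetName)) {
            while (reader.hasNextLine()) {
                text.append(reader.nextLine()).append('\n');
            }
            // Scanner swallows I/O errors; surface the last one, if any
            if (reader.ioException() != null) {
                throw reader.ioException();
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is a hypothetical path to the input file;
        // "UTF-8" is an assumed encoding, not a known property of the file
        System.out.print(readAll(new File(args[0]), "UTF-8"));
    }
}
```

If the encoding is unknown, trying UTF-8 first and then the legacy encoding most likely for the file's origin (for example GBK for Chinese text) is a common starting point.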

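If you don't specify the charset explicitly, the platform default is used. That is UTF-8 on most Linux and macOS systems, and on every platform from Java 18 onward, but on older Java versions on Windows it is usually a legacy code page such as windows-1252.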
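To see which default actually applies on a given machine, a quick check along these lines may help. The class name DefaultCharsetSketch and the method describeDefault are hypothetical; Charset.defaultCharset() is the standard call.

```java
import java.nio.charset.Charset;

class DefaultCharsetSketch {
    // Returns a short description of the charset the JVM uses
    // when none is specified explicitly.
    static String describeDefault() {
        return "Default charset: " + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describeDefault());
    }
}
```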

Alternatively, it seems that you're not really using the Scanner to any significant degree; you're just using it to collect lines. You could drop down a level and use a FileInputStream to read the file as a sequence of bytes, and use whatever heuristics you think appropriate to determine the 'useful' parts of the file.
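One way to sketch the byte-level approach is shown below. It is an illustration under stated assumptions, not the answerer's exact method: it reads the bytes with Files.readAllBytes rather than a FileInputStream, decodes them with a CharsetDecoder configured to skip input it cannot decode, and then applies one possible heuristic (keeping only ASCII characters). The class name LenientDecodeSketch and the methods decodeLeniently and keepAscii are hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class LenientDecodeSketch {
    // Decodes bytes with the given charset, silently dropping any
    // malformed or unmappable input instead of throwing.
    static String decodeLeniently(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with IGNORE, but decode() declares it
            throw new IllegalStateException(e);
        }
    }

    // Example heuristic (an assumption): keep only ASCII characters,
    // which discards Chinese text but preserves line breaks.
    static String keepAscii(String text) {
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                kept.append(c);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is a hypothetical path to the input file
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));
        // UTF-8 is an assumed encoding for the file
        System.out.print(keepAscii(decodeLeniently(raw, StandardCharsets.UTF_8)));
    }
}
```

Dropping undecodable bytes loses data, so this only suits cases like the one in the question, where the non-ASCII content is known to be irrelevant.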
