Reading and writing a file in ISO-8859-1 encoding?
Question
I have a file encoded in ISO-8859-1. I'm trying to read it in as a single String, do some regex substitutions on it, and write it back out in the same encoding.
However, the resulting file I get always seems to be UTF-8 (according to Notepad++ at least), mangling some characters.
Can anyone see what I'm doing wrong here?
    private static void editFile(File source, File target) {

        // Source and target encoding
        Charset iso88591charset = Charset.forName("ISO-8859-1");

        // Read the file as a single string
        String fileContent = null;
        try (Scanner scanner = new Scanner(source, iso88591charset)) {
            fileContent = scanner.useDelimiter("\\Z").next();
        } catch (IOException exception) {
            LOGGER.error("Could not read input file as a single String.", exception);
            return;
        }

        // Do some regex substitutions on the fileContent string
        String newContent = regex(fileContent);

        // Write the file back out in target encoding
        try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(target), iso88591charset))) {
            writer.write(newContent);
        } catch (Exception exception) {
            LOGGER.error("Could not write out edited file!", exception);
        }
    }
Answer 1
Score: 2
There is nothing actually wrong with your code. Notepad++ has to guess the encoding from the bytes in the file, and for plain ASCII characters (0x00 to 0x7F) UTF-8 and ISO-8859-1 produce exactly the same bytes, so such a file can be reported as UTF-8. The two encodings only differ for characters outside that range, and ISO-8859-1 covers just the first 256 Unicode code points, so a lot of characters that UTF-8 can represent are missing from it. You can read more by searching for "ISO-8859-1 vs UTF-8" in Google.
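To make the difference concrete, here is a minimal sketch that encodes the same strings with both charsets and compares the raw bytes. The class and method names (EncodingBytesDemo, latin1, utf8) and the sample strings are made up for illustration and are not part of the original code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingBytesDemo {

    // Encode text as ISO-8859-1 (one byte per character).
    static byte[] latin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Encode text as UTF-8 (one to four bytes per code point).
    static byte[] utf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ascii = "plain text";
        // "café", written with a Unicode escape so the encoding of this source file does not matter
        String accented = "caf\u00e9";

        // ASCII-only text: both encodings produce identical bytes.
        System.out.println(Arrays.equals(latin1(ascii), utf8(ascii)));       // true

        // Text with a non-ASCII character: the bytes differ.
        System.out.println(Arrays.equals(latin1(accented), utf8(accented))); // false

        // ISO-8859-1 stores é as the single byte 0xE9, UTF-8 as the two bytes 0xC3 0xA9.
        System.out.println(latin1(accented).length + " vs " + utf8(accented).length); // 4 vs 5
    }
}
```

So a file that contains only ASCII characters looks the same in either encoding, and an editor cannot tell which one was intended.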
I've created a simple project with your code and tested it with characters that are encoded differently in ISO-8859-1 than in UTF-8. The result is a file that IntelliJ (and probably Notepad++ as well; I cannot easily check, I'm on Linux) recognizes as ISO-8859-1. Apart from that, I've added another class that makes use of the new (JDK 11) features of the Files class. The new Scanner(source, charset) constructor that you've used was added in JDK 10, so I think that you may be using 11 already. Here's the simplified code:
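    import java.io.File;
    import java.io.IOException;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;

    private static void editFile(File source, File target) {
        // Same charset for reading and writing
        Charset charset = StandardCharsets.ISO_8859_1;

        // Read the whole file as a single string (Files.readString needs JDK 11+)
        String fileContent;
        try {
            fileContent = Files.readString(source.toPath(), charset);
        } catch (IOException exception) {
            System.err.println("Could not read input file as a single String.");
            exception.printStackTrace();
            return;
        }

        // regex(...) is the substitution method from the question
        String newContent = regex(fileContent);

        // Write the result back out in the same charset (Files.writeString needs JDK 11+)
        try {
            Files.writeString(target.toPath(), newContent, charset);
        } catch (IOException exception) {
            System.err.println("Could not write out edited file!");
            exception.printStackTrace();
        }
    }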
Feel free to clone the repository or check it on GitHub and use whichever code version you prefer.
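If you'd rather not rely on what an editor guesses, you can also check the bytes yourself. The following is only a sketch under my own assumptions: the class name EncodingCheck, the helper isValidUtf8 and the temporary file are invented for this example. It writes a string containing é as ISO-8859-1 and then tests whether the written bytes would pass as strict UTF-8.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class EncodingCheck {

    // Returns true if the bytes decode as strict UTF-8.
    // Malformed input is reported as an exception instead of being silently replaced.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical temporary file, used only for this demonstration
        Path file = Files.createTempFile("latin1-check", ".txt");
        try {
            // Write "café" as ISO-8859-1, as the code above would do
            Files.writeString(file, "caf\u00e9", StandardCharsets.ISO_8859_1);

            byte[] written = Files.readAllBytes(file);
            System.out.println(written.length);       // 4: one byte per character
            System.out.println(isValidUtf8(written)); // false: a lone 0xE9 byte is not valid UTF-8
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

If the check prints true for your output file, the file most likely contains only ASCII characters, in which case both encodings are indistinguishable anyway.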