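Change the encoding of all txt files from ANSI to UTF-8 in a folder and its subfolders
Question
I've tried a few approaches to complete this task; unfortunately, none of them works.
The first step is to get the paths of all txt files, which I do with this piece of code: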
```java
public static List<String> getFileList() throws IOException {
    int depth = 5;
    String dir = "pathToMainFolder";
    Stream<Path> stream = Files.walk(Paths.get(dir), depth);
    List<String> paths = new ArrayList<>();
    try {
        stream.filter(file -> !Files.isDirectory(file))
              .map(Path::toString)
              .filter(file -> file.endsWith("txt"))
              .collect(Collectors.toCollection(() -> paths));
    } catch (Exception e) {
        e.printStackTrace();
    }
    return paths;
}
```
After that, I tried changing the encoding of the files with this piece of code from <https://stackoverflow.com/questions/18141162/how-to-convert-ansi-to-utf8-in-java>:
```java
for (String path : paths) {
    Path p = Paths.get(path);
    ByteBuffer bb = ByteBuffer.wrap(Files.readAllBytes(p));
    CharBuffer cb = Charset.forName("Cp1252").decode(bb);
    bb = Charset.forName("UTF-8").encode(cb);
    Files.write(p, bb.array());
}
```
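This converted the files to UTF-8, but the content is far from what I expected. For example, `tań` comes out as `tañ`, and `choć` comes out as `choæ`.
I also tried creating new files with `BufferedReader` and `BufferedWriter`, replacing the wrong characters after the conversion. The only thing that works is `Normalizer`: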
```java
String everything = "";
BufferedReader br = new BufferedReader(new FileReader(path));
try {
    StringBuilder sb = new StringBuilder();
    String line = Normalizer.normalize(br.readLine(), Normalizer.Form.NFD)
            .replaceAll("[^\\p{ASCII}]", "");
    while (line != null) {
        System.out.println(line);
        sb.append(line);
        sb.append(System.lineSeparator());
        line = Normalizer.normalize(br.readLine(), Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }
    everything = sb.toString();
    System.out.println(everything);
} finally {
    br.close();
}
```
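But that is a last resort if there is no other solution. I should also mention that there are 14k+ files to convert in the folder and its subfolders (the files are not long: 487 lines on average, with a few characters on each line).
Is there any approach or solution for this problem?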
# Answer 1
**Score**: 1
Converting a file between two character sets does not require Char/Byte buffers. Here is a simple method that recodes using `String` and `getBytes`:
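```java
// Imports used by recode and by the main method in this answer
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Reads input as inCharset and writes the same text to output as outCharset.
private static void recode(Path input, Charset inCharset, Path output, Charset outCharset)
{
    try
    {
        Files.createDirectories(output.getParent());
        Files.write(output, new String(Files.readAllBytes(input), inCharset).getBytes(outCharset));
    }
    catch (IOException e)
    {
        throw new UncheckedIOException(e);
    }
}
```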
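One caveat about `recode`: it only yields the right text when `inCharset` matches the encoding the files were actually saved in. The `tań` to `tañ` and `choć` to `choæ` substitutions reported in the question are what you get when windows-1250 (Central European) bytes are decoded as Cp1252, because byte `0xF1` is `ń` in one and `ñ` in the other. The sketch below assumes that is the situation here, which I cannot verify without the files; the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;

public class CharsetGuess {
    // Decodes the same raw bytes with the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Assumed on-disk bytes for "tań" in a Central European ANSI file: 't', 'a', 0xF1
        byte[] raw = {0x74, 0x61, (byte) 0xF1};
        // Cp1252 maps 0xF1 to U+00F1 (n with tilde), matching the symptom in the question
        System.out.println(decode(raw, "windows-1252"));
        // windows-1250 maps 0xF1 to U+0144 (n with acute), the expected Polish letter
        System.out.println(decode(raw, "windows-1250"));
    }
}
```

If the second line prints the expected word for a sample of your own files, passing `Charset.forName("windows-1250")` as `inCharset` is probably the fix; if not, try the code page the files were originally written under.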
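Take care to run the conversion only once on your directories; for testing, it is safer to use separate input and output directories. Your `main` can be simplified with `Files.find`, which lets you process the conversion directly: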
```java
public static void main(String[] args) throws IOException
{
    Charset outCharset = StandardCharsets.UTF_8;

    // Change as required
    Charset inCharset = Charset.forName("Cp1252");
    // OR
    // Charset inCharset = StandardCharsets.XYZ;
    // OR
    // Charset inCharset = Charset.forName(System.getProperty("file.encoding"));

    int depth = 5;
    Path dir = Path.of("pathToMainFolder");
    Path outdir = Path.of("pathToMainFolder.utf8");

    // Find regular .txt files up to the given depth and recode each into the mirrored output tree
    try (Stream<Path> stream = Files.find(dir, depth, (p,a) -> a.isRegularFile() && p.getFileName().toString().endsWith(".txt")))
    {
        stream.forEach(p -> recode(p, inCharset, outdir.resolve(dir.relativize(p)), outCharset));
    }
}
```
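The "run only once" warning can be backed by a check: if a second pass decodes files that are already UTF-8 as a single-byte code page, every accented letter is mangled again. One possible guard, sketched here as a heuristic rather than a guarantee, is to skip any file whose bytes already decode as strict UTF-8. `Utf8Check` and `isValidUtf8` are hypothetical names, not part of the code above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true if the bytes decode as strict UTF-8.
    // Malformed input raises an exception instead of being replaced with U+FFFD.
    static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 't', 'a', 0xF1: a single-byte ANSI rendering of a word ending in an accented letter
        byte[] ansi = {0x74, 0x61, (byte) 0xF1};
        // The same word ("ta" + U+0144) already encoded as UTF-8
        byte[] utf8 = "ta\u0144".getBytes(StandardCharsets.UTF_8);
        System.out.println(isValidUtf8(ansi));
        System.out.println(isValidUtf8(utf8));
    }
}
```

In `recode`, this could be called on the result of `Files.readAllBytes(input)` before decoding, skipping the file when it returns true. Pure ASCII files also pass the check, which is harmless because their bytes are identical in UTF-8 and in the Windows code pages discussed here.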