Convert all txt files in a folder and its subfolders from ANSI to UTF-8

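Question

I tried several approaches to this task, but unfortunately none of them worked.
The first step is to collect the paths of all txt files, which I do with this code: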

```java
public static List<String> getFileList() throws IOException {
    int depth = 5;
    String dir = "pathToMainFolder";
    Stream<Path> stream = Files.walk(Paths.get(dir), depth);
    List<String> paths = new ArrayList<>();
    try {
        stream.filter(file -> !Files.isDirectory(file))
                .map(Path::toString)
                .filter(file -> file.endsWith("txt"))
                .collect(Collectors.toCollection(() -> paths));
    } catch (Exception e) {
        e.printStackTrace();
    }
    return paths;
}
```

After that, I tried changing the files' encoding with this code from https://stackoverflow.com/questions/18141162/how-to-convert-ansi-to-utf8-in-java:

```java
for (String path : paths) {
    Path p = Paths.get(path);
    ByteBuffer bb = ByteBuffer.wrap(Files.readAllBytes(p));
    CharBuffer cb = Charset.forName("Cp1252").decode(bb);
    bb = Charset.forName("UTF-8").encode(cb);
    Files.write(p, bb.array());
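}
```

This did convert the files to UTF-8, but the content is far from what I expected. For example, it should be `tań` but it is `tañ`; it should be `choć` but it is `choæ`.

I also tried creating new files with `BufferedReader` and `BufferedWriter`, and replacing all the wrong characters after the conversion. The only thing that works is `Normalizer`:

```java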
String everything = "";
BufferedReader br = new BufferedReader(new FileReader(path));
try {
    StringBuilder sb = new StringBuilder();
    String line = Normalizer.normalize(br.readLine(), Normalizer.Form.NFD)
            .replaceAll("[^\\p{ASCII}]", "");

    while (line != null) {
        System.out.println(line);
        sb.append(line);
        sb.append(System.lineSeparator());
        line = Normalizer.normalize(br.readLine(), Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }
    everything = sb.toString();
    System.out.println(everything);
} finally {
    br.close();
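}
```

But that is the last thing I will do, only if there is no other solution. I should also mention that there are 14k+ files to convert in the folder and subfolders (the files are not long: 487 lines on average, with a few characters per line).

Is there any approach or solution for this problem?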




# Answer 1
**Score**: 1

Converting a file between two character sets does not require Char/Byte buffers. Here is a simple recode method using `String` and `getBytes`:

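```java
// Needs: java.io.IOException, java.io.UncheckedIOException,
// java.nio.charset.Charset, java.nio.file.Files, java.nio.file.Path
private static void recode(Path input, Charset inCharset, Path output, Charset outCharset)
{
    try
    {
        // Make sure the target directory exists before writing into it
        Files.createDirectories(output.getParent());
        // Decode the raw bytes with the source charset, then re-encode with the target charset
        Files.write(output, new String(Files.readAllBytes(input), inCharset).getBytes(outCharset));
    }
    catch (IOException e)
    {
        // Wrap in an unchecked exception so the method can be called from a Stream lambda
        throw new UncheckedIOException(e);
    }
}
```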
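
As a side note on why dropping the buffers helps: `ByteBuffer.array()` returns the whole backing array, which may be longer than the encoded content, so writing it directly can append stray bytes. The sketch below is only an illustration (the class `BufferArrayPitfall` and the helper `validBytes` are hypothetical names, not from the question). It copies just the bytes between position and limit and compares them with what `String.getBytes` produces.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BufferArrayPitfall {

    // Copies only the valid bytes (position..limit) out of an encoded buffer.
    static byte[] validBytes(ByteBuffer bb) {
        byte[] out = new byte[bb.remaining()];
        bb.duplicate().get(out); // duplicate() leaves the caller's position untouched
        return out;
    }

    public static void main(String[] args) {
        String text = "ta\u0144"; // "tań", escaped so the source file encoding does not matter
        ByteBuffer bb = StandardCharsets.UTF_8.encode(CharBuffer.wrap(text));

        // The backing array is allowed to be larger than the encoded content.
        System.out.println("backing array length: " + bb.array().length);
        System.out.println("encoded byte count:   " + bb.remaining());

        // Copying only the valid region matches the String-based route.
        byte[] viaBuffer = validBytes(bb);
        byte[] viaString = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("same bytes: " + Arrays.equals(viaBuffer, viaString));
    }
}
```

Going through `String.getBytes`, as `recode` does, sidesteps the question of which part of the array is valid.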

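Take care to run the conversion only once on a given set of files, since converting files that are already UTF-8 would garble them; for testing it is safer to use separate input and output directories. Your `main` can be simplified with `Files.find` and perform the conversion directly: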

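```java
// Also needs: java.nio.charset.StandardCharsets, java.util.stream.Stream (Path.of requires Java 11+)
public static void main(String[] args) throws IOException
{
    Charset outCharset = StandardCharsets.UTF_8;

    // Change as required
    Charset inCharset = Charset.forName("Cp1252");
    // OR
    // Charset inCharset  = StandardCharsets.XYZ;
    // OR
    // Charset inCharset = Charset.forName(System.getProperty("file.encoding"));

    int depth = 5;
    Path dir = Path.of("pathToMainFolder");
    Path outdir = Path.of("pathToMainFolder.utf8");
    // try-with-resources closes the directory handles opened by Files.find
    try (Stream<Path> stream = Files.find(dir, depth, (p,a) -> a.isRegularFile() && p.getFileName().toString().endsWith(".txt")))
    {
        // Mirror each file's relative path under outdir and write the recoded copy there
        stream.forEach(p -> recode(p, inCharset, outdir.resolve(dir.relativize(p)), outCharset));
    }
}
```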
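
One possible reason for the `tañ`/`choæ` output in the question: the two example words look Polish, and in windows-1250 (the Central European ANSI code page) the bytes 0xF1 and 0xE6 are `ń` and `ć`, while in Cp1252 they are `ñ` and `æ`. That the files are windows-1250 is an assumption drawn from those two words only, so treat the following as a quick check rather than a diagnosis. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;

public class AnsiCodePageCheck {

    // Decodes the same raw bytes with a caller-supplied single-byte code page.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "ta" followed by byte 0xF1, and "cho" followed by byte 0xE6
        // (assumed to be what the original files contain)
        byte[] tan = {'t', 'a', (byte) 0xF1};
        byte[] choc = {'c', 'h', 'o', (byte) 0xE6};

        // Print both interpretations side by side for comparison.
        for (String name : new String[] {"windows-1252", "windows-1250"}) {
            System.out.println(name + ": " + decode(tan, name) + " " + decode(choc, name));
        }
    }
}
```

If the windows-1250 line shows the expected words on your data, passing `Charset.forName("windows-1250")` as `inCharset` in `main` would be the thing to try, ideally on a copy of a few files first.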





Posted by huangapple on 2020-08-27 15:19:08