HDFS File Encoding Converter

Question


I'm trying to convert an HDFS file from UTF-8 to ISO-8859-1.

I've written a small Java program:

String theInputFileName = "my-utf8-input-file.csv";
String theOutputFileName = "my-iso8859-output-file.csv";
Charset inputCharset = StandardCharsets.UTF_8;
Charset outputCharset = StandardCharsets.ISO_8859_1;

try (
    final FSDataInputStream in = theFileSystem.open(new Path(theInputFileName));
    final FSDataOutputStream out = theFileSystem.create(new Path(theOutputFileName))
)
{
    try (final BufferedReader reader = new BufferedReader(new InputStreamReader(in, inputCharset)))
    {
        String line;
        while ((line = reader.readLine()) != null)
        {
            out.write(line.getBytes(outputCharset));
            out.write(System.lineSeparator().getBytes(outputCharset));
        }
    }
} catch (IllegalArgumentException | IOException e)
{
    RddFileWriter.LOGGER.error(e, "Exception on file '%s'", theOutputFileName);
}

This code is executed on a Hadoop cluster through Spark (the output data is usually provided by an RDD).

To simplify the issue, I have removed the RDD/Dataset parts and work directly on an HDFS file.

When I execute the code:

  • Locally on my DEV computer: it works! The local output file is encoded in ISO-8859-1.
  • On the EDGE server, via spark-submit with HDFS files: it works! The HDFS output file is encoded in ISO-8859-1.
  • On a datanode via Oozie: it does not work. The HDFS output file is encoded in UTF-8 instead of ISO-8859-1.

I don't understand which property (or something else) could be causing this change in behavior.
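Not from the original post, but a common first check for this kind of environment-dependent behavior: the JVM default charset and `file.encoding` system property can differ between a local run, a spark-submit job, and an Oozie launcher. The snippet above passes explicit charsets, so the default should not matter here, yet logging it in each environment helps rule out JVM-level differences. A minimal diagnostic sketch:

```java
import java.nio.charset.Charset;

// Diagnostic sketch (not from the original post): log the JVM default
// charset in each environment (local, spark-submit, Oozie launcher) to
// rule out JVM-level differences such as -Dfile.encoding.
public class CharsetDiagnostic {
    public static void main(String[] args) {
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset());
    }
}
```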

Versions:

  • Hadoop: v2.7.3
  • Spark: v2.2.0
  • Java: 1.8

Looking forward to your help. Thanks in advance.

Answer 1

Score: 1

Finally, I found the source of my problem.

The input file on the cluster was corrupted: the file did not have a single, consistent encoding.

The external data are aggregated daily, and the encoding had recently been changed from ISO-8859-1 to UTF-8 without notification...

To put it more simply:

  • the start of the file contained badly converted text, "« Ã© Ãª Ã¨ »" instead of "« é ê è »"
  • the end was correctly encoded
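The garbling mechanism can be reproduced in a couple of lines (an illustrative sketch, not from the original post): encoding text as UTF-8 and then decoding those bytes as ISO-8859-1 produces exactly this kind of two-character noise, because each accented character becomes two bytes in UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch: "é" (U+00E9) is the two bytes 0xC3 0xA9 in UTF-8;
// read back as ISO-8859-1, those bytes decode to "Ã©".
public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "\u00e9 \u00ea \u00e8"; // "é ê è"
        // Encode as UTF-8, then (wrongly) decode as ISO-8859-1:
        String garbled = new String(original.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(garbled); // prints "Ã© Ãª Ã¨"
    }
}
```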

We split the file, fixed the encoding of each part, and merged the data to repair the input.

The final code works fine.

private void changeEncoding(
        final Path thePathInputFileName, final Path thePathOutputFileName,
        final Charset theInputCharset, final Charset theOutputCharset,
        final String theLineSeparator
) {
    try (
        final FSDataInputStream in = this.fileSystem.open(thePathInputFileName);
        final FSDataOutputStream out = this.fileSystem.create(thePathOutputFileName);
        final BufferedReader reader = new BufferedReader(new InputStreamReader(in, theInputCharset));
        final BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, theOutputCharset))
    ) {
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(line);
            writer.write(theLineSeparator);
        }
    } catch (IllegalArgumentException | IOException e) {
        LOGGER.error(e, "Exception on file '%s'", thePathOutputFileName);
    }
}
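To catch this kind of corruption earlier, the input can be validated with a strict `CharsetDecoder` before converting. A hedged sketch (a hypothetical helper, not part of the original post; in the HDFS case the stream would come from `FileSystem.open`):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Sketch: a decoder set to REPORT (instead of the default REPLACE) throws on
// the first byte sequence that is invalid in the expected charset, rather
// than silently substituting a replacement character.
public final class EncodingValidator {

    // Returns -1 if the stream decodes cleanly, otherwise the (approximate,
    // since the reader buffers ahead) 1-based line where decoding failed.
    public static long firstInvalidLine(final InputStream in, final Charset charset)
            throws IOException {
        final CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        long lineNo = 0;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, decoder))) {
            while (reader.readLine() != null) {
                lineNo++;
            }
            return -1; // whole stream decoded cleanly
        } catch (CharacterCodingException e) {
            return lineNo + 1; // first line that failed to decode
        }
    }
}
```

Running such a check on the mixed-encoding file described above would have flagged the bad region before conversion, instead of letting it pass through silently.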

Stop your research!


huangapple
  • Posted on 2020-10-06 23:55:26
  • Please keep this link when reposting: https://go.coder-hub.com/64229496.html