HDFS File Encoding Converter
Question
I'm trying to convert an HDFS file from UTF-8 to ISO-8859-1.
I've written a small Java program:
String theInputFileName = "my-utf8-input-file.csv";
String theOutputFileName = "my-iso8859-output-file.csv";
Charset inputCharset = StandardCharsets.UTF_8;
Charset outputCharset = StandardCharsets.ISO_8859_1;

try (
    final FSDataInputStream in = theFileSystem.open(new Path(theInputFileName));
    final FSDataOutputStream out = theFileSystem.create(new Path(theOutputFileName))
)
{
    // Read the input as UTF-8 and re-encode each line to ISO-8859-1
    try (final BufferedReader reader = new BufferedReader(new InputStreamReader(in, inputCharset)))
    {
        String line;
        while ((line = reader.readLine()) != null)
        {
            out.write(line.getBytes(outputCharset));
            // this.lineSeparator is a field of the enclosing class holding the separator to write
            out.write(this.lineSeparator.getBytes(outputCharset));
        }
    }
} catch (IllegalArgumentException | IOException e)
{
    RddFileWriter.LOGGER.error(e, "Exception on file '%s'", theOutputFileName);
}
This code is executed on a Hadoop cluster through Spark (the output data is usually provided by an RDD).
To simplify the issue, I removed the RDD/Dataset parts and work directly on an HDFS file.
When I execute the code:
- Locally on my DEV computer: it works! The local output file is encoded in ISO-8859-1.
- On the EDGE server, via spark-submit with HDFS files: it works! The HDFS output file is encoded in ISO-8859-1.
- On a Datanode via Oozie: it does not work. The HDFS output file is encoded in UTF-8 instead of ISO-8859-1.
I don't understand what property (or anything else) may be causing the change in behavior.
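A quick way to compare the environments is to log the JVM defaults each one actually uses; the snippet below relies only on standard Java APIs (where to place it in the job is left open):

// Print the platform defaults that most often differ between a local run,
// an edge node and an Oozie launcher.
System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
System.out.println("default charset = " + java.nio.charset.Charset.defaultCharset());
System.out.println("line separator  = " + java.util.Arrays.toString(
        System.lineSeparator().getBytes(java.nio.charset.StandardCharsets.UTF_8)));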
Versions:
- Hadoop: v2.7.3
- Spark: v2.2.0
- Java: 1.8
Looking forward to your help. Thanks in advance.
Answer 1
Score: 1
Finally, I found the source of my problem.
The input file on the cluster was corrupted: the file did not have a single, consistent encoding.
External data are aggregated into it daily, and the encoding was recently changed from ISO-8859-1 to UTF-8 without any notification...
To put it more simply:
- the start of the file contained badly converted characters (mojibake) where « é ê è » was expected
- the end was correctly encoded
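Locating the point where the encoding switches can be automated. Here is a minimal sketch that reports the first byte offset that is not valid UTF-8; it reads a local copy of the file with a made-up path purely to keep the example short (on HDFS you would read the stream into a byte array instead):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FindUtf8Boundary {
    public static void main(String[] args) throws Exception {
        // Hypothetical local copy of the aggregated file
        byte[] bytes = Files.readAllBytes(Paths.get("/tmp/aggregated-input.csv"));
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(8192);
        while (in.hasRemaining()) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // First byte that cannot be decoded as UTF-8
                System.out.println("Invalid UTF-8 sequence at byte offset " + in.position());
                return;
            }
            out.clear(); // discard the decoded characters, only validity matters here
        }
        System.out.println("The whole file is valid UTF-8");
    }
}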
We split the data, fixed the encoding, and merged it back to repair the input.
The final code works fine.
private void changeEncoding(
    final Path thePathInputFileName, final Path thePathOutputFileName,
    final Charset theInputCharset, final Charset theOutputCharset,
    final String theLineSeparator
) {
    try (
        final FSDataInputStream in = this.fileSystem.open(thePathInputFileName);
        final FSDataOutputStream out = this.fileSystem.create(thePathOutputFileName);
        final BufferedReader reader = new BufferedReader(new InputStreamReader(in, theInputCharset));
        final BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, theOutputCharset))
    ) {
        // Decode each line with the input charset and re-encode it with the output charset
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(line);
            writer.write(theLineSeparator);
        }
    } catch (IllegalArgumentException | IOException e) {
        LOGGER.error(e, "Exception on file '%s'", thePathOutputFileName);
    }
}
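A call to the method might look like this (the paths and the separator are just illustrative values):

changeEncoding(
    new Path("/data/in/my-utf8-input-file.csv"),       // hypothetical HDFS input path
    new Path("/data/out/my-iso8859-output-file.csv"),  // hypothetical HDFS output path
    StandardCharsets.UTF_8,
    StandardCharsets.ISO_8859_1,
    "\n");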
Stop your research!