HDFS File Encoding Converter
Question
I'm trying to convert an HDFS file from UTF-8 to ISO-8859-1.
I've written a small Java program:
String theInputFileName = "my-utf8-input-file.csv";
String theOutputFileName = "my-iso8859-output-file.csv";
Charset inputCharset = StandardCharsets.UTF_8;
Charset outputCharset = StandardCharsets.ISO_8859_1;

try (
    final FSDataInputStream in = theFileSystem.open(new Path(theInputFileName));
    final FSDataOutputStream out = theFileSystem.create(new Path(theOutputFileName))
)
{
    // Read the input as UTF-8 and re-encode each line to ISO-8859-1
    try (final BufferedReader reader = new BufferedReader(new InputStreamReader(in, inputCharset)))
    {
        String line;
        while ((line = reader.readLine()) != null)
        {
            out.write(line.getBytes(outputCharset));
            // this.lineSeparator is a field of the enclosing class holding the separator to write
            out.write(this.lineSeparator.getBytes(outputCharset));
        }
    }
} catch (IllegalArgumentException | IOException e)
{
    RddFileWriter.LOGGER.error(e, "Exception on file '%s'", theOutputFileName);
}
This code is executed on a Hadoop cluster through Spark (the output data is usually provided by an RDD).
To simplify the issue, I removed the RDD/Dataset parts and work directly on an HDFS file.
When I execute the code:
- Locally on my DEV computer: it works! The local output file is encoded in ISO-8859-1.
- On the EDGE server, via spark-submit with HDFS files: it works! The HDFS output file is encoded in ISO-8859-1.
- On a Datanode via Oozie: it does not work. The HDFS output file is encoded in UTF-8 instead of ISO-8859-1.
I don't understand what property (or anything else) may be causing the change in behavior.
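A quick way to compare the environments is to log the JVM defaults each one actually uses; the snippet below relies only on standard Java APIs (where to place it in the job is left open):

// Print the platform defaults that most often differ between a local run,
// an edge node and an Oozie launcher.
System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
System.out.println("default charset = " + java.nio.charset.Charset.defaultCharset());
System.out.println("line separator  = " + java.util.Arrays.toString(
        System.lineSeparator().getBytes(java.nio.charset.StandardCharsets.UTF_8)));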
Versions:
- Hadoop: v2.7.3
- Spark: v2.2.0
- Java: 1.8
Looking forward to your help. Thanks in advance.
Answer 1
Score: 1
Finally, I found the source of my problem.
The input file on the cluster was corrupted: the file did not have a single, consistent encoding.
External data are aggregated into it daily, and the encoding was recently changed from ISO-8859-1 to UTF-8 without any notification...
To put it more simply:
- the start of the file contained badly converted characters (mojibake) where « é ê è » was expected
- the end was correctly encoded
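Locating the point where the encoding switches can be automated. Here is a minimal sketch that reports the first byte offset that is not valid UTF-8; it reads a local copy of the file with a made-up path purely to keep the example short (on HDFS you would read the stream into a byte array instead):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FindUtf8Boundary {
    public static void main(String[] args) throws Exception {
        // Hypothetical local copy of the aggregated file
        byte[] bytes = Files.readAllBytes(Paths.get("/tmp/aggregated-input.csv"));
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(8192);
        while (in.hasRemaining()) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // First byte that cannot be decoded as UTF-8
                System.out.println("Invalid UTF-8 sequence at byte offset " + in.position());
                return;
            }
            out.clear(); // discard the decoded characters, only validity matters here
        }
        System.out.println("The whole file is valid UTF-8");
    }
}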
We split the data, fixed the encoding, and merged it back to repair the input.
The final code works fine.
private void changeEncoding(
    final Path thePathInputFileName, final Path thePathOutputFileName,
    final Charset theInputCharset, final Charset theOutputCharset,
    final String theLineSeparator
) {
    try (
        final FSDataInputStream in = this.fileSystem.open(thePathInputFileName);
        final FSDataOutputStream out = this.fileSystem.create(thePathOutputFileName);
        final BufferedReader reader = new BufferedReader(new InputStreamReader(in, theInputCharset));
        final BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, theOutputCharset))
    ) {
        // Decode each line with the input charset and re-encode it with the output charset
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(line);
            writer.write(theLineSeparator);
        }
    } catch (IllegalArgumentException | IOException e) {
        LOGGER.error(e, "Exception on file '%s'", thePathOutputFileName);
    }
}
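A call to the method might look like this (the paths and the separator are just illustrative values):

changeEncoding(
    new Path("/data/in/my-utf8-input-file.csv"),       // hypothetical HDFS input path
    new Path("/data/out/my-iso8859-output-file.csv"),  // hypothetical HDFS output path
    StandardCharsets.UTF_8,
    StandardCharsets.ISO_8859_1,
    "\n");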
Stop your research!