How to write a Spark DataFrame as a tab-delimited text file using Java

Question

I have a Spark Dataset<Row> with many columns that needs to be written to a text file with a tab delimiter. With the CSV format it is easy to specify that option, but how can this be done for a text file when using Java?

Answer 1

Score: 3

Option 1:

    yourDf
        .coalesce(1) // only if you want a single output file
        .write()
        .option("sep", "\t")
        .option("encoding", "UTF-8")
        .csv("outputpath");

This is the same as writing CSV, except that the delimiter is set to a tab.
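If the same writer settings are reused in several places, they can be collected in one map. The sketch below is only an illustration: `TsvOptions` and `tabDelimited` are hypothetical names, not part of Spark, and the map holds nothing more than the two options used in Option 1.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a Spark class): groups the CSV writer options for tab-delimited output.
final class TsvOptions {
    private TsvOptions() {}

    /** Returns the "sep" and "encoding" options used in Option 1. */
    static Map<String, String> tabDelimited() {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("sep", "\t");          // field delimiter: a single tab character
        options.put("encoding", "UTF-8");  // charset of the written files
        return options;
    }
}
```

Assuming `yourDf` is a `Dataset<Row>`, the map could then be handed to the writer as `yourDf.coalesce(1).write().options(TsvOptions.tabDelimited()).csv("outputpath");`.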

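Yes, the output is CSV, as you mentioned in the comment. If you want to rename the output path, you can do the following:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // FileSystem.get and rename can throw IOException, so call this from code that handles it.
    FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
    fs.rename(new Path("outputpath"), new Path("outputpath.txt"));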

Note:

  1. If there are multiple files under your output path, you can use fs.globStatus to list them. In this case coalesce(1) produces a single CSV file, so that is not needed.
  2. If you are using S3 instead of HDFS, you may need to set the following before attempting the rename:
    spark.sparkContext().hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
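Spark writes the output path as a directory that contains one or more `part-*` files, so it is sometimes useful to pull the single part file out under a chosen name. The following is a minimal sketch of that step, assuming the output lives on the local filesystem and contains exactly one part file (for example after `coalesce(1)`). It uses `java.nio.file` rather than the Hadoop `FileSystem` API, and `PartFileRenamer` and `moveSinglePartFile` are hypothetical names. On HDFS or S3 the `fs.globStatus` call mentioned in the note would play the role of the directory listing here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for local output directories; not part of Spark or Hadoop.
final class PartFileRenamer {
    private PartFileRenamer() {}

    /**
     * Moves the single part-* file found in outputDir to target and returns target.
     * Fails if the directory contains no part file or more than one.
     */
    static Path moveSinglePartFile(Path outputDir, Path target) {
        try {
            Path found = null;
            // The glob matches only visible part files, not _SUCCESS or hidden checksum files.
            try (DirectoryStream<Path> parts = Files.newDirectoryStream(outputDir, "part-*")) {
                for (Path part : parts) {
                    if (found != null) {
                        throw new IllegalStateException("More than one part file in " + outputDir);
                    }
                    found = part;
                }
            }
            if (found == null) {
                throw new IllegalStateException("No part file in " + outputDir);
            }
            return Files.move(found, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call might look like `PartFileRenamer.moveSinglePartFile(Paths.get("outputpath"), Paths.get("outputpath.txt"));`, which leaves the now-empty output directory (and its `_SUCCESS` marker) in place.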

Option 2:

Another option, if you don't want to use the CSV API, could look like this:

    yourDf.javaRDD()
        .coalesce(1)
        .map(row -> row.mkString("\t"))
        .saveAsTextFile("yourfile.txt");
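To make the line format of Option 2 concrete, here is a plain-Java sketch of joining one row's values with tabs, with no Spark dependency. `TabLine` and `join` are hypothetical names used only for illustration. Unlike the CSV writer, this kind of join does no quoting or escaping, so a value that itself contains a tab would shift the columns.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper (not a Spark class): formats one row as a tab-delimited line.
final class TabLine {
    private TabLine() {}

    /** Joins the values with tab characters; a null value is written as the text "null". */
    static String join(List<?> values) {
        return values.stream()
                .map(value -> String.valueOf(value))
                .collect(Collectors.joining("\t"));
    }
}
```

For example, `TabLine.join(Arrays.asList("a", 1, null))` yields the three fields `a`, `1` and `null` separated by tabs.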

huangapple
  • Posted on April 6, 2020, 23:36:47
  • When reposting, please keep the link to this article: https://go.coder-hub.com/61063446.html