2020年4月6日 23:36:47go评论89阅读模式

英文:

How to write a spark dataframe tab delimited as a text file using java

问题

我有一个包含许多列的Spark Dataset<Row>，需要将其写入一个以制表符分隔的文本文件中。使用csv格式很容易指定该选项，但是在Java中使用文本文件时该如何处理呢？

英文:

I have a Spark Dataset<Row> with lot of columns that have to be written to a text file with a tab delimiter. With csv its easy to specify that option, but how to handle this for a text file when using Java?

答案1

得分: 3

Option 1:

yourDf
.coalesce(1) // 如果你想保存为单个文件
.write
.option("sep", "\t")
.option("encoding", "UTF-8")
.csv("outputpath")

与写入 CSV 相同，但这里需要使用制表符分隔。

是的，正如你在注释中提到的，如果你想要重命名文件，你可以执行以下操作：

import org.apache.hadoop.fs.FileSystem;
FileSystem fs = FileSystem.get(spark.sparkContext.hadoopConfiguration);
fs.rename(new Path("outputpath"), new Path("outputpath.txt"));

注意：

如果在 outputpath 下有多个文件，你可以使用 fs.globStatus，在这种情况下 coalesce(1) 会生成单个 CSV，因此不需要。
如果你使用的是 S3 而不是 HDFS，你可能需要在尝试重命名之前进行如下设置：

spark.sparkContext.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");

Option 2:

另一种选项（如果你不想使用 CSV API），可以像下面这样：

yourDf.rdd
.coalesce(1)
.map(x => x.mkString("\t"))
.saveAsTextFile("yourfile.txt");

英文:

Option 1 :

    yourDf
    .coalesce(1) // if you want to save as single file
    .write
    .option(&quot;sep&quot;, &quot;\t&quot;)
    .option(&quot;encoding&quot;, &quot;UTF-8&quot;)
    .csv(&quot;outputpath&quot;)

same as writing csv but here tab delimeter you need to use.

Yes its csv as you mentioned in the comment, if you want to rename the file you can do the below..


import org.apache.hadoop.fs.FileSystem;
FileSystem fs = FileSystem.get(spark.sparkContext.hadoopConfiguration);
fs.rename(new Path(&quot;outputpath&quot;), new Path(outputpath.txt))

Note :

you can use fs.globStatus if you have multiple file under your outputpath inthis case coalesce(1) will make single csv, hence not needed.
if you are using s3 instead of hdfs you may need to set below before attempting to rename...

spark.sparkContext.hadoopConfiguration.set(&quot;fs.s3.impl&quot;, &quot;org.apache.hadoop.fs.s3native.NativeS3FileSystem&quot;)

Option 2 :

Other option (if you don't want use csv api) could be like below

 yourDf.rdd
.coalesce(1)
.map(x =&gt; x.mkString(&quot;\t&quot;))
.saveAsTextFile(&quot;yourfile.txt&quot;)

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何使用Java将Spark DataFrame 以制表符分隔的形式写入文本文件

问题

答案1

Option 2:

Option 2 :

Java将类从Java 14转换为1.7

eclipse milo opcua客户端连接至Prosys服务器问题

安卓应用程序在新意图上持续崩溃。

只有在使用List.forEach时才会出现未处理的异常类型错误。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论