How can I write only one CSV file from Spark to S3?

Question

I have many CSV files. After processing them with Spark SQL, I want to produce a single CSV file.

For example, I have news1.csv, news2.csv, news3.csv, and so on in S3. I load them from S3 into Spark SQL and create a DataFrame. After running the Spark SQL query, I want to upload the result back to S3 as only one CSV file.

First, I tried writing each CSV file with mode("append"):

df = spark.sql(...)

df.write \
  .option("header","true") \
  .option("encoding", "UTF-8") \
  .mode("append") \
  .option("extracopyoptions", "TIMEFORMAT 'auto'") \
  .csv("s3a://news/test1")

But in this case, append does not do what I expected: each write saved its own file under news/test1, named part-00000..., part-00000..., part-00000....

Second, I unioned the DataFrames:

df = spark.sql(...)
df_total = df_total.union(df)

df_total.write \
  .option("header","true") \
  .option("encoding", "UTF-8") \
  .mode("append") \
  .option("extracopyoptions", "TIMEFORMAT 'auto'") \
  .csv("s3a://news/test2")

But in this case, even though I built a single DataFrame, the output under news/test2 was still split into several files named part-00000..., part-00001..., part-00002....
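For context, Spark writes one part-... file per partition of the DataFrame, so the number of output files can be predicted before writing. A minimal check, assuming the df_total from the snippet above:

# Each partition becomes one part-... output file, so this count
# predicts how many CSV files the write above will produce.
print(df_total.rdd.getNumPartitions())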

How can I save only one CSV file to S3? I need your help.


Answer 1

Score: 0

You could try calling coalesce before writing the results, so that the number of partitions is reduced to only 1:

df = spark.sql(...)

df.coalesce(1) \
  .write \
  ...

Note that using .mode("append") won't empty the output folder; it appends the new results to whatever is already there. So if you need a single CSV containing both the old and the new results, consider the union option, but remove the append mode.
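For reference, here is a minimal end-to-end sketch of the suggested approach. The input pattern, output path, and app name are placeholder assumptions based on the question, not values from the answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("single-csv").getOrCreate()

# Hypothetical input: read all source CSVs into a single DataFrame.
df = spark.read.option("header", "true").csv("s3a://news/news*.csv")

# coalesce(1) merges all partitions into one without a full shuffle,
# so the writer emits exactly one part-... file under the target path.
df.coalesce(1) \
  .write \
  .option("header", "true") \
  .option("encoding", "UTF-8") \
  .mode("overwrite") \
  .csv("s3a://news/test2")

Two caveats: Spark still writes a directory (news/test2/) containing one part-... file plus a _SUCCESS marker, so if you need a single file with an exact name you have to rename or copy it afterwards, for example with the AWS CLI or boto3. And because coalesce(1) funnels all data through a single task, it is only practical when the result fits comfortably on one executor. The same pattern applies if you first union the old and new DataFrames and then write once with mode("overwrite").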

