How can I write only one CSV file from Spark to S3?

Question

I have many CSV files. After processing them with Spark SQL, I want to end up with a single CSV file.

For example, I have news1.csv, news2.csv, news3.csv, and so on in S3. I read them from S3 into Spark SQL and create a DataFrame. After running my Spark SQL query, I want to upload the result to S3 as only one CSV file.
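
For context, a minimal sketch of that setup (the bucket name and glob pattern are assumptions, not details from the question): read all the news CSVs from S3 into one DataFrame and expose it to Spark SQL.

from pyspark.sql import SparkSession

# Build a session; s3a access additionally needs the hadoop-aws package
# and AWS credentials configured on the cluster.
spark = SparkSession.builder.appName("news").getOrCreate()

# A glob pattern reads news1.csv, news2.csv, news3.csv, ... as one DataFrame.
df = spark.read.option("header", "true").csv("s3a://news/news*.csv")
df.createOrReplaceTempView("news")  # now queryable via spark.sql("SELECT ... FROM news")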

First, I tried writing each CSV file with .mode("append"):

df = spark.sql(...)

df.write \
  .option("header","true") \
  .option("encoding", "UTF-8") \
  .mode("append") \
  .option("extracopyoptions", "TIMEFORMAT 'auto'") \
  .csv("s3a://news/test1")

But in this case, append did not do what I wanted: each CSV file was saved in news/test1 as its own file named part-00000..., part-00000..., part-00000....

Second, I unioned the DataFrames:

df = spark.sql(...)
df_total = df_total.union(df)

df_total.write \
  .option("header","true") \
  .option("encoding", "UTF-8") \
  .mode("append") \
  .option("extracopyoptions", "TIMEFORMAT 'auto'") \
  .csv("s3a://news/test2")

But in this case, even though I built a single DataFrame, the output in news/test2 was still split into part-00000..., part-00001..., part-00002....

How can I save just one CSV file to S3? I need your help.

Answer 1

Score: 0

You could try coalesce before writing the results, so that the number of partitions is reduced to just one:

df = spark.sql(...)

df.coalesce(1) \
  .write \
  ...
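
One caveat worth adding: even with coalesce(1), Spark still writes a directory containing a single part-00000... file rather than a file with a name of your choosing. If a fixed file name matters, one option is to rename the part file after the write through the Hadoop FileSystem API. A minimal sketch, not part of the original answer (it assumes a SparkSession named spark and a configured s3a connector; the paths are examples, and spark._jsc / spark._jvm are internal PySpark handles):

# Get a Hadoop FileSystem for the output location via the JVM gateway.
hadoop_conf = spark._jsc.hadoopConfiguration()
Path = spark._jvm.org.apache.hadoop.fs.Path

out_dir = Path("s3a://news/test1")
fs = out_dir.getFileSystem(hadoop_conf)

# Locate the lone part file in the output directory (assumes coalesce(1) ran).
part = [f.getPath() for f in fs.listStatus(out_dir)
        if f.getPath().getName().startswith("part-")][0]

# On S3 a rename is a copy-then-delete under the hood, but the result is
# one file with a predictable name.
fs.rename(part, Path("s3a://news/test1/news.csv"))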

Note that using .mode("append") will not empty the output folder; it appends the new results to it. So if you need a single CSV containing both the old and the new results, consider the union approach, but remove the append mode.
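
A hedged sketch of that suggestion (the paths are examples and the query is elided, as in the question): read the previously written results back, union them with the new ones, and write everything out as one CSV without append mode.

# Read back the old results; CSV columns come back as strings, so the two
# DataFrames must line up in column order and types before the union.
old_df = spark.read.option("header", "true").csv("s3a://news/test2")
new_df = spark.sql(...)  # the original query, elided as in the question

old_df.union(new_df) \
    .coalesce(1) \
    .write \
    .option("header", "true") \
    .mode("overwrite") \
    .csv("s3a://news/test2_merged")  # a fresh path, not the one being read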
