Why is my file generated in ADLS with a name like 'part-00000-6feffff6-eef6-41ec-9d' even though I specified the filename in ADF?

Question

I used a Data Flow in ADF to filter the columns of a dataset and store the output in a file named filename.csv, as you can see in the picture.

But in ADLS a new file was generated with this name: part-00000-6feffff6-eef6-41ec-9da8-c10e671923df-


Answer 1 (score: 1)


I agree with @Gal Weiss that Data Flow follows the Spark way of writing files.

Adding to that answer: if you want the output written to a single file, go to Sink settings -> File name option -> Output to single file, and enter your file name there.

But for this, we need to set the partitioning to Single partition, which slows down the Data Flow execution.
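The trade-off exists because each partition is written by its own parallel task, so one output file means one writer. The idea can be sketched in plain Python (illustrative only; `write_partitioned` is a made-up helper, not an ADF or Spark API):

```python
import os
import tempfile

def write_partitioned(rows, num_partitions, out_dir):
    # Each "task" writes its own part file, because parallel writers
    # cannot safely append to one shared file. This mimics, in a very
    # simplified way, Spark's part-NNNNN output naming.
    os.makedirs(out_dir, exist_ok=True)
    for i in range(num_partitions):
        chunk = rows[i::num_partitions]  # round-robin split, for illustration
        with open(os.path.join(out_dir, f"part-{i:05d}.csv"), "w") as f:
            f.writelines(line + "\n" for line in chunk)

rows = [f"row{i}" for i in range(10)]

# With 3 partitions you get 3 part files; the name given in the sink
# ("filename.csv") becomes the folder that holds them.
out = os.path.join(tempfile.mkdtemp(), "filename.csv")
write_partitioned(rows, num_partitions=3, out_dir=out)
print(sorted(os.listdir(out)))
# ['part-00000.csv', 'part-00001.csv', 'part-00002.csv']

# "Output to single file" corresponds to forcing a single partition,
# i.e. one writer and therefore one file.
single = os.path.join(tempfile.mkdtemp(), "filename.csv")
write_partitioned(rows, num_partitions=1, out_dir=single)
print(sorted(os.listdir(single)))
# ['part-00000.csv']
```

This is also why the single-file option is slower: all rows funnel through one writer instead of being written in parallel.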

This will write the output to a single file in the same target folder.


Answer 2 (score: 0)


This is a partition file.
I'm not familiar with how your data flow works,
but if it uses Spark or Hadoop underneath, this is how files are saved:
the data is split into multiple partitions so it can be processed in parallel.
In this case the name you assign acts like a directory name, and the actual data lives in the "part" files underneath it.
That is fine: if you read the dataset back using the same file name, the Spark/Hadoop filesystem knows to look for the "part" files under it.
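The file name in the question follows Spark's part-file pattern: a zero-padded partition index followed by a unique id for the write job. A quick Python check against the (truncated) name from the question; the pattern used here is an assumption based on typical Spark output, not an official specification:

```python
import re
import uuid

# Truncated name exactly as shown in the question.
name = "part-00000-6feffff6-eef6-41ec-9da8-c10e671923df-"

# Assumed pattern: "part-" + 5-digit partition index + "-" + a UUID.
m = re.match(r"part-(\d{5})-([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})", name)
partition_index = int(m.group(1))
job_id = uuid.UUID(m.group(2))  # raises ValueError if not a valid UUID

print(partition_index)  # 0 -> the first (and here only) partition
```

So `part-00000-...` simply means "partition number 0 of this write job"; with more partitions you would also see `part-00001`, `part-00002`, and so on.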

Lastly, the reason you have only one file is either the way the data was partitioned before being written, or that your dataset is very small and the default configuration uses only one partition.

huangapple
  • Posted on 2023-07-18 03:22:41
  • When reposting, please keep the original link: https://go.coder-hub.com/76707529.html