Mapping data flow allows duplicate records when using UPSERT
Question
I am using Synapse pipelines and a mapping data flow to process multiple daily files residing in ADLS which represent incremental inserts and updates for any given primary key column. Each daily physical file has ONLY one instance for any given primary key value. Keys/rows are unique within a daily file, but the same key value can exist in the files from multiple days, where attributes related to that key column changed over time. All rows flow to the Upsert condition as shown in the screenshot.
The sink is a Synapse table, where a primary key can only be specified with the non-enforced primary key syntax, which can be seen below.
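The exact table definition was shown in a screenshot; a minimal sketch of that syntax for a Synapse dedicated SQL pool, using hypothetical table and column names, would look roughly like this:

```sql
-- Hypothetical sink table (names are illustrative, not from the original post).
-- In a Synapse dedicated SQL pool a primary key can only be declared as
-- NONCLUSTERED and NOT ENFORCED, so the engine itself never rejects
-- duplicate key values.
CREATE TABLE dbo.DailyTarget
(
    KeyColumn  INT PRIMARY KEY NONCLUSTERED NOT ENFORCED,
    Attribute1 NVARCHAR(100) NULL,
    LoadDate   DATE NULL
)
WITH (DISTRIBUTION = HASH(KeyColumn), CLUSTERED COLUMNSTORE INDEX);
```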
Best practice with mapping data flows is to avoid placing a mapping data flow within a ForEach activity to process each file individually, as this spins up a new cluster for each file, which takes forever and gets expensive. Instead, I have configured the mapping data flow source to use a wildcard path to process all files at once, with a sort by file name to ensure they are ordered correctly within a single data flow (avoiding the ForEach activity for each file).
Under this configuration, a single data flow looking at multiple daily files can definitely expect the same key column to exist on multiple rows. When the empty target table is first loaded from all the daily files, we get multiple rows showing up for any single key column value instead of a single INSERT for the first one and updates for the remaining ones it sees (essentially never doing any UPDATES).
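A quick way to confirm this behaviour is to count how many rows each key value ends up with in the sink after the load; a small check against the hypothetical table sketched above:

```sql
-- After loading all daily files, a correctly working upsert should leave
-- exactly one row per key; any rows returned here are duplicates.
SELECT KeyColumn, COUNT(*) AS RowsPerKey
FROM dbo.DailyTarget
GROUP BY KeyColumn
HAVING COUNT(*) > 1
ORDER BY RowsPerKey DESC;
```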
The only way I avoid duplicate rows by the key column is to process each file individually and execute a mapping data flow for each file within a for each activity. Does anyone have any approach that would avoid duplicates while processing all files within a single mapping data flow without a foreach activity for each file?
Answer 1
Score: 0
> Does anyone have any approach that would avoid duplicates while processing all files within a single mapping data flow without a foreach activity for each file?
AFAIK, there is no other way than using a ForEach loop to process the files one by one.
When we use a wildcard path, the data flow picks up all the matching files in one go, so the same key values arrive from different files (as shown in the screenshot below).
Using an Alter Row condition will upsert rows correctly only when there is a single file; because you are using multiple files, it creates duplicate records, as described in Leon Yue's answer to this similar question.
As the scenario explains, you have the same key values in multiple files and want to avoid duplicates. To avoid this, you have to iterate over each file and then run the data flow on that file so that duplicates are not upserted.