How to do two step sorting based on condition in dataflow ADF
Question
I am working on an ADF data flow and trying to implement two-step sorting logic on my source data. I have one primary key and two date columns:
Example: id, date1 and date2
Requirement:
(i) If the source file has duplicate rows with the same id, the row with the maximum date1 should be picked.
(ii) If the duplicate rows with the same id also have the same date1, the row with the maximum date2 should be picked and sent to the output.
I tried applying two sorts one after another in the Aggregate stage, but I am currently getting random values from the duplicate rows, which is wrong.
Can anyone help me meet this requirement? Thank you.
Answer 1
Score: 1
To get the maximum date1 for each id, and the maximum date2 for each id and date1 combination, first compute the maximum date2 for each id and date1 pair, then find the maximum date1 for each id. The detailed approach is below.
- Use a Source transformation to read the data from your source file.
Sample input
| id | date1 | date2 |
|---|---|---|
| 1 | 2023-01-01 | 2023-01-03 |
| 1 | 2023-01-02 | 2023-01-02 |
| 2 | 2023-01-02 | 2023-01-01 |
| 2 | 2023-01-02 | 2023-01-02 |
The sample input has three columns: id, date1, and date2.
- Use an Aggregate transformation to group the data by the id and date1 columns and calculate the maximum value of date2 for each group. This will ensure that for each id and date1 combination, you get the maximum value of date2. You can use the following expression in the Aggregate transformation:
groupBy(id, date1), date2 = max(date2)
The output of this transformation will have three columns: id, date1, and date2 (max value of date2).
| id | date1 | date2 |
|---|---|---|
| 1 | 2023-01-01 | 2023-01-03 |
| 1 | 2023-01-02 | 2023-01-02 |
| 2 | 2023-01-02 | 2023-01-02 |
- Use another Aggregate transformation to group the data by the id column and calculate the maximum value of date1 for each id group. This will ensure that for each id, you get the maximum value of date1. You can use the following expression in the Aggregate transformation:
groupBy(id), date1 = max(date1)
| id | date1 |
|---|---|
| 1 | 2023-01-02 |
| 2 | 2023-01-02 |
- Then use the Join transformation to join the output of the two Aggregate transformations based on the id and date1 columns.
- Use the Select transformation to select the id, date1, and date2 columns from the output of the Join transformation and remove the duplicate fields.
| id | date1 | date2 |
|---|---|---|
| 1 | 2023-01-02 | 2023-01-02 |
| 2 | 2023-01-02 | 2023-01-02 |
This will ensure that you get the rows that satisfy your two-step sorting logic.
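To sanity-check the logic outside of ADF, the aggregate, aggregate, join sequence can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `dedupe` and the dictionary-based row layout are hypothetical and not part of ADF, and the join is assumed to be an inner join on id and date1.

```python
from datetime import date

# Sample input from the answer (row layout is an assumption for this sketch).
rows = [
    {"id": 1, "date1": date(2023, 1, 1), "date2": date(2023, 1, 3)},
    {"id": 1, "date1": date(2023, 1, 2), "date2": date(2023, 1, 2)},
    {"id": 2, "date1": date(2023, 1, 2), "date2": date(2023, 1, 1)},
    {"id": 2, "date1": date(2023, 1, 2), "date2": date(2023, 1, 2)},
]


def dedupe(rows):
    """Hypothetical helper mirroring the two Aggregates plus the Join."""
    # First aggregate: group by (id, date1), keep max(date2).
    max_date2 = {}
    for r in rows:
        key = (r["id"], r["date1"])
        if key not in max_date2 or r["date2"] > max_date2[key]:
            max_date2[key] = r["date2"]

    # Second aggregate: group by id, keep max(date1).
    max_date1 = {}
    for r in rows:
        if r["id"] not in max_date1 or r["date1"] > max_date1[r["id"]]:
            max_date1[r["id"]] = r["date1"]

    # Join on (id, date1): each id's max date1 picks exactly one
    # (id, date1) group from the first aggregate.
    return [
        {"id": i, "date1": d1, "date2": max_date2[(i, d1)]}
        for i, d1 in sorted(max_date1.items())
    ]


result = dedupe(rows)
for r in result:
    print(r["id"], r["date1"], r["date2"])
```

On the sample input this yields one row per id, matching the final table above. Note that only id, date1, and date2 survive this approach; any other source columns would need to be joined back separately.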
Answer 2
Score: 0
Use a Sort transformation to sort the data based on the first condition.
Connect the Sort transformation to a Conditional Split transformation.
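This answer does not say what follows the sort. As a rough illustration of a sort-based variant, and not necessarily what the answer intends, the Python sketch below sorts by date1 and date2 together in descending order and keeps the first row seen for each id. The function name `latest_per_id` and the row layout are hypothetical, and nothing here is ADF syntax.

```python
from datetime import date

# Same sample rows as in the question's scenario (layout is an assumption).
rows = [
    {"id": 1, "date1": date(2023, 1, 1), "date2": date(2023, 1, 3)},
    {"id": 1, "date1": date(2023, 1, 2), "date2": date(2023, 1, 2)},
    {"id": 2, "date1": date(2023, 1, 2), "date2": date(2023, 1, 1)},
    {"id": 2, "date1": date(2023, 1, 2), "date2": date(2023, 1, 2)},
]


def latest_per_id(rows):
    """Hypothetical helper: one composite sort, then first row per id."""
    # Sort on the pair (date1, date2) so date2 only breaks ties in date1.
    ordered = sorted(rows, key=lambda r: (r["date1"], r["date2"]), reverse=True)

    picked = {}
    for r in ordered:
        # setdefault keeps the first (largest) row seen for each id.
        picked.setdefault(r["id"], r)

    return [picked[i] for i in sorted(picked)]


winners = latest_per_id(rows)
for r in winners:
    print(r["id"], r["date1"], r["date2"])
```

The point of the sketch is that the two sort keys are applied in a single sort rather than as two consecutive sorts, so date2 only decides between rows whose date1 is equal.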