Is there any way to merge rows to fill null values in Talend Open Studio?

Question

I'm having difficulty with a job in Talend Open Studio.

My question is: how can I fill the null values in a column with the non-null values from the same column in rows that share the same key?

Suppose that I have source data like this.

EmployeeID | Part A Columns | Part B Columns | Part C Columns
EE1000001 | Part A Values | null | null
EE1000001 | null | Part B Values | null
EE1000001 | null | Part B Values | null
EE1000001 | null | null | Part C Values
EE1000001 | null | null | Part C Values
EE1000001 | null | null | Part C Values
EE1000002 | Part A Values | null | null
EE1000002 | null | Part B Values | null
EE1000002 | null | null | Part C Values

And I'd like to get a result like the following:

EmployeeID | Part A Columns | Part B Columns | Part C Columns
EE1000001 | Part A Values | Part B Values | Part C Values
EE1000001 | null | Part B Values | Part C Values
EE1000001 | null | null | Part C Values
EE1000002 | Part A Values | Part B Values | Part C Values

I've tried several ways to solve this, but I couldn't find one that works.

If you have an idea, please help me.

Added:

More intuitive example

Each key might have multiple values for the same column. They should not be combined in one row with commas, like "C-1, C-2, C-3". Instead, they should be filled in from the first row of that key downward.

This is why the first ID has three rows while the second one has only one row.

Answer 1

Score: 0

Use a tMap with a coalesce-like expression. In the tMap you can join the two datasets (by default it does a left join, which is what you want here), and then use this expression:

A == null ? B : A

to get what you need.
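As a rough illustration of what that expression does, here is a minimal plain-Java sketch. The class name `CoalesceSketch` and the helper `coalesce` are hypothetical and are not part of Talend. The ternary in `main` is the same shape as the tMap expression above, and the helper just generalizes it to more than two values.

```java
public class CoalesceSketch {
    // Hypothetical helper: returns the first non-null argument, or null if all are null.
    @SafeVarargs
    public static <T> T coalesce(T... values) {
        for (T value : values) {
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String a = null;            // value from the main flow
        String b = "Part B Values"; // value from the joined (lookup) flow

        // Same shape as the tMap expression: A == null ? B : A
        System.out.println(a == null ? b : a); // prints Part B Values
        System.out.println(coalesce(a, b));    // prints Part B Values
    }
}
```

Note that this only picks one value per joined pair of rows. On its own it does not produce the multi-row layout asked for in the question.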

Answer 2

Score: 0

I figured out one solution myself, so I'm sharing it.

The keys to the solution are the tDenormalize component and an additional key value for each row.

If you use tDenormalize alone, without an additional key column, you get multiple values in a single column of a single row, separated by the delimiter you specified. As I said in the question, that is not what I want.

To get exactly the result I asked for in the question, give the rows an additional key value.

I did something like this as a pre-job step:

row2.tmpKey = Numeric.sequence(row1.EmployeeID + "PartA", 1, 1);

So, the raw data would be like:

EE_ID,ColumnA,ColumnB,ColumnC,TmpKey
EE001,Part A value,null,null,1
EE001,null,Part B value,null,1
EE001,null,Part B value,null,2
EE001,null,null,Part C value,1
...
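To make the numbering in the sample data concrete, here is a plain-Java sketch of how such a TmpKey could be produced. It rests on an assumption the answer does not state: one counter per combination of EE_ID and value column, advanced only by rows where that column is non-null. This is one reading that reproduces the 1, 1, 2, 1 pattern shown. The class `TmpKeySketch`, the method `assignTmpKeys`, and the "PartB", "PartC", and "None" suffixes are hypothetical. The map stands in for a named sequence that starts at 1 and steps by 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TmpKeySketch {
    // Hypothetical suffixes, one per value column, used to name the counters.
    private static final String[] PARTS = {"PartA", "PartB", "PartC"};

    // Each row is assumed to be {EE_ID, ColumnA, ColumnB, ColumnC}.
    public static List<Integer> assignTmpKeys(List<String[]> rows) {
        Map<String, Integer> counters = new HashMap<>();
        List<Integer> keys = new ArrayList<>();
        for (String[] row : rows) {
            // Fallback counter name for a row whose value columns are all null.
            String name = row[0] + "None";
            for (int i = 0; i < PARTS.length; i++) {
                if (row[i + 1] != null) {
                    name = row[0] + PARTS[i];
                    break;
                }
            }
            // Emulates a sequence that starts at 1 and increases by 1 per call.
            int next = counters.getOrDefault(name, 1);
            counters.put(name, next + 1);
            keys.add(next);
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"EE001", "Part A value", null, null},
            new String[] {"EE001", null, "Part B value", null},
            new String[] {"EE001", null, "Part B value", null},
            new String[] {"EE001", null, null, "Part C value"});
        System.out.println(assignTmpKeys(rows)); // prints [1, 1, 2, 1]
    }
}
```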

Then set "To denormalize columns" to ColumnA, ColumnB, ColumnC in the Basic settings of the tDenormalize component view.
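The result this step is meant to produce can be sketched in plain Java as a group-by on (EE_ID, TmpKey) that keeps the first non-null value of each column. This is not tDenormalize's actual implementation, only an illustration of the intended output. The class `MergeByTmpKeySketch` and the method `merge` are hypothetical, and each row is assumed to be {EE_ID, ColumnA, ColumnB, ColumnC, TmpKey}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergeByTmpKeySketch {
    // Merges rows that share EE_ID and TmpKey, keeping the first non-null value per column.
    public static List<String[]> merge(List<String[]> rows) {
        // LinkedHashMap keeps groups in the order their first row appeared.
        Map<String, String[]> groups = new LinkedHashMap<>();
        for (String[] row : rows) {
            String groupKey = row[0] + "|" + row[4];
            String[] merged = groups.get(groupKey);
            if (merged == null) {
                merged = new String[] {row[0], null, null, null};
                groups.put(groupKey, merged);
            }
            for (int i = 1; i <= 3; i++) {
                if (merged[i] == null) {
                    merged[i] = row[i];
                }
            }
        }
        return new ArrayList<>(groups.values());
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"EE001", "Part A value", null, null, "1"},
            new String[] {"EE001", null, "Part B value", null, "1"},
            new String[] {"EE001", null, "Part B value", null, "2"},
            new String[] {"EE001", null, null, "Part C value", "1"});
        for (String[] row : merge(rows)) {
            System.out.println(Arrays.toString(row));
        }
        // prints:
        // [EE001, Part A value, Part B value, Part C value]
        // [EE001, null, Part B value, null]
    }
}
```

Rows with TmpKey 1 collapse into one fully populated row, while the second Part B value stays on its own row, which matches the layout requested in the question.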

huangapple
  • Posted on January 3, 2020 at 20:50:13
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59578958.html