Join two dataframes with limiting the rows of one dataframe

Question

I have two dataframes:

df1:

    +---------+-------------------+
    |id_device|tracking_time      |
    +---------+-------------------+
    |20       |2020-02-19 02:37:45|
    |5        |2020-02-17 17:15:45|
    +---------+-------------------+

df2:

    +---------+-------------------+
    |id_device|tracking_time      |
    +---------+-------------------+
    |20       |2019-02-19 02:41:45|
    |20       |2020-01-17 17:15:45|
    +---------+-------------------+

I want to get the following output:

    +---------+-------------------+-------------------+
    |id_device|tracking_time      |df2.tracking_time  |
    +---------+-------------------+-------------------+
    |20       |2020-02-19 02:37:45|2019-02-19 02:41:45|
    |5        |2020-02-17 17:15:45|null               |
    +---------+-------------------+-------------------+

I tried the following code:

    df1.registerTempTable("data");
    df2.createOrReplaceTempView("tdays");
    Dataset<Row> d_f = sparkSession.sql(
        "select a.*, b.* from data as a "
        + "LEFT JOIN (select * from tdays) as b "
        + "on b.id_device == a.id_device and b.tracking_time < a.tracking_time");

I get the following output:

    +---------+-------------------+-----------+-------------------+
    |id_device|tracking_time      |b.id_device|b.tracking_time    |
    +---------+-------------------+-----------+-------------------+
    |20       |2020-02-19 02:37:45|20         |2019-02-19 02:41:45|
    |20       |2020-02-19 02:37:45|20         |2020-01-17 17:15:45|
    |5        |2020-02-17 17:15:45|null       |null               |
    +---------+-------------------+-----------+-------------------+

What I want is to join each row of the first dataframe with only one row of the left join result: the matching df2 row obtained by ordering on df2.tracking_time desc with limit 1.

I need your help

Answer 1

Score: 1

Before the join, you can reduce df2 to the minimum dates for each id_device:

    import org.apache.spark.sql.functions.min

    val df1 = ...
    val df2 = ...
    // One row per id_device, holding its earliest tracking_time.
    // The alias goes on the aggregated column, not on the Dataset.
    val df2min = df2.groupBy("id_device").agg(min("tracking_time").as("df2.tracking_time"))
    val result = df1.join(df2min, Seq("id_device"), "left")

df2min contains exactly one row per id_device, holding the minimum tracking_time found in df2 for that device. The left join therefore matches each df1 row at most once and returns the expected result.
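The aggregation above does not apply the question's condition b.tracking_time < a.tracking_time, and it picks the earliest date. As a rough sketch of the "ordered by df2.tracking_time desc limit 1" wording in the question, a window function can rank the matches of each df1 row and keep only the top one. The names LatestEarlierJoin, joinLatestEarlier, b_id, df2_tracking_time and rn are made up for this illustration. The sketch also assumes that (id_device, tracking_time) is unique in df1 and that the timestamps are strings in yyyy-MM-dd HH:mm:ss format, so string order matches chronological order.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object LatestEarlierJoin {
  // Hypothetical helper: for each df1 row, keep only the df2 row with the
  // greatest tracking_time that is still earlier than df1.tracking_time.
  def joinLatestEarlier(df1: DataFrame, df2: DataFrame): DataFrame = {
    // Rename df2's columns up front so the joined frame has no ambiguous names.
    val b = df2.select(
      col("id_device").as("b_id"),
      col("tracking_time").as("df2_tracking_time"))

    val joined = df1.join(
      b,
      col("id_device") === col("b_id") &&
        col("df2_tracking_time") < col("tracking_time"),
      "left")

    // Rank the matches of each df1 row, newest df2 timestamp first.
    // Assumes (id_device, tracking_time) identifies a df1 row.
    val w = Window
      .partitionBy("id_device", "tracking_time")
      .orderBy(col("df2_tracking_time").desc)

    joined
      .withColumn("rn", row_number().over(w))
      .where(col("rn") === 1) // unmatched df1 rows survive with a null match
      .drop("b_id", "rn")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("latest-earlier-join")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question, kept as strings.
    val df1 = Seq((20, "2020-02-19 02:37:45"), (5, "2020-02-17 17:15:45"))
      .toDF("id_device", "tracking_time")
    val df2 = Seq((20, "2019-02-19 02:41:45"), (20, "2020-01-17 17:15:45"))
      .toDF("id_device", "tracking_time")

    joinLatestEarlier(df1, df2).show(false)
    spark.stop()
  }
}
```

With the sample data, device 20 is paired with 2020-01-17 17:15:45, the latest df2 row that precedes its df1 timestamp, and device 5 gets null. Ordering with .asc instead of .desc would select 2019-02-19 02:41:45, which is the value shown in the question's expected output table.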

huangapple
  • Published on August 23, 2020, 18:29:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63545901.html