Join two dataframes with limiting the rows of one dataframe

Question
I have two dataframes:
df1:
+--------------+---------------------+
|id_device |tracking_time |
+--------------+---------------------+
|20 |2020-02-19 02:37:45 |
|5 |2020-02-17 17:15:45 |
+--------------+---------------------+
df2:
+--------------+----------------------+
|id_device |tracking_time |
+--------------+----------------------+
|20 | 2019-02-19 02:41:45 |
|20 |2020-01-17 17:15:45 |
+--------------+----------------------+
I want to get the following output :
+--------------+---------------------+------------------+
|id_device |tracking_time | df2.tracking_time |
+--------------+---------------------+------------------+
|20 |2020-02-19 02:37:45 |2019-02-19 02:41:45|
|5 |2020-02-17 17:15:45 |null |
+--------------+---------------------+------------------+
I tried the following code:
df1.registerTempTable("data");
df2.createOrReplaceTempView("tdays");
Dataset<Row> d_f = sparkSession.sql("select a.* , b.* from data as a LEFT JOIN (select * from tdays ) as b on b.id_device == a.id_device and b.tracking_time < a.tracking_time ");
I get the following output :
+--------------+---------------------+--------------+--------------------+
|id_device     |tracking_time        | b.id_device  |b.tracking_time     |
+--------------+---------------------+--------------+--------------------+
|20            |2020-02-19 02:37:45  |20            | 2019-02-19 02:41:45|
|20            |2020-02-19 02:37:45  |20            | 2020-01-17 17:15:45|
|5             |2020-02-17 17:15:45  |null          |null                |
+--------------+---------------------+--------------+--------------------+
What I want is to join the first dataframe with the result of the left join, ordered by df2.tracking_time desc and limited to one row per id_device.
I need your help.
Answer 1
Score: 1
Before the join, you can reduce df2 to the minimum date for each id_device:

import org.apache.spark.sql.functions.min

val df1 = ...
val df2 = ...
// keep only the earliest tracking_time per id_device
val df2min = df2.groupBy("id_device")
  .agg(min("tracking_time").as("df2_tracking_time"))
val result = df1.join(df2min, Seq("id_device"), "left")

df2min contains only a single row per id_device, holding the minimum date from df2. Therefore the left join returns the expected result.
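To see why this works without spinning up Spark, here is a plain-Java sketch of the same reduce-then-join idea. It uses String[] rows and a Map in place of DataFrames; the class and method names are illustrative, not part of any Spark API. The "yyyy-MM-dd HH:mm:ss" timestamp format sorts correctly as text, so plain String comparison finds the earliest date.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MinJoinSketch {
    // Row layout (an assumption for this sketch): [id_device, tracking_time].
    // Step 1: reduce df2 to the minimum tracking_time per id_device.
    // Step 2: left-join df1 against that map; unmatched ids get null.
    static List<String[]> leftJoinWithMin(List<String[]> df1, List<String[]> df2) {
        Map<String, String> df2min = df2.stream().collect(Collectors.toMap(
                r -> r[0],                                   // key: id_device
                r -> r[1],                                   // value: tracking_time
                (a, b) -> a.compareTo(b) <= 0 ? a : b));     // keep the minimum
        return df1.stream()
                .map(r -> new String[] { r[0], r[1], df2min.get(r[0]) })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String[]> df1 = List.of(
                new String[] { "20", "2020-02-19 02:37:45" },
                new String[] { "5",  "2020-02-17 17:15:45" });
        List<String[]> df2 = List.of(
                new String[] { "20", "2019-02-19 02:41:45" },
                new String[] { "20", "2020-01-17 17:15:45" });
        for (String[] row : leftJoinWithMin(df1, df2)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Because the aggregation leaves at most one row per id_device, the join can never fan out into duplicate rows, which is exactly what went wrong in the original LEFT JOIN.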