Join two dataframes while limiting the rows of one dataframe

Question


I have two dataframes:

    df1:
    +---------+-------------------+
    |id_device|tracking_time      |
    +---------+-------------------+
    |20       |2020-02-19 02:37:45|
    |5        |2020-02-17 17:15:45|
    +---------+-------------------+

    df2:
    +---------+-------------------+
    |id_device|tracking_time      |
    +---------+-------------------+
    |20       |2019-02-19 02:41:45|
    |20       |2020-01-17 17:15:45|
    +---------+-------------------+

I want to get the following output:

    +---------+-------------------+-------------------+
    |id_device|tracking_time      |df2.tracking_time  |
    +---------+-------------------+-------------------+
    |20       |2020-02-19 02:37:45|2019-02-19 02:41:45|
    |5        |2020-02-17 17:15:45|null               |
    +---------+-------------------+-------------------+

I tried the following code:

    // Expose both dataframes to Spark SQL under temporary view names.
    df1.createOrReplaceTempView("data");
    df2.createOrReplaceTempView("tdays");
    // Left join on the device id, keeping only df2 rows older than the df1 row.
    Dataset<Row> d_f = sparkSession.sql(
        "select a.*, b.* from data as a "
        + "LEFT JOIN (select * from tdays) as b "
        + "on b.id_device = a.id_device and b.tracking_time < a.tracking_time");

I get the following output:

    +---------+-------------------+-----------+-------------------+
    |id_device|tracking_time      |b.id_device|b.tracking_time    |
    +---------+-------------------+-----------+-------------------+
    |20       |2020-02-19 02:37:45|20         |2019-02-19 02:41:45|
    |20       |2020-02-19 02:37:45|20         |2020-01-17 17:15:45|
    |5        |2020-02-17 17:15:45|null       |null               |
    +---------+-------------------+-----------+-------------------+

What I want is to join the first dataframe with the result of the left join, ordered by df2.tracking_time desc and limited to 1 row.

I need your help.

Answer 1

Score: 1


Before the join, you can reduce df2 to the minimum tracking_time for each id_device:

    import org.apache.spark.sql.functions.min

    val df1 = ...
    val df2 = ...
    // Alias the aggregated column (not the Dataset) so it gets the expected name.
    val df2min = df2.groupBy("id_device").agg(min("tracking_time").as("df2.tracking_time"))
    val result = df1.join(df2min, Seq("id_device"), "left")

df2min contains exactly one row per id_device, holding the minimum tracking_time from df2 for that device. The left join therefore returns each df1 row exactly once, with null where df2 has no row for the device.
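To make the reduce-then-join idea concrete outside Spark, here is a minimal plain-Java sketch of the same two steps on in-memory lists. It is only an illustration, not Spark code: the class and member names (ReduceThenLeftJoin, Track, Joined, minPerDevice, leftJoin) are hypothetical and appear nowhere in the question. Like the Scala snippet above, it does not apply the `b.tracking_time < a.tracking_time` condition from the question.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: in-memory stand-ins for df1, df2 and the joined result.
final class ReduceThenLeftJoin {

    // One input row: a device id and a timestamp.
    static final class Track {
        final int idDevice;
        final LocalDateTime trackingTime;

        Track(int idDevice, LocalDateTime trackingTime) {
            this.idDevice = idDevice;
            this.trackingTime = trackingTime;
        }
    }

    // One output row; df2TrackingTime is null when df2 has no row for the device.
    static final class Joined {
        final int idDevice;
        final LocalDateTime trackingTime;
        final LocalDateTime df2TrackingTime;

        Joined(int idDevice, LocalDateTime trackingTime, LocalDateTime df2TrackingTime) {
            this.idDevice = idDevice;
            this.trackingTime = trackingTime;
            this.df2TrackingTime = df2TrackingTime;
        }
    }

    // Step 1 (groupBy + min): keep only the earliest timestamp per device.
    static Map<Integer, LocalDateTime> minPerDevice(List<Track> df2) {
        Map<Integer, LocalDateTime> earliest = new HashMap<>();
        for (Track t : df2) {
            earliest.merge(t.idDevice, t.trackingTime, (a, b) -> a.isBefore(b) ? a : b);
        }
        return earliest;
    }

    // Step 2 (left join): each df1 row is emitted exactly once, matched or not.
    static List<Joined> leftJoin(List<Track> df1, Map<Integer, LocalDateTime> df2min) {
        List<Joined> out = new ArrayList<>();
        for (Track t : df1) {
            out.add(new Joined(t.idDevice, t.trackingTime, df2min.get(t.idDevice)));
        }
        return out;
    }
}
```

With the sample data from the question, device 20 is paired with 2019-02-19 02:41:45 and device 5 with null. Replacing `isBefore` with `isAfter` in the merge function would keep the latest timestamp per device instead, which is closer to the "order by desc limit 1" wording in the question.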

huangapple
  • Posted on August 23, 2020 at 18:29:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63545901.html