2020-08-23 18:29:57

Join two dataframes while limiting the rows of one dataframe

Question

I have two dataframes:

df1:

    +---------+-------------------+
    |id_device|tracking_time      |
    +---------+-------------------+
    |20       |2020-02-19 02:37:45|
    |5        |2020-02-17 17:15:45|
    +---------+-------------------+

df2:

    +---------+-------------------+
    |id_device|tracking_time      |
    +---------+-------------------+
    |20       |2019-02-19 02:41:45|
    |20       |2020-01-17 17:15:45|
    +---------+-------------------+

I want to get the following output:

    +---------+-------------------+-------------------+
    |id_device|tracking_time      |df2.tracking_time  |
    +---------+-------------------+-------------------+
    |20       |2020-02-19 02:37:45|2019-02-19 02:41:45|
    |5        |2020-02-17 17:15:45|null               |
    +---------+-------------------+-------------------+

I tried the following code:

    df1.createOrReplaceTempView("data");
    df2.createOrReplaceTempView("tdays");
    Dataset<Row> d_f = sparkSession.sql(
        "select a.*, b.* from data as a "
        + "LEFT JOIN (select * from tdays) as b "
        + "on b.id_device == a.id_device and b.tracking_time < a.tracking_time");

I get the following output:

    +---------+-------------------+-----------+-------------------+
    |id_device|tracking_time      |b.id_device|b.tracking_time    |
    +---------+-------------------+-----------+-------------------+
    |20       |2020-02-19 02:37:45|20         |2019-02-19 02:41:45|
    |20       |2020-02-19 02:37:45|20         |2020-01-17 17:15:45|
    |5        |2020-02-17 17:15:45|null       |null               |
    +---------+-------------------+-----------+-------------------+

What I want is to join the first dataframe with only one matching row from the second: the result of the left join ordered by df2.tracking_time desc, limit 1.

I need your help

Answer 1

Score: 1

Before the join, you can reduce df2 to the minimum date for each id_device:

    import org.apache.spark.sql.functions.min

    val df1 = ...
    val df2 = ...
    // Alias the aggregated column (not the Dataset) so the result column gets the new name
    val df2min = df2.groupBy("id_device").agg(min("tracking_time").as("df2.tracking_time"))
    val result = df1.join(df2min, Seq("id_device"), "left")

df2min contains a single row per id_device, holding the minimum date from df2 for that id. The left join therefore returns exactly one row for each row of df1, which is the expected result.
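The reduction above ignores the `b.tracking_time < a.tracking_time` condition from the question. If that condition matters, one possible variant is to keep the conditional left join and then collapse the matches to the latest earlier timestamp per df1 row, which corresponds to "order by df2.tracking_time desc limit 1". The following is only a sketch under assumptions that are not in the original post: a local SparkSession, timestamps stored as `yyyy-MM-dd HH:mm:ss` strings, rows of df1 that are unique per (id_device, tracking_time), and an illustrative result column name `df2_tracking_time`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max}

// Assumption: a local session for experimentation; in spark-shell this reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("latest-earlier-row").getOrCreate()
import spark.implicits._

// Sample data from the question. The timestamps are kept as strings (assumption);
// in yyyy-MM-dd HH:mm:ss format they compare correctly in lexicographic order.
val df1 = Seq((20, "2020-02-19 02:37:45"), (5, "2020-02-17 17:15:45"))
  .toDF("id_device", "tracking_time")
val df2 = Seq((20, "2019-02-19 02:41:45"), (20, "2020-01-17 17:15:45"))
  .toDF("id_device", "tracking_time")

val a = df1.as("a")
val b = df2.as("b")

// Left join on the device id, keeping only df2 rows that are earlier than the df1 row.
val joined = a.join(
  b,
  col("a.id_device") === col("b.id_device") &&
    col("b.tracking_time") < col("a.tracking_time"),
  "left")

// Collapse the matches to one row per df1 row: the latest earlier df2 timestamp.
// Rows of df1 without any match keep a null in the aggregated column.
val result = joined
  .groupBy(col("a.id_device"), col("a.tracking_time"))
  .agg(max(col("b.tracking_time")).as("df2_tracking_time"))

result.show(false)
```

For the sample data this gives 2020-01-17 17:15:45 for device 20 and null for device 5. The expected table in the question shows 2019-02-19 02:41:45 instead, which is the earliest matching row; replacing `max` with `min` in the sketch reproduces that table while still honouring the time condition.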
