August 11, 2020, 02:07:21


Spark Java DataFrame: filter by date based on the max date in another DataFrame

Question


I have two DataFrames:

df1: one column, one row, holding the max of a date column from another DataFrame (column: maxdate)
df2: multiple records with a date column (columns: col1, col2, col3, ..., colDate)

I want to filter df2 based on df1.maxdate, keeping only the rows where df2.colDate > df1.maxdate.

If I hard-code the date as below, it works:

df2.filter(col("colDate").gt(lit("2020-01-01")))

However, I'm not able to use df1.maxdate in its place.
I'm trying to implement this solution in Java.

The data type is date in both DataFrame columns.

I'm trying to achieve the equivalent of the following query through Spark transformations:

select * from a 
where a.col > (select max(b.col) from b)

In my example

Table a = df2
Table b = df1

Answer 1

Score: 1


The code below might be helpful:

import spark.implicits._  // for toDF and $"..."; assumes a SparkSession named spark, as in spark-shell

val df1 = Seq(("2020-01-02")).toDF("Maxate")

df1.show()

/*
+----------+
|    Maxate|
+----------+
|2020-01-02|
+----------+
*/

val df2 = Seq(("2020-01-01","A","B"),("2020-01-03","C","D")).toDF("colDate","col1","col2")

/*
+----------+----+----+
|   colDate|col1|col2|
+----------+----+----+
|2020-01-01|   A|   B|
|2020-01-03|   C|   D|
+----------+----+----+
*/
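// Pull the single max-date value back to the driver as a String
val maxDate = df1.collect.map(row => row.getString(0)).mkString

// Keep only the rows whose colDate is later than maxDate
df2.filter($"colDate" > maxDate).show()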

/*
+----------+----+----+
|   colDate|col1|col2|
+----------+----+----+
|2020-01-03|   C|   D|
+----------+----+----+
*/
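
The sample data above stores the dates as strings, while the question states that both columns are of date type; on a date column, `getString` would not apply. The following is a minimal sketch of the same collect-then-filter idea with typed dates. The object name, the local `SparkSession`, and the sample rows are assumptions made for illustration, and it assumes df1 holds exactly one non-null row.

```scala
import java.sql.Date

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical driver object; names and data are illustrative only
object FilterByMaxDate {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder()
      .appName("FilterByMaxDate")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample data with real DateType columns (assumed, mirroring the answer's rows)
    val df1 = Seq(Date.valueOf("2020-01-02")).toDF("maxdate")
    val df2 = Seq(
      (Date.valueOf("2020-01-01"), "A", "B"),
      (Date.valueOf("2020-01-03"), "C", "D")
    ).toDF("colDate", "col1", "col2")

    // df1 has a single row, so first() is cheap; getDate keeps the value typed.
    // first() throws if df1 is empty.
    val maxDate: Date = df1.first().getDate(0)

    // Same col(...).gt(lit(...)) call chain that the question uses from Java
    df2.filter(col("colDate").gt(lit(maxDate))).show()

    spark.stop()
  }
}
```

With this sample data the filter should keep only the 2020-01-03 row. The `first()`, `getDate`, `col`, `lit`, and `gt` calls are also available from the Java API, so the two key lines should carry over to Java with `Dataset<Row>` types.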




Answer 2

Score: 0


Register both DataFrames as temporary views with createOrReplaceTempView; a SQL query can then keep only the rows with the required dates.

Example:

Option 1: Using createOrReplaceTempView:

df1.show()
//+----------+
//|   Maxdate|
//+----------+
//|2020-01-01|
//+----------+

df2.show()
//+----------+----+----+
//|   colDate|col1|col2|
//+----------+----+----+
//|2020-01-01|   A|   B|
//|2020-01-03|   C|   D|
//+----------+----+----+


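// Register both DataFrames as temporary views so they can be queried with SQL
df1.createOrReplaceTempView("tmp")

df2.createOrReplaceTempView("tmp1")

// A scalar subquery supplies the max date (assumes a SparkSession named spark)
spark.sql("select * from tmp1 where coldate > (select maxdate from tmp)").show()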
//+----------+----+----+
//|   colDate|col1|col2|
//+----------+----+----+
//|2020-01-03|   C|   D|
//+----------+----+----+

Option 2: Using a variable:

Another way is to collect the value into a variable and then use that variable in the DataFrame filter.

import org.apache.spark.sql.functions.col

// Collect the single max-date value to the driver as a String
val max_val = df1.collect()(0)(0).toString

df2.filter(col("colDate") > max_val).show()
//+----------+----+----+
//|   colDate|col1|col2|
//+----------+----+----+
//|2020-01-03|   C|   D|
//+----------+----+----+

Option 3: Using DataFrame crossJoin and expr:

In this case we do not create a variable; instead, we use the DataFrame column directly to keep only the required rows.

import org.apache.spark.sql.functions.expr

// Attach the single Maxdate row to every df2 row, filter, then drop the helper column
df2.crossJoin(df1).
filter(expr("colDate > Maxdate")).
drop("Maxdate").
show()
//+----------+----+----+
//|   colDate|col1|col2|
//+----------+----+----+
//|2020-01-03|   C|   D|
//+----------+----+----+
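
A possible variation on the crossJoin option, not taken from the answer itself, is a left semi join with the date comparison as the join condition. It returns only df2's columns, so no `drop` is needed. This is a sketch under assumptions: the object name, local `SparkSession`, and sample rows are illustrative, and df1 is assumed to hold a single row (with several rows, a df2 row is kept if it matches any of them).

```scala
import java.sql.Date

import org.apache.spark.sql.SparkSession

// Hypothetical driver object; names and data are illustrative only
object SemiJoinByMaxDate {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder()
      .appName("SemiJoinByMaxDate")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample data mirroring the tables shown in this answer (assumed DateType)
    val df1 = Seq(Date.valueOf("2020-01-01")).toDF("Maxdate")
    val df2 = Seq(
      (Date.valueOf("2020-01-01"), "A", "B"),
      (Date.valueOf("2020-01-03"), "C", "D")
    ).toDF("colDate", "col1", "col2")

    // left_semi keeps each df2 row that has at least one df1 row satisfying
    // the condition, and outputs df2's columns only
    val result = df2.join(df1, df2("colDate") > df1("Maxdate"), "left_semi")
    result.show()

    spark.stop()
  }
}
```

Like the crossJoin option, this keeps the whole computation inside Spark rather than collecting a value to the driver first.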
