根据另一个DataFrame的最大日期筛选Spark Java DataFrame日期。

huangapple go评论85阅读模式
英文:

Spark java DataFrame Date filter based on max Date another DataFrame

问题

我有两个数据框架

  1. 从一个数据框架中获取日期列的最大值 :: 一个列,一行 - df1,列:maxdate

  2. 具有日期列的多个记录 :: df2列:col1,col2,col3..colDate

我想要根据 df1.maxdate 过滤 df2,df2.colDate > df1.maxdate

如果我像下面这样指定,它可以工作。

df2.filter(col("colDate").gt(lit(2020-01-01)))

但是,我无法使用 df1.maxdate。
我正在尝试使用 Java 来实现这个解决方案。

两个数据框架列的数据类型都是日期

我试图通过 Spark 转换来实现这个目标

select * from a 
where a.col > (select max(b.col) from b)

在我的示例中

表 a = df2
表 b = df1

英文:

I have two dataframes

  1. Max of date column from one dataframe :: one column, one row - df1, column : maxdate

  2. Multiple records having date column :: df2 columns : col1,col2,col3..colDate

I want filter df2 based df1.maxdate, df2.colDate > df1.maxdate

If I specify like below then its working.

df2.filter(col("colDate").gt(lit(2020-01-01)))

However, I'm not able to use df1.maxdate.
I'm trying java to achieve this soulution.

DataType is date in both dataFrame columns

I m trying to achieve this through spark transformation

select * from a 
where a.col > (select max(b.col) from b)

In my example

Table a = df2
Table b = df1

答案1

得分: 1

以下是翻译好的部分:

val df1 = Seq(("2020-01-02")).toDF("Maxate")

df1.show()

/*
+----------+
|    Maxate|
+----------+
|2020-01-02|
+----------+
*/

val df2 = Seq(("2020-01-01","A","B"),("2020-01-03","C","D")).toDF("colDate","col1","col2")

/*
+----------+----+----+
|   colDate|col1|col2|
+----------+----+----+
|2020-01-01|   A|   B|
|2020-01-03|   C|   D|
+----------+----+----+
*/
val maxDate=df1.collect.map(row=>row.getString(0)).mkString

df2.filter($"colDate">maxDate).show()

/*
+----------+----+----+
|   colDate|col1|col2|
+----------+----+----+
|2020-01-03|   C|   D|
+----------+----+----+
*/
英文:

the below code might be helpful for you,

val df1 = Seq(('2020-01-02')).toDF("Maxate")

df1.show()

/*
+----------+
|    Maxate|
+----------+
|2020-01-02|
+----------+
*/

val df2 = Seq(("2020-01-01","A","B"),("2020-01-03","C","D")).toDF("colDate","col1","col2")

/*
+----------+----+----+
|   colDate|col1|col2|
+----------+----+----+
|2020-01-01|   A|   B|
|2020-01-03|   C|   D|
+----------+----+----+
*/
val maxDate=df1.collect.map(row=>row.getString(0)).mkString

df2.filter($"colDate">maxDate).show()

/*
+----------+----+----+
|   colDate|col1|col2|
+----------+----+----+
|2020-01-03|   C|   D|
+----------+----+----+
*/

</details>



# 答案2
**得分**: 0

Sure, here are the translated parts:

**`createTempView`** on `two dataframes` then using **sql query** we can filter the only required date.

**`示例:`**

**`选项1:使用createTempView:`**

```scala
df1.show()
//+----------+
//|   Maxdate|
//+----------+
//|2020-01-01|
//+----------+

df2.show()
//+----------+----+----+
//|   colDate|col1|col2|
//+----------+----+----+
//|2020-01-01|   A|   B|
//|2020-01-03|   C|   D|
//+----------+----+----+

df1.createOrReplaceTempView("tmp")

df2.createOrReplaceTempView("tmp1")

sql("select * from tmp1 where coldate > (select maxdate from tmp)").show()
//+----------+----+----+
//|   colDate|col1|col2|
//+----------+----+----+
//|2020-01-03|   C|   D|
//+----------+----+----+

选项-2:使用dataframe变量:

另一种方法是存储到变量,然后使用该变量在dataframe中使用**filter**。

val max_val = df1.collect()(0)(0).toString

df2.filter(col("colDate") > max_val).show()
//+----------+----+----+
//|   colDate|col1|col2|
//+----------+----+----+
//|2020-01-03|   C|   D|
//+----------+----+----+

选项-3:使用dataframe crossJoin和expr:

在这种情况下,我们不创建变量,而是使用dataframe列来过滤只需要的行。

df2.crossJoin(df1).
filter(expr("colDate > Maxdate")).
drop("Maxdate").
show()
//+----------+----+----+
//|   colDate|col1|col2|
//+----------+----+----+
//|2020-01-03|   C|   D|
//+----------+----+----+
英文:

createTempView on two dataframes then using sql query we can filter the only required date.

Example:

Option1: using createTempView:

df1.show()
//+----------+
//|   Maxdate|
//+----------+
//|2020-01-01|
//+----------+

df2.show()
//+----------+----+----+
//|   colDate|col1|col2|
//+----------+----+----+
//|2020-01-01|   A|   B|
//|2020-01-03|   C|   D|
//+----------+----+----+


df1.createOrReplaceTempView(&quot;tmp&quot;)

df2.createOrReplaceTempView(&quot;tmp1&quot;)

sql(&quot;select * from tmp1 where coldate &gt; (select maxdate from tmp)&quot;).show()
//+----------+----+----+
//|   colDate|col1|col2|
//+----------+----+----+
//|2020-01-03|   C|   D|
//+----------+----+----+

Option-2:Using dataframe variable:

Another way would be storing into variable then using the variable then use the variable in dataframe filter.

val max_val=df1.collect()(0)(0).toString

df2.filter(col(&quot;colDate&quot;) &gt; max_val).show()
//+----------+----+----+
//|   colDate|col1|col2|
//+----------+----+----+
//|2020-01-03|   C|   D|
//+----------+----+----+

Option-3:Using dataframe crossJoin and expr:

In this case we are not creating variable instead using dataframe column to filter only the required rows.

df2.crossJoin(df1).
filter(expr(&quot;colDate &gt; Maxdate&quot;)).
drop(&quot;Maxdate&quot;).
show()
//+----------+----+----+
//|   colDate|col1|col2|
//+----------+----+----+
//|2020-01-03|   C|   D|
//+----------+----+----+

huangapple
  • 本文由 发表于 2020年8月11日 02:07:21
  • 转载请务必保留本文链接:https://go.coder-hub.com/63345697.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定