Spark Java DataFrame date filter based on the max date from another DataFrame

# Question

I have two DataFrames:

- df1: the max of a date column taken from one DataFrame (one column, one row); column: maxdate
- df2: multiple records with a date column; columns: col1, col2, col3, ..., colDate

I want to filter df2 based on df1.maxdate, i.e. keep the rows where df2.colDate > df1.maxdate.

If I specify a literal like below, it works:

df2.filter(col("colDate").gt(lit("2020-01-01")))

However, I'm not able to use df1.maxdate in the same way. I'm trying to implement this solution in Java, and the data type of both DataFrame columns is date.

What I'm trying to achieve through a Spark transformation is, in SQL terms:

select * from a
where a.col > (select max(b.col) from b)

In my example:
Table a = df2
Table b = df1
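For reference, a minimal sketch of how that working literal filter looks with Spark's Java Dataset API (the method name and wrapper are only illustrative; df2 is assumed to be the DataFrame described above):

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Illustrative helper; assumes it lives in a class where df2 is available.
static Dataset<Row> filterByLiteral(Dataset<Row> df2) {
    // The date must be passed as a quoted string (or a java.sql.Date), not a bare 2020-01-01.
    return df2.filter(col("colDate").gt(lit("2020-01-01")));
}
```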
# Answer 1

**Score**: 1

The code below might be helpful for you:
val df1 = Seq(("2020-01-02")).toDF("Maxate")
df1.show()
/*
+----------+
|    Maxate|
+----------+
|2020-01-02|
+----------+
*/

val df2 = Seq(("2020-01-01","A","B"),("2020-01-03","C","D")).toDF("colDate","col1","col2")
/*
+----------+----+----+
|   colDate|col1|col2|
+----------+----+----+
|2020-01-01|   A|   B|
|2020-01-03|   C|   D|
+----------+----+----+
*/

val maxDate = df1.collect.map(row => row.getString(0)).mkString
df2.filter($"colDate" > maxDate).show()
/*
+----------+----+----+
|   colDate|col1|col2|
+----------+----+----+
|2020-01-03|   C|   D|
+----------+----+----+
*/
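A Java sketch of the same collect-then-filter idea (assuming df1 holds a single row with the max date, as above; the method name is illustrative):

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Pull the single max-date value to the driver, then reuse it as a filter literal.
// If the column is a true DATE type, use maxRow.getDate(0) instead of getString(0).
static Dataset<Row> filterByCollectedMax(Dataset<Row> df1, Dataset<Row> df2) {
    Row maxRow = df1.first();              // df1 has one column and one row
    String maxDate = maxRow.getString(0);  // e.g. "2020-01-02"
    return df2.filter(col("colDate").gt(lit(maxDate)));
}
```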
# Answer 2

**Score**: 0

Create a temp view (**`createTempView`**) on each of the two DataFrames, then filter only the required dates with a **SQL query**.

**Example:**

**Option 1: using createTempView:**

df1.show()
//+----------+
//|   Maxdate|
//+----------+
//|2020-01-01|
//+----------+

df2.show()
//+----------+----+----+
//|   colDate|col1|col2|
//+----------+----+----+
//|2020-01-01|   A|   B|
//|2020-01-03|   C|   D|
//+----------+----+----+

df1.createOrReplaceTempView("tmp")
df2.createOrReplaceTempView("tmp1")

sql("select * from tmp1 where coldate > (select maxdate from tmp)").show()
//+----------+----+----+
//|   colDate|col1|col2|
//+----------+----+----+
//|2020-01-03|   C|   D|
//+----------+----+----+
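The same temp-view approach in Java might look like this (a sketch; `spark` is an existing SparkSession, df1/df2 are the DataFrames above, and the method name is illustrative):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Register both DataFrames as temp views and let Spark SQL evaluate the scalar subquery.
static Dataset<Row> filterWithSql(SparkSession spark, Dataset<Row> df1, Dataset<Row> df2) {
    df1.createOrReplaceTempView("tmp");   // single-row view holding Maxdate
    df2.createOrReplaceTempView("tmp1");  // records to be filtered
    return spark.sql("select * from tmp1 where colDate > (select Maxdate from tmp)");
}
```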
**Option 2: using a DataFrame variable:**

Another way is to store the max date in a variable and then use that variable in a DataFrame **filter**.

val max_val = df1.collect()(0)(0).toString
df2.filter(col("colDate") > max_val).show()
//+----------+----+----+
//|   colDate|col1|col2|
//+----------+----+----+
//|2020-01-03|   C|   D|
//+----------+----+----+
**Option 3: using DataFrame crossJoin and expr:**

In this case we do not create a variable; instead we use the DataFrame column itself to filter only the required rows.

df2.crossJoin(df1).
  filter(expr("colDate > Maxdate")).
  drop("Maxdate").
  show()
//+----------+----+----+
//|   colDate|col1|col2|
//+----------+----+----+
//|2020-01-03|   C|   D|
//+----------+----+----+
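And a Java sketch of the crossJoin plus expr variant (again assuming df1/df2 as above; the method name is illustrative):

```java
import static org.apache.spark.sql.functions.expr;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Attach the single Maxdate value to every df2 row, filter on it, then drop the helper column.
static Dataset<Row> filterWithCrossJoin(Dataset<Row> df1, Dataset<Row> df2) {
    return df2.crossJoin(df1)
              .filter(expr("colDate > Maxdate"))
              .drop("Maxdate");
}
```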