Is it faster to cast within filter() or cast new withColumn(), then filter in Spark?


Question


Pretty straightforward. Is it faster to run

df.filter(col("date_col").cast("timestamp") >= lit(my_timestamp) )

or

df.withColumn("date_col_cast_timestamp", col("date_col").cast("timestamp")) \
  .filter(col("date_col_cast_timestamp") >= lit(my_timestamp))

or one more...

df.withColumn("date_col_cast_timestamp", col("date_col").cast("timestamp")) \
  .withColumn("my_timestamp", lit(my_timestamp)) \
  .filter(col("date_col_cast_timestamp") >= col("my_timestamp"))

If you could explain why, that would be greatly appreciated. I'm not the best at understanding Spark, and when I tried doing a filter(cast >= lit) I noticed that thousands of tasks were created, compared to a couple hundred otherwise. I'm not sure whether that's better or worse, and I couldn't tell whether it was slower or faster.

Answer 1

Score: 1


The best way to compare different solutions is to call the explain function on each one and compare the plans.

In your case, I think the first solution is better:

df.filter(col("date_col").cast("timestamp") >= lit(my_timestamp) )

The other solutions use withColumn, which creates a whole new column; the first solution does not.

The withColumn approach can still be acceptable if you expect to use both the date column and the timestamp column multiple times later on.

huangapple
  • Posted on June 29, 2023, 14:25:56
  • When reposting, please keep this link: https://go.coder-hub.com/76578508.html