Is it faster to cast within filter(), or to cast in a new withColumn() and then filter, in Spark?
Question
Pretty straightforward. Is it faster to run
df.filter(col("date_col").cast("timestamp") >= lit(my_timestamp) )
or
df.withColumn("date_col_cast_timestamp", col("date_col").cast("timestamp")) \
.filter(col("date_col_cast_timestamp") >= lit(my_timestamp))
or one more...
df.withColumn("date_col_cast_timestamp", col("date_col").cast("timestamp")) \
.withColumn("my_timestamp", lit(my_timestamp)) \
.filter(col("date_col_cast_timestamp") >= col("my_timestamp"))
If you could explain why, that would be greatly appreciated. I'm not the best at understanding Spark, and when I tried filter(cast >= lit) I noticed that thousands of tasks were created, compared to a couple hundred. I'm not sure whether that's better or worse, and I couldn't tell whether it was slower or faster.
Answer 1
Score: 1
The best way to compare different solutions is to call explain() on each one and compare the plans.
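One way to carry out that comparison is sketched below. It is only an illustration: plan_text and show_plans are hypothetical helper names, not part of PySpark, and the mode argument to explain() assumes Spark 3.0 or later. The sketch relies on the fact that explain() prints the plan instead of returning it, so the output is captured from stdout.

```python
import io
from contextlib import redirect_stdout


def plan_text(df, mode="formatted"):
    """Capture what df.explain(mode=...) prints and return it as a string."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        df.explain(mode=mode)  # explain() prints the plan and returns None
    return buf.getvalue()


def show_plans(df, my_timestamp):
    """Print the plan of two variants from the question (needs PySpark)."""
    # Imported here so plan_text stays usable without PySpark installed.
    from pyspark.sql.functions import col, lit

    variants = {
        "cast inside filter": df.filter(
            col("date_col").cast("timestamp") >= lit(my_timestamp)
        ),
        "withColumn, then filter": df.withColumn(
            "date_col_cast_timestamp", col("date_col").cast("timestamp")
        ).filter(col("date_col_cast_timestamp") >= lit(my_timestamp)),
    }
    for name, variant in variants.items():
        print(f"--- {name} ---")
        print(plan_text(variant))
```

Calling show_plans(df, my_timestamp) on the real DataFrame prints both plans one after the other, so any difference in the Filter and Project steps can be read off directly rather than guessed.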
In your case, I think the first solution is better:
df.filter(col("date_col").cast("timestamp") >= lit(my_timestamp) )
The other solutions use withColumn, which creates a whole new column; the first solution does not.
The withColumn approach can still be acceptable if you later want to use both the date column and the timestamp column multiple times.