How to delete an RDD with unpersist

Question

I'm trying to understand how RDD.unpersist() works but I'm running into some confusing output.

When I force-delete an RDD and then try to show it:

rdd.unpersist(blocking=True)
rdd.show() # why doesn't this line throw an error?

I expect the second line to error, but it doesn't. The RDD prints out as usual.

I saw these two questions: Why doesnt spark unload memory even with unpersist, How to make sure my DataFrame frees its memory?. They were both helpful in understanding how to use unpersist but don't answer my question.

I'm using a Jupyter notebook and wondered if the notebook might be caching the RDD, so I tested this out in a .py file as well, and the same thing happened.

If the RDD has been deleted, why does it print when show() is called on it?

If it hasn't been deleted, how can I delete it?

Answer 1

Score: 1

You can check whether the RDD is cached by using the is_cached attribute.

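from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuse the running context if there is one

rdd = sc.parallelize([1, 2, 3])
print(rdd.is_cached)  # not cached yet
rdd.cache()
print(rdd.is_cached)  # marked for caching
rdd.unpersist()
print(rdd.is_cached)  # cache marker removed
rdd.count()  # run an action; ignore the result
print(rdd.is_cached)  # still not cached, and the RDD is still usable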

False
True
False
False

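Without a call to cache(), the RDD is never cached, so unpersist() has nothing to remove and does not do what you expected. unpersist() does not delete the RDD itself; it only removes the cached copy. The RDD can still be recomputed from its lineage, which is why it still prints. If you want to get rid of the variable itself, delete it as you would any other Python variable (see: How do you delete a variable?).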
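To make the distinction concrete, here is a small pure-Python toy model of the behaviour described above. It is only a sketch and not Spark's implementation: ToyRDD, compute_calls and the other names are hypothetical and invented for this example. It assumes an RDD can be thought of as a recipe (its lineage) plus an optional cached copy, so dropping the cache leaves the recipe intact.

```python
# Toy model only: this is NOT Spark's implementation or API.
# ToyRDD and its members are hypothetical names used for illustration.
from typing import Callable, List, Optional


class ToyRDD:
    """A lazily computed dataset with an optional cached copy."""

    def __init__(self, compute: Callable[[], List[int]]) -> None:
        self._compute = compute          # the "lineage": how to rebuild the data
        self._cached: Optional[List[int]] = None
        self.is_cached = False           # mirrors the flag used in the answer
        self.compute_calls = 0           # counts how often the data was rebuilt

    def cache(self) -> "ToyRDD":
        self.is_cached = True            # only marks the dataset; nothing is computed yet
        return self

    def unpersist(self) -> "ToyRDD":
        self.is_cached = False
        self._cached = None              # drop the cached copy, keep the lineage
        return self

    def collect(self) -> List[int]:
        if self._cached is not None:
            return list(self._cached)    # served from the cached copy
        self.compute_calls += 1
        data = self._compute()           # rebuilt from the lineage
        if self.is_cached:
            self._cached = list(data)
        return data


rdd = ToyRDD(lambda: [x * 2 for x in [1, 2, 3]])
rdd.cache()
first = rdd.collect()       # computes the data and stores the cached copy
second = rdd.collect()      # served from the cached copy
rdd.unpersist()
third = rdd.collect()       # no error: recomputed from the lineage
print(first, second, third, rdd.compute_calls)

del rdd                     # removes the Python name, as with any other variable
```

In this model the third collect() succeeds for the same reason the output in the question still prints: only the cached copy is gone. After del rdd, any further use of the name raises a NameError, which is the kind of error the question was expecting from unpersist().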

Posted by huangapple on February 8, 2023, 16:01:33
When reposting, please keep the link to this article: https://go.coder-hub.com/75382809.html