How to delete RDD with unpersist
Question
I'm trying to understand how RDD.unpersist() works, but I'm running into some confusing output.
When I force-delete an RDD and then try to show it:
rdd.unpersist(blocking=True)
rdd.show() # why doesn't this line throw an error?
I expect the second line to error, but it doesn't. The RDD prints out as usual.
I saw these two questions: "Why doesnt spark unload memory even with unpersist" and "How to make sure my DataFrame frees its memory?". Both helped me understand how to use unpersist, but they don't answer my question.
I'm using a Jupyter notebook and wondered if the notebook might be caching the RDD, so I tested this out in a .py file as well, and the same thing happened.
If the RDD has been deleted, why does it print when show() is called on it?
If it hasn't been deleted, how can I delete it?
Answer 1
Score: 1
You can check whether the RDD is cached by using the is_cached attribute.
rdd = sc.parallelize([1, 2, 3])
print(rdd.is_cached)
rdd.cache()
print(rdd.is_cached)
rdd.unpersist()
print(rdd.is_cached)
rdd.count()  # ignore the response
print(rdd.is_cached)
False
True
False
False
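Without a cache() call, the RDD is never cached, so unpersist() has nothing to remove and does not behave as you expected. It does not delete the RDD itself; it only removes the cached copy, and Spark recomputes the data from its lineage the next time an action runs. How do you delete a variable?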
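One rough way to picture this is a toy model in plain Python. It is not Spark's actual implementation, and ToyRDD with all of its methods is a hypothetical name made up for illustration. The sketch assumes an RDD can be thought of as a recipe for producing data plus an optional stored copy, so dropping the stored copy leaves the recipe usable:

```python
class ToyRDD:
    """Toy stand-in for an RDD: a recipe (lineage) plus an optional cached copy.

    Hypothetical illustration only; this is not how Spark is implemented.
    """

    def __init__(self, compute):
        self._compute = compute    # the recipe: how to rebuild the data
        self._cached = None        # materialized copy, if any
        self._want_cache = False   # set by cache(), cleared by unpersist()
        self.compute_calls = 0     # how many times the recipe actually ran

    @property
    def is_cached(self):
        return self._want_cache

    def cache(self):
        self._want_cache = True
        return self

    def unpersist(self):
        # Drops only the stored copy; the recipe itself is untouched.
        self._want_cache = False
        self._cached = None
        return self

    def collect(self):
        if self._want_cache and self._cached is not None:
            return self._cached
        self.compute_calls += 1
        data = self._compute()
        if self._want_cache:
            self._cached = data
        return data


rdd = ToyRDD(lambda: [x * 2 for x in [1, 2, 3]])
rdd.cache()
first = rdd.collect()    # runs the recipe and stores the result
second = rdd.collect()   # served from the stored copy
rdd.unpersist()
third = rdd.collect()    # still works: the recipe runs again
```

In this model the third collect() succeeds because only the stored copy was discarded. Removing the Python name itself (for example with del rdd) is a separate step from unpersisting.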