What is the best approach to display PySpark DataFrame without re-executing the logic each time we display?

Question

I have a PySpark DataFrame, defined in a Databricks notebook, to which I apply a series of transformations. I want to display the DataFrame after several of those transformations to check the results.

However, according to the reference, every time I display the results, the execution plan runs again. The reference proposes saving the DataFrame and then loading it back, but that solution cannot be applied on the platform I am working on.

Is there another way to display the results several times in a notebook without re-executing the logic?

Can I use .cache() for this purpose, as below?

df.cache().count()
df.display()

However, I am still not sure that caching avoids recomputing all the transformations.

The reference introduces another solution:

When you cache a DataFrame create a new variable for it cachedDF = df.cache().
This will allow you to bypass the problems that we were solving in our example, that sometimes it is not clear what is the analyzed plan and what was actually cached. 
Here whenever you call cachedDF.select(…) it will leverage the cached data.

I did not fully understand the logic behind it, or whether it helps avoid recomputing all the transformations.

Answer 1

Score: 1

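cache() is lazy: it only marks the DataFrame for caching, and the data is actually stored when the first action (here count()) runs. In PySpark, cache() returns the same DataFrame it was called on, and Spark looks up cached data by query plan rather than by variable name, so the snippet in your question should already serve df.display() from the cache instead of recomputing the transformations.

Keeping the cached DataFrame in its own variable, exactly as in the reference you quoted, is still the clearer pattern:

cachedDF = df.cache()

It makes explicit which plan was cached, so you do not later cache one DataFrame and display a differently transformed one by mistake.

You can write:

cachedDF = df.cache()
cachedDF.count()
cachedDF.display()

so that count() and display() both visibly run on the cached DataFrame.

Note that the cache only helps queries whose plan contains the plan that was cached. Calling cache() a second time is redundant rather than harmful:

df.cache().count()
df.cache().display()

Here the first line computes df and stores it while count() runs. The second cache() call finds the plan already cached, and display() reads from the cache. A DataFrame whose plan does not contain the cached plan gains nothing from it and is computed from scratch.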

Posted by huangapple on 2023-07-20 19:45:02
Link to this article: https://go.coder-hub.com/76729513.html