July 20, 2023 19:45:02

What is the best approach to display a PySpark DataFrame without re-executing the logic each time it is displayed?

Question

I have a PySpark DataFrame, defined in a Databricks notebook, to which several transformations are applied. I want to display the DataFrame after some of these transformations to check the results.

However, according to the reference, every time I display the results, the execution plan runs again. The reference proposes saving the DataFrame and then loading it back, but that solution cannot be applied on the platform I am working on.

Is there any other way to display results a few times in a notebook without re-executing the logic?

Can I use .cache() for this purpose, as below?

df.cache().count()
df.display()

However, I am still not sure that caching avoids recomputing all of the transformations.

The reference introduces another solution:

When you cache a DataFrame create a new variable for it cachedDF = df.cache().
This will allow you to bypass the problems that we were solving in our example, that sometimes it is not clear what is the analyzed plan and what was actually cached. 
Here whenever you call cachedDF.select(…) it will leverage the cached data.

I did not fully understand the logic behind it, or whether it helps avoid recomputing all of the transformations.

Answer 1

Score: 1

DataFrames in Spark are immutable, but cache() does not create a separate cached copy. It lazily marks the DataFrame's query plan for caching and returns the DataFrame. The data is only stored when the first action runs. The reference you quoted assigns the result to a variable:

cachedDF = df.cache()

So in your snippet, df.cache().count() materializes the cache, and the following df.display() should read the cached data rather than recompute the transformations, because Spark matches cached data by query plan and not by variable name.

Writing it with an explicit variable is still clearer:

cachedDF = df.cache()
cachedDF.count()
cachedDF.display()

This way count() and display() both visibly run on the cached DataFrame, and any later transformations can be built on cachedDF.

Note that calling cache() a second time adds nothing. The following is redundant rather than wrong:

df.cache().count()
df.cache().display()

The second cache() call finds that the plan is already cached, so df is not recalculated. display() simply reads the cached data.
