PySpark – avoid data on driver node?

Question

I am new to Spark and I am trying to understand how to write my operations so that they can be performed in a distributed fashion, as opposed to collecting massive datasets onto the driver node.

I need to perform a seasonality decomposition on time series data using the package https://pypi.org/project/statsmodels/. The function seasonal_decompose in particular operates on array-like data. How can I run this function on all of the data in a column, without having to first collect the column as a plain Python array on the driver node?

from statsmodels.tsa.seasonal import seasonal_decompose
from pyspark.sql.functions import collect_list

# df is a PySpark dataframe; model and period are defined elsewhere
data = df.select(collect_list('metric')).first()[0]
decomposition = seasonal_decompose(data, model=model, extrapolate_trend='freq', period=period)

Answer 1

Score: 1

There are two cases in which you would want to distribute your operations using Spark:

  1. You have a single large dataset on which you want to perform your analysis, i.e. the result would be one result object for all of your data points. To handle this case you would need a distributed version of your analysis algorithm, and seasonal_decompose does not support distributed analysis, so you are a bit out of luck here.
  2. Your data can be split into many smaller datasets (for example by year, location, or use case), so that each part is small enough to be processed on a single node in your Spark cluster, and you want all parts to be processed in parallel. Each part then yields its own result object. To handle this case you can split the dataframe into partitions and then use mapPartitions. Within mapPartitions you process each chunk of data just as you would in a non-distributed environment; a minimal sketch follows below.
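
Here is a minimal sketch of the second case, assuming the series can be split by a hypothetical location column and ordered by a timestamp column; the additive model and the period of 12 are also illustrative placeholders rather than values from the question. It also assumes each key's series fits comfortably in a single executor's memory.

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

def decompose_partition(rows):
    # Materialize this partition (small by construction) as a pandas frame.
    pdf = pd.DataFrame([row.asDict() for row in rows])
    if pdf.empty:
        return
    # A hash-partitioned dataframe can still hold several keys per partition,
    # so group within the partition before decomposing.
    for key, group in pdf.groupby('location'):
        group = group.sort_values('timestamp')
        result = seasonal_decompose(group['metric'], model='additive',
                                    extrapolate_trend='freq', period=12)
        # Emit one row per observation with its trend and seasonal components.
        for ts, trend, seasonal in zip(group['timestamp'], result.trend, result.seasonal):
            yield (key, ts, float(trend), float(seasonal))

# Co-locate rows that share a key, then decompose each key's series
# in parallel on the executors instead of collecting to the driver.
decomposed = (
    df.repartition('location')
      .rdd
      .mapPartitions(decompose_partition)
      .toDF(['location', 'timestamp', 'trend', 'seasonal'])
)

On Spark 3+, groupBy('location').applyInPandas(...) offers a similar per-group pattern with less manual bookkeeping, since Spark hands each group to your function as a pandas DataFrame.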
