PySpark - avoid data on driver node?
Question
I am new to Spark and I am trying to understand how to write my operations so that they can be performed in a distributed fashion, as opposed to collecting massive datasets onto the driver node.
I need to perform a seasonal decomposition on time series data using the package https://pypi.org/project/statsmodels/. The function seasonal_decompose in particular operates on array-like data. How can I run this function on all of the data in a column, without first collecting the column as a plain Python array on the driver node?
from pyspark.sql.functions import collect_list
from statsmodels.tsa.seasonal import seasonal_decompose

# df is a PySpark DataFrame; model and period are defined elsewhere.
# collect_list() followed by first() pulls the entire column onto the driver.
data = df.select(collect_list('metric')).first()[0]
decomposition = seasonal_decompose(data, model=model, extrapolate_trend='freq', period=period)
Answer 1
Score: 1
There are two cases in which you would want to distribute your operations using Spark:
- You have a single large dataset on which you want to perform your analysis, i.e. your result would be one result object for all of your data points. To solve this case you would need a distributed version of your analysis algorithm, and seasonal_decompose does not support distributed analysis, so you are somewhat out of luck here.
- Your data can be split into many smaller datasets (for example by year, location, or use case) so that each part is small enough to be processed on a single node in your Spark cluster, and you want all parts to be processed in parallel. Each part then produces its own result object. To solve this case you can split the dataframe into partitions and then use mapPartitions. Within mapPartitions you process each chunk of data just as you would in a non-distributed environment (see the sketch after this list).
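A minimal sketch of the second approach, assuming the data can be split by a hypothetical group_id column, that each row carries a timestamp column for ordering, and that the metric column holds the values to decompose; model='additive' and period=12 are placeholder parameters, not values from the question.

from itertools import groupby
from statsmodels.tsa.seasonal import seasonal_decompose

def decompose_partition(rows):
    # Materialize the partition; after repartition('group_id') all rows of a
    # group sit in the same partition, but a partition may hold several groups,
    # so sort by group and time and decompose each group separately.
    rows = sorted(rows, key=lambda r: (r['group_id'], r['timestamp']))
    for gid, group in groupby(rows, key=lambda r: r['group_id']):
        group = list(group)
        values = [r['metric'] for r in group]
        result = seasonal_decompose(values, model='additive',
                                    extrapolate_trend='freq', period=12)
        for r, trend, seasonal, resid in zip(group, result.trend,
                                             result.seasonal, result.resid):
            yield (gid, r['timestamp'], r['metric'],
                   float(trend), float(seasonal), float(resid))

decomposed = (
    df.repartition('group_id')             # co-locate each group's rows
      .rdd
      .mapPartitions(decompose_partition)  # one decomposition per group, run in parallel
      .toDF(['group_id', 'timestamp', 'metric', 'trend', 'seasonal', 'resid'])
)

On Spark 3.0+, df.groupBy('group_id').applyInPandas(...) with a pandas function that calls seasonal_decompose is an alternative that handles the per-group splitting for you; the mapPartitions version above sticks to the mechanism named in the answer.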
Comments