英文:
Multiprocessing with partial shared memory
问题
我手头有一个最小二乘问题:$Xw = Y$,其中$X$是我的(训练)*数据*矩阵,$w$是参数化我的模型的向量。我已经有了一个可以解决这个问题的代码:
```python
def solve_lsqr(X,Y):
# 一些标准操作
return w
我想要执行一种特定类型的交叉验证,即使用特定的滞后值对我的数据进行子采样(而不是像scikit-learn中的洗牌子采样方法)。因此,与将所有数据传递给 solve_lsqr
不同,这个函数会处理以特定方式选择的数据子集。
我使用了Python的多进程来并行化我的代码,但我认为我的方法不够优化:我将整个子采样数据数组(而不是索引)传递给了这个函数。一旦并行执行,这将完全填满我的RAM并严重降低性能。
我认为更合理的做法是只将索引集传递给 solve_lsqr
,并以某种方式通知它数据矩阵$(X,Y)$的位置。这需要在多进程库的Pool
对象中的所有工作进程之间实现一种部分共享内存。部分共享,因为除了$(X,Y)$之外,所有工作进程的内部变量基本上都是独立的。
我如何在Python中实现这一点?使用原生的多进程库能做到吗?
感谢您的帮助。
<details>
<summary>英文:</summary>
I have a least sequre problem at hand: $Xw = Y$, where $X$ is my (training) *data* matrix and $w$ is a vector parametrizing my model. I already have a code that does it for me:
```python
def solve_lsqr(X,Y):
# something standard
return w
I want to perform a specific type of cross-validation, namely subsampling my data with a particular set of lags (as opposed to the shuffled subsampling approach of e.g. scikit-learn). So, instead of passing all my data to solve_lsqr
, the function processes a subset of the data that is selected in a particular manner.
I used python's multiprocessing to parallelize my code, but I believe my approach is suboptimal: I pass the entire subsampled data array (not indices) to the function. This, once performed in parallel, fills my RAM completely and degrades the performance substantially.
I think it's more reasonable to just pass the index set to solve_lsqr
, and somehow inform it of the location of data matrices $(X,Y)$. This requires some sort of partial shared memory among all the workers in the Pool
object of multiprocessing library. Partial, because other than $(X,Y)$, all internal variables of workers are essentially independent.
How can I implement that in python? Is it even possible to do that with the native multiprocessing library?
I appreciate your help.
答案1
得分: 0
解决方案就是将数据矩阵 $X$ 和 $Y$ 声明为 global
(只读)变量,并在 solve_lsqr
中使用 Python 切片来传递正确的索引。
英文:
The solution was as simple as declaring the data matrices $X,Y$ as global
(read-only) variables inside solve_lsqr
and passing the proper indices with python slices.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论