使用部分共享内存的多进程处理

huangapple go评论57阅读模式
英文:

Multiprocessing with partial shared memory

问题

我手头有一个最小二乘问题:$Xw = Y$,其中$X$是我的训练*数据*矩阵,$w$是参数化我的模型的向量我已经有了一个可以解决这个问题的代码

```python
def solve_lsqr(X,Y):
    # 一些标准操作
    return w

我想要执行一种特定类型的交叉验证,即使用特定的滞后值对我的数据进行子采样(而不是像scikit-learn中的洗牌子采样方法)。因此,与将所有数据传递给 solve_lsqr 不同,这个函数会处理以特定方式选择的数据子集。

我使用了Python的多进程来并行化我的代码,但我认为我的方法不够优化:我将整个子采样数据数组(而不是索引)传递给了这个函数。一旦并行执行,这将完全填满我的RAM并严重降低性能。

我认为更合理的做法是只将索引集传递给 solve_lsqr,并以某种方式通知它数据矩阵$(X,Y)$的位置。这需要在多进程库的Pool对象中的所有工作进程之间实现一种部分共享内存。部分共享,因为除了$(X,Y)$之外,所有工作进程的内部变量基本上都是独立的。

我如何在Python中实现这一点?使用原生的多进程库能做到吗?

感谢您的帮助。


<details>
<summary>英文:</summary>

I have a least sequre problem at hand: $Xw = Y$, where $X$ is my (training) *data* matrix and $w$ is a vector parametrizing my model. I already have a code that does it for me:

```python
def solve_lsqr(X,Y):
    # something standard
    return w

I want to perform a specific type of cross-validation, namely subsampling my data with a particular set of lags (as opposed to the shuffled subsampling approach of e.g. scikit-learn). So, instead of passing all my data to solve_lsqr, the function processes a subset of the data that is selected in a particular manner.

I used python's multiprocessing to parallelize my code, but I believe my approach is suboptimal: I pass the entire subsampled data array (not indices) to the function. This, once performed in parallel, fills my RAM completely and degrades the performance substantially.

I think it's more reasonable to just pass the index set to solve_lsqr, and somehow inform it of the location of data matrices $(X,Y)$. This requires some sort of partial shared memory among all the workers in the Pool object of multiprocessing library. Partial, because other than $(X,Y)$, all internal variables of workers are essentially independent.

How can I implement that in python? Is it even possible to do that with the native multiprocessing library?

I appreciate your help.

答案1

得分: 0

解决方案就是将数据矩阵 $X$ 和 $Y$ 声明为 global(只读)变量,并在 solve_lsqr 中使用 Python 切片来传递正确的索引。

英文:

The solution was as simple as declaring the data matrices $X,Y$ as global (read-only) variables inside solve_lsqr and passing the proper indices with python slices.

huangapple
  • 本文由 发表于 2023年2月19日 18:54:55
  • 转载请务必保留本文链接:https://go.coder-hub.com/75499612.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定