问题

我手头有一个最小二乘问题：$Xw = Y$，其中$X$是我的（训练）*数据*矩阵，$w$是参数化我的模型的向量。我已经有了一个可以解决这个问题的代码：

```python
def solve_lsqr(X,Y):
    # 一些标准操作
    return w

我想要执行一种特定类型的交叉验证，即使用特定的滞后值对我的数据进行子采样（而不是像scikit-learn中的洗牌子采样方法）。因此，与将所有数据传递给 solve_lsqr 不同，这个函数会处理以特定方式选择的数据子集。

我使用了Python的多进程来并行化我的代码，但我认为我的方法不够优化：我将整个子采样数据数组（而不是索引）传递给了这个函数。一旦并行执行，这将完全填满我的RAM并严重降低性能。

我认为更合理的做法是只将索引集传递给 solve_lsqr，并以某种方式通知它数据矩阵$(X,Y)$的位置。这需要在多进程库的Pool对象中的所有工作进程之间实现一种部分共享内存。部分共享，因为除了$(X,Y)$之外，所有工作进程的内部变量基本上都是独立的。

我如何在Python中实现这一点？使用原生的多进程库能做到吗？

感谢您的帮助。


<details>
<summary>英文:</summary>

I have a least sequre problem at hand: $Xw = Y$, where $X$ is my (training) *data* matrix and $w$ is a vector parametrizing my model. I already have a code that does it for me:

```python
def solve_lsqr(X,Y):
    # something standard
    return w

I want to perform a specific type of cross-validation, namely subsampling my data with a particular set of lags (as opposed to the shuffled subsampling approach of e.g. scikit-learn). So, instead of passing all my data to solve_lsqr, the function processes a subset of the data that is selected in a particular manner.

I used python's multiprocessing to parallelize my code, but I believe my approach is suboptimal: I pass the entire subsampled data array (not indices) to the function. This, once performed in parallel, fills my RAM completely and degrades the performance substantially.

I think it's more reasonable to just pass the index set to solve_lsqr, and somehow inform it of the location of data matrices $(X,Y)$. This requires some sort of partial shared memory among all the workers in the Pool object of multiprocessing library. Partial, because other than $(X,Y)$, all internal variables of workers are essentially independent.

How can I implement that in python? Is it even possible to do that with the native multiprocessing library?

I appreciate your help.

答案1

得分: 0

解决方案就是将数据矩阵 $X$ 和 $Y$ 声明为 global（只读）变量，并在 solve_lsqr 中使用 Python 切片来传递正确的索引。

英文:

The solution was as simple as declaring the data matrices $X,Y$ as global (read-only) variables inside solve_lsqr and passing the proper indices with python slices.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

使用部分共享内存的多进程处理

问题

答案1

条件性的 For 循环

使用r5r进行并行处理。

禁用 PostgreSQL 索引更新暂时，并稍后手动更新索引以提高插入语句性能。

requests.exceptions.InvalidSchema: 找不到连接适配器。我正在尝试遍历一个列表

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论