2023-05-25 22:39:46

Joining numpy arrays on matching index values in a specific column

Question

I have multiple numpy arrays with two columns each: one containing a measurement, the other one the year of that measurement. The length and start/end points of those arrays are all different. My goal is to create one large array that contains one column with the years for the entire period for which I have data, and a number of columns with the measurements from each of the input arrays.

Basically I would like to write a script that turns this:

import numpy as np

arr1 = np.array([[1920, 1921, 1922, 1924, 1925], [23, 54, 23, 54, 65]]).T
arr2 = np.array([[1922, 1923, 1924, 1925, 1926], [12, 43, 17, 42, 87]]).T

into this:

year  arr1  arr2
1920  23    nan
1921  54    nan
1922  23    12
1923  nan   43
1924  54    17
1925  65    42
1926  nan   87

I have tried to write an if-else loop that checks whether the years of two rows match and then adds the measurement to the correct row in the output array, but I can't seem to get it to work. I am sure that this is a fairly common task, so I apologize if this question has been asked before, but I wasn't able to find a solution.

Any help is greatly appreciated!

Answer 1

Score: 3

It looks like you shouldn't be using numpy for this but rather pandas:

import pandas as pd

df = (pd.DataFrame(arr1, columns=['year', 'arr1'])
        .merge(pd.DataFrame(arr2, columns=['year', 'arr2']),
               on='year', how='outer')
        .sort_values(by='year')
      )

Output:

   year  arr1  arr2
0  1920  23.0   NaN
1  1921  54.0   NaN
2  1922  23.0  12.0
5  1923   NaN  43.0
3  1924  54.0  17.0
4  1925  65.0  42.0
6  1926   NaN  87.0
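
Since the question mentions more than two input arrays, the same idea can be extended without chaining one merge per array. The following is only a sketch: the `arrays` dictionary and its keys are hypothetical names chosen for illustration, and it assumes each year appears at most once within any single input array.

```python
import numpy as np
import pandas as pd

arr1 = np.array([[1920, 1921, 1922, 1924, 1925], [23, 54, 23, 54, 65]]).T
arr2 = np.array([[1922, 1923, 1924, 1925, 1926], [12, 43, 17, 42, 87]]).T

# Hypothetical container: any number of name -> (year, value) arrays.
arrays = {"arr1": arr1, "arr2": arr2}

# Build one Series per input, indexed by year.
series = {name: pd.Series(a[:, 1], index=a[:, 0]) for name, a in arrays.items()}

# concat along axis=1 aligns on the index (an outer join by default);
# the dictionary keys become the column names.
df = (pd.concat(series, axis=1)
        .sort_index()
        .rename_axis("year")
        .reset_index())
print(df)
```

Adding a third array then only requires another entry in `arrays`.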

Answer 2

Score: 1

I worked out this answer mostly for my own learning.

Here's an example using numpy.lib.recfunctions. I don't have much experience with this module, and I don't think it is heavily used, especially now that the more powerful pandas is available. But for what it's worth:

In [14]: import numpy.lib.recfunctions as rf

Make structured arrays from your arrays, with the same 'idx' field name, but different data fields:

In [17]: rarr1 = rf.unstructured_to_structured(arr1, names=['idx', 'col1'])
In [18]: rarr1
Out[18]: 
array([(1920, 23), (1921, 54), (1922, 23), (1924, 54), (1925, 65)],
      dtype=[('idx', '<i4'), ('col1', '<i4')])

In [19]: rarr2 = rf.unstructured_to_structured(arr2, names=['idx', 'col2'])

Using join_by:

In [22]: rjoint = rf.join_by('idx', rarr1, rarr2, 'outer')

In [23]: rjoint
Out[23]: 
masked_array(data=[(1920, 23, --), (1921, 54, --), (1922, 23, 12),
                   (1923, --, 43), (1924, 54, 17), (1925, 65, 42),
                   (1926, --, 87)],
             mask=[(False, False,  True), (False, False,  True),
                   (False, False, False), (False,  True, False),
                   (False, False, False), (False, False, False),
                   (False,  True, False)],
       fill_value=(999999, 999999, 999999),
            dtype=[('idx', '<i4'), ('col1', '<i4'), ('col2', '<i4')])
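
If NaN is wanted for the masked entries, as in the table the question asks for, one option is to cast each data field to float and then fill the mask. This is only a sketch: it rebuilds the arrays so that it runs on its own, and it assumes the second array's data field is named 'col2'.

```python
import numpy as np
import numpy.lib.recfunctions as rf

arr1 = np.array([[1920, 1921, 1922, 1924, 1925], [23, 54, 23, 54, 65]]).T
arr2 = np.array([[1922, 1923, 1924, 1925, 1926], [12, 43, 17, 42, 87]]).T

rarr1 = rf.unstructured_to_structured(arr1, names=['idx', 'col1'])
rarr2 = rf.unstructured_to_structured(arr2, names=['idx', 'col2'])

# Outer join on 'idx'; the result is a masked structured array.
rjoint = rf.join_by('idx', rarr1, rarr2, 'outer')

# Integer fields cannot hold nan, so cast each field to float first
# and only then replace the masked entries with nan.
col1 = rjoint['col1'].astype(float).filled(np.nan)
col2 = rjoint['col2'].astype(float).filled(np.nan)

# Stack into a plain 2-D float array with columns: year, col1, col2.
result = np.column_stack([rjoint['idx'].filled(), col1, col2])
print(result)
```

The year column becomes float as well, because a plain 2-D array has a single dtype.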

or without the mask (999999 is the default integer fill value, standing in for a 'nan' fill):

In [27]: rjoint = rf.join_by('idx', rarr1, rarr2, 'outer', usemask=False)

In [28]: rjoint
Out[28]: 
array([(1920,     23, 999999), (1921,     54, 999999),
       (1922,     23,     12), (1923, 999999,     43),
       (1924,     54,     17), (1925,     65,     42),
       (1926, 999999,     87)],
      dtype=[('idx', '<i4'), ('col1', '<i4'), ('col2', '<i4')])

Or using defaults (nan does not play nicely with int values)

In [32]: rjoint = rf.join_by('idx', rarr1, rarr2, 'outer', usemask=False, defaults={'col1': -1, 'col2': -1})

In [33]: rjoint
Out[33]: 
array([(1920, 23, -1), (1921, 54, -1), (1922, 23, 12), (1923, -1, 43),
       (1924, 54, 17), (1925, 65, 42), (1926, -1, 87)],
      dtype=[('idx', '<i4'), ('col1', '<i4'), ('col2', '<i4')])

In [34]: rf.structured_to_unstructured(rjoint)
Out[34]: 
array([[1920,   23,   -1],
       [1921,   54,   -1],
       [1922,   23,   12],
       [1923,   -1,   43],
       [1924,   54,   17],
       [1925,   65,   42],
       [1926,   -1,   87]])
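
If a float result with NaN for the missing years is acceptable, the join can also be done with plain NumPy indexing and no structured arrays. The following is a minimal sketch: `join_on_year` is a hypothetical helper name, and it assumes every input is a two-column (year, value) array in which each year appears at most once.

```python
import numpy as np

arr1 = np.array([[1920, 1921, 1922, 1924, 1925], [23, 54, 23, 54, 65]]).T
arr2 = np.array([[1922, 1923, 1924, 1925, 1926], [12, 43, 17, 42, 87]]).T


def join_on_year(*arrays):
    """Outer-join two-column (year, value) arrays on the year column."""
    # Sorted union of all years that occur in any input.
    years = np.unique(np.concatenate([a[:, 0] for a in arrays]))

    # Float output so that missing measurements can be stored as nan.
    out = np.full((years.size, len(arrays) + 1), np.nan)
    out[:, 0] = years

    for col, a in enumerate(arrays, start=1):
        # Every year of `a` is present in `years`, so searchsorted
        # returns the exact output row for each measurement.
        rows = np.searchsorted(years, a[:, 0])
        out[rows, col] = a[:, 1]
    return out


joined = join_on_year(arr1, arr2)
print(joined)
```

Because the helper takes any number of arrays, further inputs can be passed as extra arguments.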
