在特定列上匹配索引值连接numpy数组。

huangapple go评论63阅读模式
英文:

Joining numpy arrays on matching index values in a specific column

问题

我有多个包含两列的numpy数组:一列包含测量值,另一列包含测量的年份。这些数组的长度和起始/结束点都不同。我的目标是创建一个大数组,其中包含整个数据周期的年份列,以及每个输入数组的测量列。

基本上,我想编写一个脚本,将这个:

import numpy as np

arr1 = np.array([[1920, 1921, 1922, 1924, 1925], [23, 54, 23, 54, 65]]).T
arr2 = np.array([[1922, 1923, 1924, 1925, 1926], [12, 43, 17, 42, 87]]).T

转换成这个:

year  arr1  arr2
1920  23    nan
1921  54    nan
1922  23    12
1923  nan   43
1924  54    17
1925  65    42
1926  nan   87

我尝试编写一个if-else循环,检查两行的年份是否匹配,然后将测量值添加到输出数组的正确行,但似乎无法使其正常工作。我相信这是一个相当常见的任务,如果以前已经提出了这个问题,我非常抱歉,但我找不到解决方案。

非常感谢任何帮助!

英文:

I have multiple numpy arrays with two columns each: one containing a measurement, the other one the year of that measurement. The length and start/end points of those arrays are all different. My goal is to create one large array that contains one column with the years for the entire period for which I have data, and a number of columns with the measurements from each of the input arrays.

Basically I would like to write a script that turns this:

import numpy as np

arr1 = np.array([[1920, 1921, 1922, 1924, 1925], [23, 54, 23, 54, 65]]).T
arr2 = np.array([[1922, 1923, 1924, 1925, 1926], [12, 43, 17, 42, 87]]).T

into this:

year  arr1  arr2
1920  23    nan
1921  54    nan
1922  23    12
1923  nan   43
1924  54    17
1925  65    42
1926  nan   87

I have tried to write an if-else loop that checks, whether the years of two rows match and then add the measurement to the correct row in the output array, but I can't seem to get it to work. I am sure that this is a fairly common task, so I am very sorry if this question has been asked before, but I wasn't able to find a solution.

Any help is greatly appreciated!

答案1

得分: 3

看起来你应该使用[tag:pandas]而不是[tag:numpy]来处理这个问题:

import pandas as pd

df = (pd.DataFrame(arr1, columns=['year', 'arr1'])
        .merge(pd.DataFrame(arr2, columns=['year', 'arr2']),
               on='year', how='outer')
        .sort_values(by='year')
      )

输出:

   year  arr1  arr2
0  1920  23.0   NaN
1  1921  54.0   NaN
2  1922  23.0  12.0
5  1923   NaN  43.0
3  1924  54.0  17.0
4  1925  65.0  42.0
6  1926   NaN  87.0

如果需要进一步的帮助,请告诉我。

英文:

It looks like you shouldn't be using [tag:numpy] for this but rather [tag:pandas]:

import pandas as pd

df = (pd.DataFrame(arr1, columns=['year', 'arr1'])
        .merge(pd.DataFrame(arr2, columns=['year', 'arr2']),
               on='year', how='outer')
        .sort_values(by='year')
      )

Output:

   year  arr1  arr2
0  1920  23.0   NaN
1  1921  54.0   NaN
2  1922  23.0  12.0
5  1923   NaN  43.0
3  1924  54.0  17.0
4  1925  65.0  42.0
6  1926   NaN  87.0

答案2

得分: 1

I worked out this answer mostly for my own learning.

这个答案主要是为了我自己的学习而准备的。

Here's an example using numpy.librecfunctions. I don't have much experience with this feature. And I don't think it is heavily used, especially with the more powerful pandas now. But for what it's worth.

以下是使用 numpy.librecfunctions 的示例。我对这个功能不太熟悉。我认为它并没有被广泛使用,尤其是现在有了更强大的 pandas。但还是值得一提的。

In [14]: import numpy.lib.recfunctions as rf

在 [14] 中:导入 numpy.lib.recfunctions 作为 rf。

Make structured arrays from your arrays, with the same 'idx' field name, but different data fields:

从你的数组中创建结构化数组,具有相同的 'idx' 字段名称,但具有不同的数据字段:

In [17]: rarr1 = rf.unstructured_to_structured(arr1, names=['idx', 'col1'])
In [18]: rarr1
Out[18]:
array([(1920, 23), (1921, 54), (1922, 23), (1924, 54), (1925, 65)],
dtype=[('idx', '<i4'), ('col1', '<i4')])

在 [17] 中:使用 arr1 创建结构化数组 rarr1,字段名称为 ['idx', 'col1']。
在 [18] 中:rarr1 的输出结果。

In [19]: rarr2 = rf.unstructured_to_structured(arr2, names=['idx', 'col1'])

在 [19] 中:使用 arr2 创建结构化数组 rarr2,字段名称为 ['idx', 'col1']。

Using join_by:

使用 join_by

In [22]: rjoint = rf.join_by('idx', rarr1, rarr2, 'outer')

在 [22] 中:使用 join_by 进行连接,连接字段为 'idx',连接 rarr1 和 rarr2,连接方式为 'outer'。

In [23]: rjoint
Out[23]:
masked_array(data=[(1920, 23, --), (1921, 54, --), (1922, 23, 12),
(1923, --, 43), (1924, 54, 17), (1925, 65, 42),
(1926, --, 87)],
mask=[(False, False, True), (False, False, True),
(False, False, False), (False, True, False),
(False, False, False), (False, False, False),
(False, True, False)],
fill_value=(999999, 999999, 999999),
dtype=[('idx', '<i4'), ('col1', '<i4'), ('col2', '<i4')])

在 [23] 中:rjoint 的输出结果,包含了连接后的数据以及掩码信息。

or without the mask (99999 is the 'int' equivalent of a 'nan' fill)

或者不使用掩码(99999 是 'int' 类型的 'nan' 填充的等价值)

In [27]: rjoint = rf.join_by('idx', rarr1, rarr2, 'outer', usemask=False)

在 [27] 中:使用 join_by 进行连接,连接字段为 'idx',连接 rarr1 和 rarr2,连接方式为 'outer',不使用掩码。

In [28]: rjoint
Out[28]:
array([(1920, 23, 999999), (1921, 54, 999999),
(1922, 23, 12), (1923, 999999, 43),
(1924, 54, 17), (1925, 65, 42),
(1926, 999999, 87)],
dtype=[('idx', '<i4'), ('col1', '<i4'), ('col2', '<i4')])

在 [28] 中:rjoint 的输出结果,不包含掩码,使用 999999 作为 'nan' 的填充值。

Or using defaults (nan does not play nicely with int values)

或者使用默认值(nan 与 int 值不太兼容)

In [32]: rjoint = rf.join_by('idx', rarr1, rarr2, 'outer', usemask=False, defaults={'col1': -1, 'col2': -1})

在 [32] 中:使用 join_by 进行连接,连接字段为 'idx',连接 rarr1 和 rarr2,连接方式为 'outer',不使用掩码,使用默认值 {'col1': -1, 'col2': -1}。

In [33]: rjoint
Out[33]:
array([(1920, 23, -1), (1921, 54, -1), (1922, 23, 12), (1923, -1, 43),
(1924, 54, 17), (1925, 65, 42), (1926, -1, 87)],
dtype=[('idx', '<i4'), ('col1', '<i4'), ('col2', '<i4')])

在 [33] 中:rjoint 的输出结果,使用默认值进行填充。

In [34]: rf.structured_to_unstructured(rjoint)

在 [34] 中:使用 structured_to_unstructured 将结构化数组 rjoint 转换为非结构化数组。

英文:

I worked out this answer mostly for my own learning.

Here's an example using numpy.librecfunctions. I don't have much experience with this feature. And I don't think it is heavily used, especially with the more powerful panadas now. But for what it's worth.

In [14]: import numpy.lib.recfunctions as rf

Make structured arrays from your arrays, with the same 'idx' field name, but different data fields:

In [17]: rarr1=rf.unstructured_to_structured(arr1,names=[&#39;idx&#39;,&#39;col1&#39;])
In [18]: rarr1
Out[18]: 
array([(1920, 23), (1921, 54), (1922, 23), (1924, 54), (1925, 65)],
      dtype=[(&#39;idx&#39;, &#39;&lt;i4&#39;), (&#39;col1&#39;, &#39;&lt;i4&#39;)])

In [19]: rarr2=rf.unstructured_to_structured(arr2, names=[&#39;idx&#39;,&#39;col1&#39;])

Using join_by:

In [22]: rjoint = rf.join_by(&#39;idx&#39;, rarr1, rarr2, &#39;outer&#39;)

In [23]: rjoint
Out[23]: 
masked_array(data=[(1920, 23, --), (1921, 54, --), (1922, 23, 12),
                   (1923, --, 43), (1924, 54, 17), (1925, 65, 42),
                   (1926, --, 87)],
             mask=[(False, False,  True), (False, False,  True),
                   (False, False, False), (False,  True, False),
                   (False, False, False), (False, False, False),
                   (False,  True, False)],
       fill_value=(999999, 999999, 999999),
            dtype=[(&#39;idx&#39;, &#39;&lt;i4&#39;), (&#39;col1&#39;, &#39;&lt;i4&#39;), (&#39;col2&#39;, &#39;&lt;i4&#39;)])

or without the mask (99999 is the 'int' equivalent of a 'nan' fill)

In [27]: rjoint = rf.join_by(&#39;idx&#39;, rarr1, rarr2, &#39;outer&#39;,usemask=False)

In [28]: rjoint
Out[28]: 
array([(1920,     23, 999999), (1921,     54, 999999),
       (1922,     23,     12), (1923, 999999,     43),
       (1924,     54,     17), (1925,     65,     42),
       (1926, 999999,     87)],
      dtype=[(&#39;idx&#39;, &#39;&lt;i4&#39;), (&#39;col1&#39;, &#39;&lt;i4&#39;), (&#39;col2&#39;, &#39;&lt;i4&#39;)])

Or using defaults (nan does not play nicely with int values)

In [32]: rjoint = rf.join_by(&#39;idx&#39;, rarr1, rarr2, &#39;outer&#39;,usemask=False, defaults={&#39;col1&#39;:-1, &#39;col2&#39;:-1})

In [33]: rjoint
Out[33]: 
array([(1920, 23, -1), (1921, 54, -1), (1922, 23, 12), (1923, -1, 43),
       (1924, 54, 17), (1925, 65, 42), (1926, -1, 87)],
      dtype=[(&#39;idx&#39;, &#39;&lt;i4&#39;), (&#39;col1&#39;, &#39;&lt;i4&#39;), (&#39;col2&#39;, &#39;&lt;i4&#39;)])

In [34]: rf.structured_to_unstructured(rjoint)
Out[34]: 
array([[1920,   23,   -1],
       [1921,   54,   -1],
       [1922,   23,   12],
       [1923,   -1,   43],
       [1924,   54,   17],
       [1925,   65,   42],
       [1926,   -1,   87]])

huangapple
  • 本文由 发表于 2023年5月25日 22:39:46
  • 转载请务必保留本文链接:https://go.coder-hub.com/76333512.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定