英文:
Joining numpy arrays on matching index values in a specific column
问题
我有多个包含两列的numpy数组:一列包含测量值,另一列包含测量的年份。这些数组的长度和起始/结束点都不同。我的目标是创建一个大数组,其中包含整个数据周期的年份列,以及每个输入数组的测量列。
基本上,我想编写一个脚本,将这个:
import numpy as np
arr1 = np.array([[1920, 1921, 1922, 1924, 1925], [23, 54, 23, 54, 65]]).T
arr2 = np.array([[1922, 1923, 1924, 1925, 1926], [12, 43, 17, 42, 87]]).T
转换成这个:
year arr1 arr2
1920 23 nan
1921 54 nan
1922 23 12
1923 nan 43
1924 54 17
1925 65 42
1926 nan 87
我尝试编写一个if-else循环,检查两行的年份是否匹配,然后将测量值添加到输出数组的正确行,但似乎无法使其正常工作。我相信这是一个相当常见的任务,如果以前已经提出了这个问题,我非常抱歉,但我找不到解决方案。
非常感谢任何帮助!
英文:
I have multiple numpy arrays with two columns each: one containing a measurement, the other one the year of that measurement. The length and start/end points of those arrays are all different. My goal is to create one large array that contains one column with the years for the entire period for which I have data, and a number of columns with the measurements from each of the input arrays.
Basically I would like to write a script that turns this:
import numpy as np
arr1 = np.array([[1920, 1921, 1922, 1924, 1925], [23, 54, 23, 54, 65]]).T
arr2 = np.array([[1922, 1923, 1924, 1925, 1926], [12, 43, 17, 42, 87]]).T
into this:
year arr1 arr2
1920 23 nan
1921 54 nan
1922 23 12
1923 nan 43
1924 54 17
1925 65 42
1926 nan 87
I have tried to write an if-else loop that checks, whether the years of two rows match and then add the measurement to the correct row in the output array, but I can't seem to get it to work. I am sure that this is a fairly common task, so I am very sorry if this question has been asked before, but I wasn't able to find a solution.
Any help is greatly appreciated!
答案1
得分: 3
看起来你应该使用[tag:pandas]而不是[tag:numpy]来处理这个问题:
import pandas as pd
df = (pd.DataFrame(arr1, columns=['year', 'arr1'])
.merge(pd.DataFrame(arr2, columns=['year', 'arr2']),
on='year', how='outer')
.sort_values(by='year')
)
输出:
year arr1 arr2
0 1920 23.0 NaN
1 1921 54.0 NaN
2 1922 23.0 12.0
5 1923 NaN 43.0
3 1924 54.0 17.0
4 1925 65.0 42.0
6 1926 NaN 87.0
如果需要进一步的帮助,请告诉我。
英文:
It looks like you shouldn't be using [tag:numpy] for this but rather [tag:pandas]:
import pandas as pd
df = (pd.DataFrame(arr1, columns=['year', 'arr1'])
.merge(pd.DataFrame(arr2, columns=['year', 'arr2']),
on='year', how='outer')
.sort_values(by='year')
)
Output:
year arr1 arr2
0 1920 23.0 NaN
1 1921 54.0 NaN
2 1922 23.0 12.0
5 1923 NaN 43.0
3 1924 54.0 17.0
4 1925 65.0 42.0
6 1926 NaN 87.0
答案2
得分: 1
I worked out this answer mostly for my own learning.
这个答案主要是为了我自己的学习而准备的。
Here's an example using numpy.librecfunctions
. I don't have much experience with this feature. And I don't think it is heavily used, especially with the more powerful pandas
now. But for what it's worth.
以下是使用 numpy.librecfunctions
的示例。我对这个功能不太熟悉。我认为它并没有被广泛使用,尤其是现在有了更强大的 pandas
。但还是值得一提的。
In [14]: import numpy.lib.recfunctions as rf
在 [14] 中:导入 numpy.lib.recfunctions 作为 rf。
Make structured arrays from your arrays, with the same 'idx' field name, but different data fields:
从你的数组中创建结构化数组,具有相同的 'idx' 字段名称,但具有不同的数据字段:
In [17]: rarr1 = rf.unstructured_to_structured(arr1, names=['idx', 'col1'])
In [18]: rarr1
Out[18]:
array([(1920, 23), (1921, 54), (1922, 23), (1924, 54), (1925, 65)],
dtype=[('idx', '<i4'), ('col1', '<i4')])
在 [17] 中:使用 arr1 创建结构化数组 rarr1,字段名称为 ['idx', 'col1']。
在 [18] 中:rarr1 的输出结果。
In [19]: rarr2 = rf.unstructured_to_structured(arr2, names=['idx', 'col1'])
在 [19] 中:使用 arr2 创建结构化数组 rarr2,字段名称为 ['idx', 'col1']。
Using join_by
:
使用 join_by
:
In [22]: rjoint = rf.join_by('idx', rarr1, rarr2, 'outer')
在 [22] 中:使用 join_by
进行连接,连接字段为 'idx',连接 rarr1 和 rarr2,连接方式为 'outer'。
In [23]: rjoint
Out[23]:
masked_array(data=[(1920, 23, --), (1921, 54, --), (1922, 23, 12),
(1923, --, 43), (1924, 54, 17), (1925, 65, 42),
(1926, --, 87)],
mask=[(False, False, True), (False, False, True),
(False, False, False), (False, True, False),
(False, False, False), (False, False, False),
(False, True, False)],
fill_value=(999999, 999999, 999999),
dtype=[('idx', '<i4'), ('col1', '<i4'), ('col2', '<i4')])
在 [23] 中:rjoint 的输出结果,包含了连接后的数据以及掩码信息。
or without the mask (99999 is the 'int' equivalent of a 'nan' fill)
或者不使用掩码(99999 是 'int' 类型的 'nan' 填充的等价值)
In [27]: rjoint = rf.join_by('idx', rarr1, rarr2, 'outer', usemask=False)
在 [27] 中:使用 join_by
进行连接,连接字段为 'idx',连接 rarr1 和 rarr2,连接方式为 'outer',不使用掩码。
In [28]: rjoint
Out[28]:
array([(1920, 23, 999999), (1921, 54, 999999),
(1922, 23, 12), (1923, 999999, 43),
(1924, 54, 17), (1925, 65, 42),
(1926, 999999, 87)],
dtype=[('idx', '<i4'), ('col1', '<i4'), ('col2', '<i4')])
在 [28] 中:rjoint 的输出结果,不包含掩码,使用 999999 作为 'nan' 的填充值。
Or using defaults (nan does not play nicely with int values)
或者使用默认值(nan 与 int 值不太兼容)
In [32]: rjoint = rf.join_by('idx', rarr1, rarr2, 'outer', usemask=False, defaults={'col1': -1, 'col2': -1})
在 [32] 中:使用 join_by
进行连接,连接字段为 'idx',连接 rarr1 和 rarr2,连接方式为 'outer',不使用掩码,使用默认值 {'col1': -1, 'col2': -1}。
In [33]: rjoint
Out[33]:
array([(1920, 23, -1), (1921, 54, -1), (1922, 23, 12), (1923, -1, 43),
(1924, 54, 17), (1925, 65, 42), (1926, -1, 87)],
dtype=[('idx', '<i4'), ('col1', '<i4'), ('col2', '<i4')])
在 [33] 中:rjoint 的输出结果,使用默认值进行填充。
In [34]: rf.structured_to_unstructured(rjoint)
在 [34] 中:使用 structured_to_unstructured
将结构化数组 rjoint 转换为非结构化数组。
英文:
I worked out this answer mostly for my own learning.
Here's an example using numpy.librecfunctions
. I don't have much experience with this feature. And I don't think it is heavily used, especially with the more powerful panadas
now. But for what it's worth.
In [14]: import numpy.lib.recfunctions as rf
Make structured arrays from your arrays, with the same 'idx' field name, but different data fields:
In [17]: rarr1=rf.unstructured_to_structured(arr1,names=['idx','col1'])
In [18]: rarr1
Out[18]:
array([(1920, 23), (1921, 54), (1922, 23), (1924, 54), (1925, 65)],
dtype=[('idx', '<i4'), ('col1', '<i4')])
In [19]: rarr2=rf.unstructured_to_structured(arr2, names=['idx','col1'])
Using join_by
:
In [22]: rjoint = rf.join_by('idx', rarr1, rarr2, 'outer')
In [23]: rjoint
Out[23]:
masked_array(data=[(1920, 23, --), (1921, 54, --), (1922, 23, 12),
(1923, --, 43), (1924, 54, 17), (1925, 65, 42),
(1926, --, 87)],
mask=[(False, False, True), (False, False, True),
(False, False, False), (False, True, False),
(False, False, False), (False, False, False),
(False, True, False)],
fill_value=(999999, 999999, 999999),
dtype=[('idx', '<i4'), ('col1', '<i4'), ('col2', '<i4')])
or without the mask (99999 is the 'int' equivalent of a 'nan' fill)
In [27]: rjoint = rf.join_by('idx', rarr1, rarr2, 'outer',usemask=False)
In [28]: rjoint
Out[28]:
array([(1920, 23, 999999), (1921, 54, 999999),
(1922, 23, 12), (1923, 999999, 43),
(1924, 54, 17), (1925, 65, 42),
(1926, 999999, 87)],
dtype=[('idx', '<i4'), ('col1', '<i4'), ('col2', '<i4')])
Or using defaults (nan does not play nicely with int values)
In [32]: rjoint = rf.join_by('idx', rarr1, rarr2, 'outer',usemask=False, defaults={'col1':-1, 'col2':-1})
In [33]: rjoint
Out[33]:
array([(1920, 23, -1), (1921, 54, -1), (1922, 23, 12), (1923, -1, 43),
(1924, 54, 17), (1925, 65, 42), (1926, -1, 87)],
dtype=[('idx', '<i4'), ('col1', '<i4'), ('col2', '<i4')])
In [34]: rf.structured_to_unstructured(rjoint)
Out[34]:
array([[1920, 23, -1],
[1921, 54, -1],
[1922, 23, 12],
[1923, -1, 43],
[1924, 54, 17],
[1925, 65, 42],
[1926, -1, 87]])
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论