numpy with a list of Dict: Syntax to filter elements?
Question
Say I have a NumPy array in which each element is a dict:
import math
import numpy as np

data = [
{
'Account' : '111',
'RIC' : 'AAPL.OQ',
'Position' : 100,
'isActive' : True,
'Rating' : math.nan
},
{
'Account' : '111',
'RIC' : 'MSFT.OQ',
'Position' : 200,
'isActive' : False,
'Rating' : 73
},
{
'Account' : '111',
'RIC' : 'IBM.N',
'Position' : 300,
'isActive' : True,
'Rating' : math.inf
},
{
'Account' : '222',
'RIC' : 'AAPL.OQ',
'Position' : 1000,
'isActive' : False,
'Rating' : 89
},
{
'Account' : '222',
'RIC' : 'MSFT.OQ',
'Position' : 2000,
'isActive' : True,
'Rating' : np.nan
},
{
'Account' : '222',
'RIC' : 'IBM.N',
'Position' : 3000,
'isActive' : True,
'Rating' : 59
}
]
data = np.array(data)
How do I filter it, for example to return only the elements where isActive == True?
Unlike pandas, NumPy doesn't support syntax like data[data.isActive==True].
I am looking for NumPy syntax, not a solution that converts 'data' above to a plain Python list (and then uses a list comprehension) or to pandas.
Thanks
Answer 1
Score: 1
As suggested in the comments, you can use a record array. Still, NumPy might not be the right tool for this.
rec = np.rec.fromrecords(  # public alias; the np.core namespace is deprecated in NumPy 2.x
[tuple(d.values()) for d in data],
names=list(data[0].keys()),
formats=[np.dtype("<U16"), np.dtype("<U16"), int, bool, float],
)
>>> rec[rec.isActive == True]
rec.array([('111', 'AAPL.OQ', 100, True, nan),
('111', 'IBM.N', 300, True, inf),
('222', 'MSFT.OQ', 2000, True, nan),
('222', 'IBM.N', 3000, True, 59.)],
dtype=[('Account', '<U16'), ('RIC', '<U16'), ('Position', '<i8'), ('isActive', '?'), ('Rating', '<f8')])
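A plain structured array allows the same kind of field-based filtering, using the data['isActive'] indexing style. The following is only a minimal sketch: it assumes a shortened version of the question's data, and the names records, structured and active, as well as the 'U16' string width, are chosen here for illustration.

```python
import math
import numpy as np

# Shortened stand-in for the question's list of dicts.
records = [
    {'Account': '111', 'RIC': 'AAPL.OQ', 'Position': 100, 'isActive': True, 'Rating': math.nan},
    {'Account': '111', 'RIC': 'MSFT.OQ', 'Position': 200, 'isActive': False, 'Rating': 73},
    {'Account': '222', 'RIC': 'IBM.N', 'Position': 3000, 'isActive': True, 'Rating': 59},
]

# Field names and types are spelled out explicitly; 'U16' is an assumed width.
dtype = [('Account', 'U16'), ('RIC', 'U16'), ('Position', 'i8'),
         ('isActive', '?'), ('Rating', 'f8')]

# Pull values by field name so the result does not depend on dict key order.
names = [name for name, _ in dtype]
structured = np.array([tuple(d[n] for n in names) for d in records], dtype=dtype)

# A field lookup returns a regular boolean array, which works as a mask.
active = structured[structured['isActive']]
print(active['RIC'])
```

Unlike the object-dtype array in the question, this layout stores each field in typed memory, so the comparison and the indexing both run inside NumPy.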
Answer 2
Score: 0
Your array is object dtype:
In [239]: arr
Out[239]:
array([{'Account': '111', 'RIC': 'AAPL.OQ', 'Position': 100, 'isActive': True, 'Rating': nan},
{'Account': '111', 'RIC': 'MSFT.OQ', 'Position': 200, 'isActive': False, 'Rating': 73},
{'Account': '111', 'RIC': 'IBM.N', 'Position': 300, 'isActive': True, 'Rating': inf},
{'Account': '222', 'RIC': 'AAPL.OQ', 'Position': 1000, 'isActive': False, 'Rating': 89},
{'Account': '222', 'RIC': 'MSFT.OQ', 'Position': 2000, 'isActive': True, 'Rating': nan},
{'Account': '222', 'RIC': 'IBM.N', 'Position': 3000, 'isActive': True, 'Rating': 59}],
dtype=object)
In such an array, each element is a reference to a Python object, in this case a dict. That's the same data layout as a list, and access is basically the same (but a bit slower).
A list comprehension:
In [240]: [a['isActive'] for a in arr]
Out[240]: [True, False, True, False, True, True]
We can then get the indices of the True values with np.nonzero (or with another list comprehension); in IPython, _ holds the previous output:
In [241]: np.nonzero(_)
Out[241]: (array([0, 2, 4, 5]),)
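To go from those indices to the filtered elements themselves, the list of flags can be turned into a boolean mask and used to index the array. This is a sketch only, assuming a shortened stand-in for arr; the names mask, filtered and indices are illustrative.

```python
import numpy as np

# Shortened stand-in for the question's object-dtype array of dicts.
arr = np.array([
    {'RIC': 'AAPL.OQ', 'isActive': True},
    {'RIC': 'MSFT.OQ', 'isActive': False},
    {'RIC': 'IBM.N', 'isActive': True},
], dtype=object)

# Build a real boolean mask; the loop still runs in Python, one dict at a time.
mask = np.fromiter((d['isActive'] for d in arr), dtype=bool, count=len(arr))

filtered = arr[mask]            # boolean indexing keeps the matching dicts
indices = np.nonzero(mask)[0]   # positions of the active elements
print([d['RIC'] for d in filtered], indices)
```

The result is still an object-dtype array of dicts, so this keeps the data in NumPy as the question asked, but it gains no speed over a list comprehension.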
We can also construct a "vectorized" function to do this selection:
In [247]: np.frompyfunc(lambda x: x.__getitem__('isActive'),1,1)(arr)
Out[247]: array([True, False, True, False, True, True], dtype=object)
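Note that the frompyfunc result is itself object dtype, so it should be cast to bool before it is used as an index. A short sketch, again assuming a shortened stand-in for arr and the illustrative names get_active, flags and mask:

```python
import numpy as np

# Shortened stand-in for the question's object-dtype array of dicts.
arr = np.array([
    {'RIC': 'AAPL.OQ', 'isActive': True},
    {'RIC': 'MSFT.OQ', 'isActive': False},
    {'RIC': 'IBM.N', 'isActive': True},
], dtype=object)

# frompyfunc wraps a Python callable so it is applied element by element.
get_active = np.frompyfunc(lambda d: d['isActive'], 1, 1)

flags = get_active(arr)      # object-dtype array holding Python bools
mask = flags.astype(bool)    # cast to a true boolean array for indexing
filtered = arr[mask]
print([d['RIC'] for d in filtered])
```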
For small arrays, the list comprehension is faster; for large ones, the frompyfunc approach may have a minor scaling advantage.
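One way to check this on your own machine is a small timeit comparison. The sketch below is an assumption-laden illustration: make_array is a hypothetical helper that builds synthetic dicts, the sizes are arbitrary, and the timings will vary by machine and NumPy version.

```python
import timeit
import numpy as np

def make_array(n):
    """Build an object-dtype array of n small dicts (synthetic test data)."""
    out = np.empty(n, dtype=object)
    for i in range(n):
        out[i] = {'isActive': i % 2 == 0}
    return out

get_active = np.frompyfunc(lambda d: d['isActive'], 1, 1)

for n in (10, 100_000):
    arr = make_array(n)
    # Time both approaches on the same array; number=5 keeps the run short.
    t_list = timeit.timeit(lambda: [d['isActive'] for d in arr], number=5)
    t_ufunc = timeit.timeit(lambda: get_active(arr), number=5)
    print(f"n={n}: list comprehension {t_list:.4f}s, frompyfunc {t_ufunc:.4f}s")
```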
pandas constructs a separate array (a Series, or column) for each dict key, and allows selection by column name.
Getting the indices directly:
In [251]: [i for i,v in enumerate(arr) if v['isActive']]
Out[251]: [0, 2, 4, 5]