Restructure a 2D NumPy array based on matching column values

Question

I'm working with a data set with ~30 million entries. Each entry has a timestamp, an ID, a Description, and a value. The overall numpy array looks something like:

[
[Time 1, ID 1, D 1_1, V 1_1],
[Time 1, ID 1, D 1_2, V 1_2],
...
[Time 2, ID 1, D 2_1, V 2_1],
[Time 2, ID 1, D 2_2, V 2_2],
...
[Time X, ID 2, D X_1, V X_1],
...
]

I would like to compress the array into the following format:

[
[Time 1, ID 1, D 1_1, V 1_1, D 1_2, V 1_2, ...],
[Time 2, ID 1, D 2_1, V 2_1, D 2_2, V 2_2, ...],
[Time X, ID 2, D X_1, V X_1, ...],
...
]

Each sub-array within the original array will be the same length and order, but the number of sub-arrays with the same time stamp is variable, as is the number of sub-arrays with the same ID. Is there a way of restructuring the array within a reasonable amount of time? The time, id, and description columns would be strings and the value column would be a float if that matters.

Ideally, I could end up with an array of dictionaries, i.e.

[{'time': Time1, 'ID': ID1, 'D1_1': V1_1, 'D1_2': V1_2, ...}...]

However, given the time my attempt at using dictionaries would take (>100 hours), I'm assuming dictionaries take too long to construct.

Answer 1

Score: 0

I think you can easily achieve that with pandas:

import pandas as pd

# your dataframe
df = pd.DataFrame(data=your_np_array, columns=['Time', 'ID', 'Description', 'Value'])
# groupby time and ID, and aggregate the descriptions and values into lists
grouped = df.groupby(['Time', 'ID']).agg({'Description': list, 'Value': list})
# reset the index to get the time and ID as columns rather than indices
result = grouped.reset_index()
# join each group's lists into one comma-separated string per row
# (note: this yields two string columns, not one column per description)
result['Description'] = result['Description'].apply(lambda x: ','.join(x))
result['Value'] = result['Value'].apply(lambda x: ','.join(map(str, x)))

# convert the result to a numpy array
my_new_numpy_array = result.to_numpy()
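
For the list-of-dictionaries format the question asks for, a single pass over the rows may already be fast enough, since the work is linear in the number of rows. The sketch below is only an illustration, not part of the answer above: the function name `rows_to_records` and the sample data are hypothetical, and it assumes each row unpacks into exactly four fields, that values can be converted with `float`, and that no description is literally named 'time' or 'ID' (such a description would overwrite that key).

```python
def rows_to_records(rows):
    """Group (time, id, description, value) rows into one dict per (time, id)."""
    records = {}  # (time, id) -> record dict; dicts keep insertion order
    for time, id_, desc, value in rows:
        key = (time, id_)
        record = records.get(key)
        if record is None:
            # first row seen for this (time, id) pair: start a new record
            record = records[key] = {"time": time, "ID": id_}
        record[desc] = float(value)  # values may arrive as strings
    return list(records.values())


# Hypothetical sample data in the same column order as the question
sample = [
    ("t1", "id1", "a", "1.0"),
    ("t1", "id1", "b", "2.5"),
    ("t2", "id1", "a", "3.0"),
    ("t3", "id2", "c", "4.0"),
]
records = rows_to_records(sample)
print(records[0])  # {'time': 't1', 'ID': 'id1', 'a': 1.0, 'b': 2.5}
```

The same function should accept the rows of a 2D NumPy string array directly, because iterating over such an array yields one four-element row at a time. Rows that share a (time, ID) pair do not need to be adjacent, as the lookup is keyed on the pair rather than on position.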

Posted by huangapple on 2023-05-11 04:49:56
When reposting, please keep the link to this article: https://go.coder-hub.com/76222461.html