Restructure a 2D NumPy array based on matching column values
Question
I'm working with a data set with ~30 million entries. Each entry has a timestamp, an ID, a Description, and a value. The overall numpy array looks something like:
[
[Time 1, ID 1, D 1_1, V 1_1],
[Time 1, ID 1, D 1_2, V 1_2],
...
[Time 2, ID 1, D 2_1, V 2_1],
[Time 2, ID 1, D 2_2, V 2_2],
...
[Time X, ID 2, D X_1, V X_1],
...
]
I would like to compress the array into the following format:
[
[Time 1, ID 1, D 1_1, V 1_1, D 1_2, V 1_2, ...],
[Time 2, ID 1, D 2_1, V 2_1, D 2_2, V 2_2, ...],
[Time X, ID 2, D X_1, V X_1, ...],
...
]
Each sub-array within the original array will be the same length and order, but the number of sub-arrays with the same time stamp is variable, as is the number of sub-arrays with the same ID. Is there a way of restructuring the array within a reasonable amount of time? The time, id, and description columns would be strings and the value column would be a float if that matters.
Ideally, I could end up with an array of dictionaries, i.e.
[{'time': Time1, 'ID': ID1, 'D1_1': V1_1, 'D1_2': V1_2, ...}...]
However, given the time my attempt at using dictionaries would take (>100 hours), I'm assuming dictionaries take too long to construct.
Answer 1
Score: 0
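I think you can achieve this fairly easily with pandas:
import pandas as pd
# build a DataFrame from your array
df = pd.DataFrame(data=your_np_array, columns=['Time', 'ID', 'Description', 'Value'])
# group by time and ID, and aggregate the descriptions and values into lists
grouped = df.groupby(['Time', 'ID']).agg({'Description': list, 'Value': list})
# reset the index to get the time and ID as columns rather than indices
result = grouped.reset_index()
# join each list into a single comma-separated string (this does not create separate columns)
result['Description'] = result['Description'].apply(lambda x: ','.join(x))
result['Value'] = result['Value'].apply(lambda x: ','.join(map(str, x)))
# convert the result to a numpy array
my_new_numpy_array = result.to_numpy()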
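If you want the list of dictionaries mentioned in the question rather than joined strings, here is a minimal sketch that swaps the pandas groupby for a single pass over the rows in plain Python. It is not the approach from the answer above, just an alternative under a few assumptions: each row unpacks into exactly four fields, a description appears at most once per (time, ID) pair, and no description is literally named 'time' or 'ID'. The names group_rows, flatten and sample are made up for illustration.

```python
def group_rows(rows):
    """Collapse rows sharing (time, ID) into one dict per group, in one pass."""
    groups = {}  # (time, ID) -> record; dicts preserve insertion order
    for time, id_, description, value in rows:
        record = groups.get((time, id_))
        if record is None:
            record = groups[(time, id_)] = {"time": time, "ID": id_}
        # Values arrive as text when the source array has a string dtype.
        record[description] = float(value)
    return list(groups.values())


def flatten(record):
    """Turn one record into [time, ID, D_1, V_1, D_2, V_2, ...]."""
    row = [record["time"], record["ID"]]
    for key, value in record.items():
        if key not in ("time", "ID"):
            row.extend([key, value])
    return row


# Hypothetical sample standing in for your_np_array.tolist()
sample = [
    ["t1", "id1", "a", "1.0"],
    ["t1", "id1", "b", "2.0"],
    ["t2", "id1", "a", "3.0"],
    ["t3", "id2", "c", "4.5"],
]

records = group_rows(sample)
rows = [flatten(r) for r in records]
```

Each row is touched once and every dictionary lookup is a hash lookup, so the work should grow roughly linearly with the number of rows. Note that the flattened rows have different lengths, so they fit a list of lists better than a rectangular NumPy array.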