Restructure a 2D NumPy array based on matching column values

Question

I'm working with a data set with ~30 million entries. Each entry has a timestamp, an ID, a Description, and a value. The overall numpy array looks something like:

[
[Time 1, ID 1, D 1_1, V 1_1],
[Time 1, ID 1, D 1_2, V 1_2],
...
[Time 2, ID 1, D 2_1, V 2_1],
[Time 2, ID 1, D 2_2, V 2_2],
...
[Time X, ID 2, D X_1, V X_1],
...
]

I would like to compress the array into the following format:

[
[Time 1, ID 1, D 1_1, V 1_1, D 1_2, V 1_2, ...],
[Time 2, ID 1, D 2_1, V 2_1, D 2_2, V 2_2, ...],
[Time X, ID 2, D X_1, V X_1, ...],
...]

Each sub-array within the original array will be the same length and order, but the number of sub-arrays with the same time stamp is variable, as is the number of sub-arrays with the same ID. Is there a way of restructuring the array within a reasonable amount of time? The time, id, and description columns would be strings and the value column would be a float if that matters.

Ideally, I could end up with an array of dictionaries, i.e.

[{'time': Time1, 'ID': ID1, 'D1_1': V1_1, 'D1_2': V1_2, ...}...]

However, given the time my attempt at using dictionaries would take (>100 hours), I'm assuming dictionaries take too long to construct.

Answer 1

Score: 0

I think you can easily achieve this with pandas:

    import pandas as pd

    # build a DataFrame from your original (N, 4) array
    df = pd.DataFrame(data=your_np_array, columns=['Time', 'ID', 'Description', 'Value'])
    # group by time and ID, and aggregate the descriptions and values into lists
    grouped = df.groupby(['Time', 'ID']).agg({'Description': list, 'Value': list})
    # reset the index to get the time and ID as columns rather than indices
    result = grouped.reset_index()
    # join each list into a single comma-separated string
    result['Description'] = result['Description'].apply(lambda x: ','.join(x))
    result['Value'] = result['Value'].apply(lambda x: ','.join(map(str, x)))
    # convert the result to a numpy array
    my_new_numpy_array = result.to_numpy()
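
The joined strings above are not quite the list of dictionaries the question asks for. As a minimal sketch (not part of the original answer), a single pass over the rows with a plain dict keyed by (time, ID) can build those records directly. The function name `rows_to_records` and the sample data are made up for illustration. The sketch assumes each description appears at most once per (time, ID) pair; a repeated description would overwrite the earlier value.

```python
import numpy as np


def rows_to_records(rows):
    """Group (time, id, description, value) rows into one dict per (time, id)."""
    records = {}  # (time, id) -> record dict; dicts keep insertion order
    for time, id_, desc, value in rows:
        key = (time, id_)
        record = records.get(key)
        if record is None:
            # first row seen for this (time, id): start a new record
            record = records[key] = {"time": time, "ID": id_}
        # a repeated description overwrites the earlier value
        record[desc] = float(value)
    return list(records.values())


# Made-up sample in the layout from the question
# (object dtype, since the columns mix strings and floats).
sample = np.array(
    [
        ["t1", "id1", "d1", 1.0],
        ["t1", "id1", "d2", 2.0],
        ["t2", "id1", "d1", 3.0],
        ["t2", "id2", "d3", 4.0],
    ],
    dtype=object,
)

records = rows_to_records(sample)
print(records)
```

Each row is visited once and costs one dict lookup, so the work grows linearly with the number of rows, and rows that share a (time, ID) pair do not need to be adjacent in the input.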

huangapple
  • Posted on May 11, 2023, 04:49:56
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76222461.html