May 11, 2023, 04:49:56

restructure a 2D numpy array based on matching column values

Question

I'm working with a data set with ~30 million entries. Each entry has a timestamp, an ID, a Description, and a value. The overall numpy array looks something like:

[
[Time 1, ID 1, D 1_1, V 1_1],
[Time 1, ID 1, D 1_2, V 1_2],
...
[Time 2, ID 1, D 2_1, V 2_1],
[Time 2, ID 1, D 2_2, V 2_2],
...
[Time X, ID 2, D X_1, V X_1],
...
]

I would like to compress the array into the following format:

[
[Time 1, ID 1, D 1_1, V 1_1, D 1_2, V 1_2, ...],
[Time 2, ID 1, D 2_1, V 2_1, D 2_2, V 2_2, ...],
[Time X, ID 2, D X_1, V X_1, ...],
...
]

Each sub-array within the original array will be the same length and order, but the number of sub-arrays with the same time stamp is variable, as is the number of sub-arrays with the same ID. Is there a way of restructuring the array within a reasonable amount of time? The time, id, and description columns would be strings and the value column would be a float if that matters.

Ideally, I could end up with an array of dictionaries, i.e.

[{'time': Time1, 'ID': ID1, 'D1_1': V1_1, 'D1_2': V1_2, ...}...]

However, given the time my attempt at using dictionaries would take (>100 hours), I'm assuming dictionaries take too long to construct.


Answer 1

Score: 0

I think you can achieve this easily with pandas:

import pandas as pd
# build a DataFrame from your array (your_np_array is the 2D array from the question)
df = pd.DataFrame(data=your_np_array, columns=['Time', 'ID', 'Description', 'Value'])
# group by time and ID, and aggregate the descriptions and values into lists
grouped = df.groupby(['Time', 'ID']).agg({'Description': list, 'Value': list})
# reset the index to get the time and ID as columns rather than indices
result = grouped.reset_index()
# join each list into one comma-separated string per group (this does not create separate columns)
result['Description'] = result['Description'].apply(','.join)
result['Value'] = result['Value'].apply(lambda x: ','.join(map(str, x)))
# convert the result to a numpy array
my_new_numpy_array = result.to_numpy()
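
The question's preferred output is a list of dictionaries, one per (time, ID) pair. As a rough sketch of how that could be built in a single pass, here is a plain-Python version that does one dictionary lookup per row. The function names `group_rows` and `flatten` and the sample data are made up for illustration. It assumes the rows are available as a list of lists (for example via `your_np_array.tolist()`), that values parse as floats, and that no description is literally named 'time' or 'ID'.

```python
def group_rows(rows):
    """Collapse [time, id, description, value] rows into one dict per (time, id)."""
    grouped = {}  # (time, id) -> record; dicts keep insertion order in Python 3.7+
    for time, id_, desc, value in rows:
        key = (time, id_)
        record = grouped.get(key)
        if record is None:
            # first row seen for this (time, id) pair: start a new record
            record = grouped[key] = {"time": time, "ID": id_}
        record[desc] = float(value)
    return list(grouped.values())


def flatten(record):
    """Turn one record into [time, id, desc1, value1, desc2, value2, ...]."""
    row = [record["time"], record["ID"]]
    for desc, value in record.items():
        if desc not in ("time", "ID"):
            row.extend((desc, value))
    return row


# Hypothetical sample data in the layout described in the question.
sample = [
    ["t1", "id1", "a", "1.0"],
    ["t1", "id1", "b", "2.5"],
    ["t2", "id1", "a", "3.0"],
    ["t3", "id2", "c", "4.0"],
]

records = group_rows(sample)
rows_out = [flatten(r) for r in records]
```

Because each row costs one hash lookup and one assignment, the work grows linearly with the number of rows. Unlike the pandas groupby above, which sorts the group keys by default, this keeps groups in order of first appearance and does not require rows with the same time and ID to be adjacent.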
