June 5, 2023, 01:12:09

Speeding up loop for reorganizing pandas DataFrame into numpy array using slicing throws exception - what am I missing?

Question

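You can use the following code to speed this up and avoid the explicit loops:

import pandas as pd
import numpy as np
# Raw data
raw_data = pd.DataFrame({
    'date_idx': [0, 1, 2, 0, 1, 2],
    'element_idx': [0, 0, 0, 1, 1, 1],
    'a': [10, 20, 30, 40, 50, 60],
    'b': [11, 21, 31, 41, 51, 61],
    'c': [12, 22, 32, 42, 52, 62],
})
# Input column names
inputs = ['a', 'b', 'c']
# Unique date and element index values
unique_dates = raw_data['date_idx'].unique()
unique_elements = raw_data['element_idx'].unique()
# Pivot so that rows are dates and columns are (input, element) pairs
pivot_data = raw_data.pivot(index='date_idx', columns='element_idx', values=inputs)
# Convert to a NumPy array; this is still 2D, with shape (dates, inputs * elements)
flat = pivot_data.to_numpy(dtype=np.float64)
# Split the column axis into (input, element) to get the 3D layout
data = flat.reshape(len(unique_dates), len(inputs), len(unique_elements))
print(data)

This code uses pandas' pivot method to rearrange the raw data, converts the result to a NumPy array, and reshapes it into the date_idx -> input_idx -> element_idx layout. It avoids explicit Python loops, so it should run much faster. Note that pivot raises an error if a (date_idx, element_idx) pair appears more than once, and that missing pairs come out as NaN.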
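A quick way to gain confidence in a vectorized rewrite like this is to compare it against a loop-based version on a small sample. The sketch below assumes the six-row example frame, that `date_idx` and `element_idx` are 0-based contiguous positions, and that every (date_idx, element_idx) pair occurs exactly once. The helper names `reshape_with_loops` and `reshape_with_pivot` are made up for illustration.

```python
import numpy as np
import pandas as pd

raw_data = pd.DataFrame({
    'date_idx': [0, 1, 2, 0, 1, 2],
    'element_idx': [0, 0, 0, 1, 1, 1],
    'a': [10, 20, 30, 40, 50, 60],
    'b': [11, 21, 31, 41, 51, 61],
    'c': [12, 22, 32, 42, 52, 62],
})
inputs = ['a', 'b', 'c']


def reshape_with_loops(df, inputs):
    """Reference implementation: fill the 3D array one cell at a time."""
    n_dates = df['date_idx'].nunique()
    n_elements = df['element_idx'].nunique()
    out = np.zeros((n_dates, len(inputs), n_elements), dtype=np.float64)
    for row in df.itertuples(index=False):
        for k, name in enumerate(inputs):
            out[row.date_idx, k, row.element_idx] = getattr(row, name)
    return out


def reshape_with_pivot(df, inputs):
    """Vectorized version: pivot to (dates, inputs * elements), then reshape."""
    wide = df.pivot(index='date_idx', columns='element_idx', values=inputs)
    n_dates = wide.shape[0]
    n_elements = wide.shape[1] // len(inputs)
    flat = wide.to_numpy(dtype=np.float64)
    return flat.reshape(n_dates, len(inputs), n_elements)


loop_result = reshape_with_loops(raw_data, inputs)
pivot_result = reshape_with_pivot(raw_data, inputs)
print(pivot_result.shape)
print(np.array_equal(loop_result, pivot_result))
```

If the two results ever differ on real data, duplicated or missing (date_idx, element_idx) pairs are a likely cause worth checking first.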

English:

I have a pandas DataFrame like so:

raw_data = DataFrame({
    'date_idx': [0, 1, 2, 0, 1, 2],
    'element_idx': [0, 0, 0, 1, 1, 1],
    'a': [10, 20, 30, 40, 50, 60],
    'b': [11, 21, 31, 41, 51, 61],
    'c': [12, 22, 32, 42, 52, 62],
})

I call the columns other than date_idx and element_idx "inputs". I want to reorganize it into a 3d numpy array by date_idx -> input_idx -> element_idx, so that the result is like so:

[[[10. 40.]
  [11. 41.]
  [12. 42.]]
 [[20. 50.]
  [21. 51.]
  [22. 52.]]
 [[30. 60.]
  [31. 61.]
  [32. 62.]]]

I did it with two for loops, and it works well:

import numpy as np
from pandas import DataFrame

date_idx = [0, 1, 2, 0, 1, 2]
element_idx = [0, 0, 0, 1, 1, 1]
raw_data = DataFrame({
    'date_idx': date_idx,
    'element_idx': element_idx,
    'a': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    'b': [11.0, 21.0, 31.0, 41.0, 51.0, 61.0],
    'c': [12.0, 22.0, 32.0, 42.0, 52.0, 62.0],
})
inputs = ['a', 'b', 'c']
unique_dates = set(date_idx)
unique_elements = set(element_idx)
# Target layout: data[date_idx][input_idx][element_idx]
data = np.zeros(shape=(len(unique_dates), len(inputs), len(unique_elements)), dtype=np.float64)
for i in range(len(raw_data)):
    row = raw_data.iloc[i]
    date_idx = int(row['date_idx'])
    element_idx = int(row['element_idx'])
    for input_idx in range(len(inputs)):
        data[date_idx][input_idx][element_idx] = float(row[inputs[input_idx]])
print(data)

However, this is very slow. I have millions of entries for the date_idx array, and dozens for both inputs and element_idx. It takes 7 hours on my machine for this to complete with my real data set.

I have a feeling this could be done with slicing, no loops, but my attempts always fail - I'm missing something.

For example, I tried to eliminate the inner loop with:

for i in range(len(raw_data)):
    row = raw_data.iloc[i]
    date_idx = int(row['date_idx'])
    element_idx = int(row['element_idx'])
    data[date_idx][:][element_idx] = list(dict(row[inputs]).values())

And it fails with:

Traceback (most recent call last):
  File "/home/stark/Work/mmr6/test2.py", line 84, in <module>
    data[date_idx][:][element_idx] = list(dict(row[inputs]).values())
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
ValueError: could not broadcast input array from shape (3,) into shape (2,)

My question is: can slicing and/or some other fast technique be used to reorganize this DataFrame into a plain numpy array in that fashion, or do I really need the loops here?
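The ValueError in the traceback can be reproduced without pandas. As far as I can tell, `data[date_idx][:]` is just a full view of the 2D block `data[date_idx]`, so the following `[element_idx]` picks a row of that block (one input across all elements, shape `(2,)`) rather than a column (all inputs for one element, shape `(3,)`). A minimal sketch, assuming the 3x3x2 example shape and made-up values for a single row, with a single tuple index as one possible fix:

```python
import numpy as np

data = np.zeros((3, 3, 2), dtype=np.float64)  # (dates, inputs, elements)
date_idx, element_idx = 1, 0
values = [20.0, 21.0, 22.0]  # inputs a, b, c for one (date, element) pair

# Chained indexing: [:] changes nothing, so this is data[date_idx][element_idx].
chained_target = data[date_idx][:][element_idx]
print(chained_target.shape)

error_message = None
try:
    data[date_idx][:][element_idx] = values
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# One tuple index applies the slice to the input axis, as intended.
tuple_target = data[date_idx, :, element_idx]
print(tuple_target.shape)
data[date_idx, :, element_idx] = values
print(data[date_idx])
```

This only removes the inner loop. The per-row `iloc` access in the outer loop is probably still the dominant cost on millions of rows.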

Answer 1

Score: 0

I think you're looking for a way to pivot the DataFrame and then convert it to a NumPy array:

```py
num_unique_elements = raw_data['element_idx'].nunique()
num_unique_inputs = 3  # a, b, c
df = pd.pivot(raw_data, index='date_idx', columns='element_idx')
df = df.stack(level=0)
print(df.to_numpy().reshape(-1, num_unique_inputs, num_unique_elements))
```

Prints:

[[[10 40]
  [11 41]
  [12 42]]
 [[20 50]
  [21 51]
  [22 52]]
 [[30 60]
  [31 61]
  [32 62]]]

Steps:

First pivot the DataFrame:

df = pd.pivot(raw_data, index='date_idx', columns='element_idx')
print(df)
              a       b       c    
element_idx   0   1   0   1   0   1
date_idx                           
0            10  40  11  41  12  42
1            20  50  21  51  22  52
2            30  60  31  61  32  62

Then reshape it using .stack():

df = df.stack(level=0)
print(df)
element_idx   0   1
date_idx           
0        a   10  40
         b   11  41
         c   12  42
1        a   20  50
         b   21  51
         c   22  52
2        a   30  60
         b   31  61
         c   32  62

Finally, convert it to a NumPy array:

print(df.to_numpy().reshape(-1, num_unique_inputs, num_unique_elements))
[[[10 40]
  [11 41]
  [12 42]]
 [[20 50]
  [21 51]
  [22 52]]
 [[30 60]
  [31 61]
  [32 62]]]
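
As an alternative to pivoting, the same layout can probably be filled with a single NumPy advanced-indexing assignment. This is a sketch rather than part of the answer itself. It assumes `date_idx` and `element_idx` are already 0-based contiguous integer positions, as in the question (otherwise something like `pd.factorize` could map labels to positions first), and that each (date_idx, element_idx) pair appears only once.

```python
import numpy as np
import pandas as pd

raw_data = pd.DataFrame({
    'date_idx': [0, 1, 2, 0, 1, 2],
    'element_idx': [0, 0, 0, 1, 1, 1],
    'a': [10, 20, 30, 40, 50, 60],
    'b': [11, 21, 31, 41, 51, 61],
    'c': [12, 22, 32, 42, 52, 62],
})
inputs = ['a', 'b', 'c']

dates = raw_data['date_idx'].to_numpy()
elements = raw_data['element_idx'].to_numpy()
# One row of input values per DataFrame row: shape (n_rows, n_inputs).
values = raw_data[inputs].to_numpy(dtype=np.float64)

# Target layout: data[date_idx, input_idx, element_idx].
data = np.zeros((dates.max() + 1, len(inputs), elements.max() + 1), dtype=np.float64)

# Index arrays on axes 0 and 2 with a slice between them select an
# (n_rows, n_inputs) region, which matches the shape of `values`.
data[dates, :, elements] = values
print(data)
```

One difference from the pivot route: (date_idx, element_idx) pairs that are absent from the DataFrame stay at 0.0 here, whereas pivot leaves NaN in those positions.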


