英文:
Speeding up loop for reorganizing pandas DataFrame into numpy array using slicing throws exception - what am I missing?
问题
你可以使用以下代码来优化这个过程,避免使用显式的循环:
import pandas as pd
import numpy as np
# 原始数据
raw_data = pd.DataFrame({
'date_idx': [0, 1, 2, 0, 1, 2],
'element_idx': [0, 0, 0, 1, 1, 1],
'a': [10, 20, 30, 40, 50, 60],
'b': [11, 21, 31, 41, 51, 61],
'c': [12, 22, 32, 42, 52, 62],
})
# 定义列名
inputs = ['a', 'b', 'c']
# 获取唯一日期和元素索引值
unique_dates = raw_data['date_idx'].unique()
unique_elements = raw_data['element_idx'].unique()
# 创建一个新的numpy数组
data = np.zeros(shape=(len(unique_dates), len(inputs), len(unique_elements)), dtype=np.float64)
# 使用Pandas的pivot方法来重新排列数据
pivot_data = raw_data.pivot(index='date_idx', columns=['element_idx'], values=inputs)
# 转换为NumPy数组
data = pivot_data.to_numpy()
print(data)
这段代码使用了Pandas的pivot
方法,将原始数据重新排列成你所需的形式,然后将其转换为NumPy数组,避免了显式的循环,提高了运行速度。
英文:
I have a pandas DataFrame like so:
raw_data = DataFrame({
'date_idx': [0, 1, 2, 0, 1, 2],
'element_idx': [0, 0, 0, 1, 1, 1],
'a': [10, 20, 30, 40, 50, 60],
'b': [11, 21, 31, 41, 51, 61],
'c': [12, 22, 32, 42, 52, 62],
})
I call the columns other than date_idx
and element_idx
"inputs". I want to reorganize it into a 3d numpy array by date_idx
-> input_idx
-> element_idx
, so that the result is like so:
[[[10. 40.]
[11. 41.]
[12. 42.]]
[[20. 50.]
[21. 51.]
[22. 52.]]
[[30. 60.]
[31. 61.]
[32. 62.]]]
I did it with two for loops, and it works well:
date_idx = [0, 1, 2, 0, 1, 2]
element_idx = [0, 0, 0, 1, 1, 1]
raw_data = DataFrame({
'date_idx': date_idx,
'element_idx': element_idx,
'a': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
'b': [11.0, 21.0, 31.0, 41.0, 51.0, 61.0],
'c': [12.0, 22.0, 32.0, 42.0, 52.0, 62.0],
})
inputs = ['a', 'b', 'c']
unique_dates = set(date_idx)
unique_elements = set(element_idx)
data = np.zeros(shape=(len(unique_dates), len(inputs), len(unique_elements)), dtype=np.float64)
for i in range(len(raw_data)):
row = raw_data.iloc[i]
date_idx = int(row['date_idx'])
element_idx = int(row['element_idx'])
for input_idx in range(len(inputs)):
data[date_idx][input_idx][element_idx] = float(row[inputs[input_idx]])
print(data)
However, this is very slow. I have millions of entries for the date_idx
array, and dozens for both inputs
and element_idx
. It takes 7 hours on my machine for this to complete with my real data set.
I have a feeling this could be done with slicing, no loops, but my attempts always fail - I'm missing something.
For example, I tried to eliminate the inner loop with:
for i in range(len(raw_data)):
row = raw_data.iloc[i]
date_idx = int(row['date_idx'])
element_idx = int(row['element_idx'])
data[date_idx][:][element_idx] = list(dict(row[inputs]).values())
And it fails with:
Traceback (most recent call last):
File "/home/stark/Work/mmr6/test2.py", line 84, in <module>
data[date_idx][:][element_idx] = list(dict(row[inputs]).values())
~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
ValueError: could not broadcast input array from shape (3,) into shape (2,)
My question is, can slicing and / or fast technique be used to reorganize this DataFrame in that fashion on the plain numpy array, or do I really need the loops here?
答案1
得分: 0
我认为您正在寻找将数据框架进行枢轴操作,然后将其转换为NumPy数组的方法:
```py
num_unique_elements = raw_data['element_idx'].nunique()
num_unique_inputs = 3 # a, b, c
df = pd.pivot(raw_data, index='date_idx', columns='element_idx')
df = df.stack(level=0)
print(df.to_numpy().reshape(-1, num_unique_inputs, num_unique_elements))
打印输出:
[[[10 40]
[11 41]
[12 42]]
[[20 50]
[21 51]
[22 52]]
[[30 60]
[31 61]
[32 62]]]
操作步骤:
df = pd.pivot(raw_data, index='date_idx', columns='element_idx')
print(df)
a b c
element_idx 0 1 0 1 0 1
date_idx
0 10 40 11 41 12 42
1 20 50 21 51 22 52
2 30 60 31 61 32 62
然后使用 .stack()
进行重塑:
df = df.stack(level=0)
print(df)
element_idx 0 1
date_idx
0 a 10 40
b 11 41
c 12 42
1 a 20 50
b 21 51
c 22 52
2 a 30 60
b 31 61
c 32 62
最后将其转换为NumPy数组:
print(df.to_numpy().reshape(-1, num_unique_inputs, num_unique_elements))
[[[10 40]
[11 41]
[12 42]]
[[20 50]
[21 51]
[22 52]]
[[30 60]
[31 61]
[32 62]]]
<details>
<summary>英文:</summary>
I think you're searching for pivoting the dataframe and then convert it to numpy array:
```py
num_unique_elements = raw_data['element_idx'].nunique()
num_unique_inputs = 3 # a, b, c
df = pd.pivot(raw_data, index='date_idx', columns='element_idx')
df = df.stack(level=0)
print(df.to_numpy().reshape(-1, num_unique_inputs, num_unique_elements))
Prints:
[[[10 40]
[11 41]
[12 42]]
[[20 50]
[21 51]
[22 52]]
[[30 60]
[31 61]
[32 62]]]
Steps:
df = pd.pivot(raw_data, index='date_idx', columns='element_idx')
print(df)
a b c
element_idx 0 1 0 1 0 1
date_idx
0 10 40 11 41 12 42
1 20 50 21 51 22 52
2 30 60 31 61 32 62
Then reshape it using .stack()
df = df.stack(level=0)
print(df)
element_idx 0 1
date_idx
0 a 10 40
b 11 41
c 12 42
1 a 20 50
b 21 51
c 22 52
2 a 30 60
b 31 61
c 32 62
Then convert it to numpy array:
print(df.to_numpy().reshape(-1, num_unique_inputs, num_unique_elements))
[[[10 40]
[11 41]
[12 42]]
[[20 50]
[21 51]
[22 52]]
[[30 60]
[31 61]
[32 62]]]
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论