Speeding up loop for reorganizing pandas DataFrame into numpy array using slicing throws exception – what am I missing?


Question

You can use the following code to speed this up and avoid the explicit loops:

    import numpy as np
    import pandas as pd

    # Raw data
    raw_data = pd.DataFrame({
        'date_idx': [0, 1, 2, 0, 1, 2],
        'element_idx': [0, 0, 0, 1, 1, 1],
        'a': [10, 20, 30, 40, 50, 60],
        'b': [11, 21, 31, 41, 51, 61],
        'c': [12, 22, 32, 42, 52, 62],
    })

    # Input column names
    inputs = ['a', 'b', 'c']

    # Unique date and element index values
    unique_dates = raw_data['date_idx'].unique()
    unique_elements = raw_data['element_idx'].unique()

    # Pivot: one row per date, one column per (input, element) pair
    pivot_data = raw_data.pivot(index='date_idx', columns='element_idx', values=inputs)

    # Convert to NumPy and split the columns into (input, element) axes;
    # without the reshape the result is 2D with shape (3, 6)
    data = pivot_data.to_numpy(dtype=np.float64).reshape(
        len(unique_dates), len(inputs), len(unique_elements))
    print(data)

This code uses the pandas pivot method to rearrange the raw data into the required layout and then converts it to a NumPy array, which avoids the explicit loops and runs faster.

Original question (English):

I have a pandas DataFrame like so:

    from pandas import DataFrame

    raw_data = DataFrame({
        'date_idx': [0, 1, 2, 0, 1, 2],
        'element_idx': [0, 0, 0, 1, 1, 1],
        'a': [10, 20, 30, 40, 50, 60],
        'b': [11, 21, 31, 41, 51, 61],
        'c': [12, 22, 32, 42, 52, 62],
    })

I call the columns other than date_idx and element_idx "inputs". I want to reorganize it into a 3d numpy array by date_idx -> input_idx -> element_idx, so that the result is like so:

    [[[10. 40.]
      [11. 41.]
      [12. 42.]]

     [[20. 50.]
      [21. 51.]
      [22. 52.]]

     [[30. 60.]
      [31. 61.]
      [32. 62.]]]

I did it with two for loops, and it works well:

    import numpy as np
    from pandas import DataFrame

    date_idx = [0, 1, 2, 0, 1, 2]
    element_idx = [0, 0, 0, 1, 1, 1]
    raw_data = DataFrame({
        'date_idx': date_idx,
        'element_idx': element_idx,
        'a': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
        'b': [11.0, 21.0, 31.0, 41.0, 51.0, 61.0],
        'c': [12.0, 22.0, 32.0, 42.0, 52.0, 62.0],
    })
    inputs = ['a', 'b', 'c']
    unique_dates = set(date_idx)
    unique_elements = set(element_idx)
    data = np.zeros(shape=(len(unique_dates), len(inputs), len(unique_elements)), dtype=np.float64)

    # Copy one value at a time: one pass per row, one per input column
    for i in range(len(raw_data)):
        row = raw_data.iloc[i]
        date_idx = int(row['date_idx'])
        element_idx = int(row['element_idx'])
        for input_idx in range(len(inputs)):
            data[date_idx][input_idx][element_idx] = float(row[inputs[input_idx]])
    print(data)

However, this is very slow. I have millions of entries for the date_idx array, and dozens for both inputs and element_idx. It takes 7 hours on my machine for this to complete with my real data set.

I have a feeling this could be done with slicing, no loops, but my attempts always fail - I'm missing something.

For example, I tried to eliminate the inner loop with:

    for i in range(len(raw_data)):
        row = raw_data.iloc[i]
        date_idx = int(row['date_idx'])
        element_idx = int(row['element_idx'])
        data[date_idx][:][element_idx] = list(dict(row[inputs]).values())

And it fails with:

    Traceback (most recent call last):
      File "/home/stark/Work/mmr6/test2.py", line 84, in <module>
        data[date_idx][:][element_idx] = list(dict(row[inputs]).values())
        ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
    ValueError: could not broadcast input array from shape (3,) into shape (2,)
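
The shapes in that message follow from how chained indexing is evaluated: `data[date_idx][:]` is just another view of `data[date_idx]`, so the trailing `[element_idx]` picks a row along the input axis (length 2 here) rather than fixing the element axis. A minimal sketch on a zero array with the same shape as in the question; the names `d` and `e` are illustrative, not from the original code:

```python
import numpy as np

data = np.zeros((3, 3, 2))  # axes: (dates, inputs, elements)
d, e = 0, 1                 # illustrative date and element indices

# Chained indexing: [:] is a no-op, so this selects input row `e`
# of the (inputs, elements) block, which has one slot per element.
chained_shape = data[d][:][e].shape   # (2,)

# A single multi-axis index keeps the whole input axis and fixes the element.
tuple_shape = data[d, :, e].shape     # (3,)

# Three input values now fit the length-3 target.
data[d, :, e] = [40.0, 41.0, 42.0]
print(chained_shape, tuple_shape)
print(data[d])
```

Writing the index as one tuple, `data[d, :, e]`, is what lets the slice apply to the middle axis.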

My question is: can slicing and/or some other fast technique be used on the plain NumPy array to reorganize this DataFrame in that fashion, or do I really need the loops here?
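
One way this could be done on the plain NumPy array, sketched under the same assumptions the loop already makes (`date_idx` and `element_idx` are zero-based integers, and each date/element pair occurs at most once): pass the two index columns as integer arrays and let NumPy assign all rows in one statement. The names `d` and `e` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

raw_data = pd.DataFrame({
    "date_idx": [0, 1, 2, 0, 1, 2],
    "element_idx": [0, 0, 0, 1, 1, 1],
    "a": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "b": [11.0, 21.0, 31.0, 41.0, 51.0, 61.0],
    "c": [12.0, 22.0, 32.0, 42.0, 52.0, 62.0],
})
inputs = ["a", "b", "c"]

# Index columns as plain integer arrays, one entry per DataFrame row.
d = raw_data["date_idx"].to_numpy()
e = raw_data["element_idx"].to_numpy()

# Assumption: indices are zero-based, so max + 1 gives each axis length.
data = np.zeros((d.max() + 1, len(inputs), e.max() + 1), dtype=np.float64)

# The two index arrays are separated by a slice, so the indexed region has
# shape (n_rows, n_inputs): one row of input values per DataFrame row.
data[d, :, e] = raw_data[inputs].to_numpy()
print(data)
```

The assignment replaces both loops; the right-hand side is simply the input columns in row order, so no per-row Python work remains.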

Answer 1

Score: 0

I think you're looking to pivot the DataFrame and then convert it to a NumPy array:

```py
import pandas as pd

num_unique_elements = raw_data['element_idx'].nunique()
num_unique_inputs = 3  # a, b, c

df = pd.pivot(raw_data, index='date_idx', columns='element_idx')
df = df.stack(level=0)
print(df.to_numpy().reshape(-1, num_unique_inputs, num_unique_elements))
```

Prints:

    [[[10 40]
      [11 41]
      [12 42]]

     [[20 50]
      [21 51]
      [22 52]]

     [[30 60]
      [31 61]
      [32 62]]]

Steps:

    df = pd.pivot(raw_data, index='date_idx', columns='element_idx')
    print(df)

                  a       b       c
    element_idx   0   1   0   1   0   1
    date_idx
    0            10  40  11  41  12  42
    1            20  50  21  51  22  52
    2            30  60  31  61  32  62

Then reshape it using `.stack()`:

    df = df.stack(level=0)
    print(df)

    element_idx   0   1
    date_idx
    0        a   10  40
             b   11  41
             c   12  42
    1        a   20  50
             b   21  51
             c   22  52
    2        a   30  60
             b   31  61
             c   32  62

Then convert it to numpy array:

    print(df.to_numpy().reshape(-1, num_unique_inputs, num_unique_elements))

    [[[10 40]
      [11 41]
      [12 42]]

     [[20 50]
      [21 51]
      [22 52]]

     [[30 60]
      [31 61]
      [32 62]]]
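
A possible variant of the same pivot idea, offered as a sketch rather than as the answerer's code: it swaps `.stack()` plus `reshape` for `np.stack`, so the order of the input axis is taken explicitly from an `inputs` list instead of being hard-coded as a count, and it requests `float64` to match the dtype asked for in the question. The name `wide` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

raw_data = pd.DataFrame({
    "date_idx": [0, 1, 2, 0, 1, 2],
    "element_idx": [0, 0, 0, 1, 1, 1],
    "a": [10, 20, 30, 40, 50, 60],
    "b": [11, 21, 31, 41, 51, 61],
    "c": [12, 22, 32, 42, 52, 62],
})
inputs = ["a", "b", "c"]

# Rows are dates; columns are (input, element) pairs.
wide = raw_data.pivot(index="date_idx", columns="element_idx", values=inputs)

# wide[name] is the (n_dates, n_elements) block for one input; stacking the
# blocks along a new middle axis gives (n_dates, n_inputs, n_elements).
data = np.stack(
    [wide[name].to_numpy(dtype=np.float64) for name in inputs],
    axis=1,
)
print(data.shape)
print(data)
```

If some date/element pair is absent from the raw data, `pivot` leaves `NaN` in that cell, so the array would need a fill step before use.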

huangapple
  • Published on June 5, 2023 at 01:12:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76401543.html