Fastest way to apply histcount on an array grouped by previous bin result

Question

I have two large NumPy arrays that I need to bin according to some bin values. The first array is binned with the `data1Bins` edges. The data in the second array is then grouped by the bin index of the corresponding element in the first array, and within each group the number of values that fall into each `data2Bins` bin is counted.

Each group's counts are added as a row to a data frame. At the end, the total sum of the data frame is calculated so that each element can be divided by that total.

My solution works, but I'm wondering whether there is a more elegant or faster one. Time is very important, since this function will be executed many times.

So I'm always happy to hear possible improvements to this small piece of code. The current timing is `0.009527206420898438 s`.

**Current solution:**
```python
import pandas as pd
import numpy as np
import time

data1 = np.random.uniform(low=0, high=25, size=(50,))
data2 = np.random.uniform(low=0, high=25, size=(50,))

data1Bins = [0, *np.arange(1.5, 25, 1), 100]
data2Bins = [0, *np.arange(7.5, 360, 15), 360]

# Speed up from here ->
start = time.time()
inds = np.digitize(data1, data1Bins)

df = pd.DataFrame()

# 25 bins
for i in range(0, len(data1Bins)):
    binned_data = data2[np.asarray(inds == i).nonzero()[0].tolist()]
    count, bin_edges = np.histogram(binned_data, bins=data2Bins)
    # Merge the first and last data2 bins into a single column
    count = np.array([(count[0] + count[-1]), *count[1:-1]])

    df = pd.concat([df, pd.DataFrame(count.reshape(-1, len(count)))])

# Set index
df = df.reset_index(drop=True)

# Get total sum
total_sum = df.sum().sum()

# Divide each element by total sum
df = df / total_sum

df['Name'] = 'abc'
df['Id'] = 'def'
df['Nr'] = np.arange(df.shape[0])

print(time.time() - start)

print(df)
```

**End result:**

```
           0         1         2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24  Name   Id  Nr
0   0.000000  0.000000  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def   0
1   0.020408  0.020408  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def   1
2   0.020408  0.000000  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def   2
3   0.020408  0.000000  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def   3
4   0.000000  0.020408  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def   4
5   0.000000  0.020408  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def   5
6   0.061224  0.040816  0.020408  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def   6
7   0.000000  0.000000  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def   7
8   0.020408  0.081633  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def   8
9   0.000000  0.040816  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def   9
10  0.081633  0.000000  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def  10
11  0.000000  0.040816  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def  11
12  0.000000  0.020408  0.020408  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def  12
13  0.000000  0.040816  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def  13
14  0.000000  0.040816  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def  14
15  0.000000  0.020408  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def  15
16  0.000000  0.000000  0.020408  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def  16
17  0.000000  0.020408  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def  17
18  0.020408  0.000000  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def  18
19  0.040816  0.000000  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def  19
20  0.000000  0.020408  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def  20
21  0.020408  0.000000  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def  21
22  0.020408  0.040816  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def  22
23  0.020408  0.020408  0.020408  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def  23
24  0.000000  0.081633  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   abc  def  24
```
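To make the grouping step described in the question concrete, here is a tiny sketch with made-up numbers. The names `edges`, `x`, `y`, and `groups` are illustrative only and do not appear in the question's code. It shows how `np.digitize` assigns a bin index to each element of one array, and how that index can then select the matching elements of a second array.

```python
import numpy as np

# Illustrative values only (not taken from the question).
edges = np.array([0.0, 1.5, 2.5, 100.0])   # 4 edges define 3 bins
x = np.array([0.2, 1.7, 2.0, 30.0])        # array that is binned
y = np.array([10.0, 20.0, 30.0, 40.0])     # array that is grouped by x's bins

# With increasing edges, index i means edges[i-1] <= x < edges[i].
idx = np.digitize(x, edges)

# Group y by the bin index of the corresponding x element.
groups = {i: y[idx == i] for i in range(1, len(edges))}

# Values outside the edges get index 0 (below) or len(edges) (above).
outside = np.digitize(np.array([-1.0, 200.0]), edges)
```

Because in-range values get indices from `1` to `len(edges) - 1`, index `0` is only used for values below the first edge.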
# Answer 1
**Score**: 1
First, here is the baseline timing on my machine when executing the code after the `# Speed up from here` comment:

```
5.02 ms ± 134 µs per loop (mean ± std. dev. of 10 runs, 1,000 loops each)
```

You will save time by appending the values to a list and creating the data frame once at the end, rather than calling `pd.concat()` in every iteration. Each `pd.concat()` call builds a new data frame, so doing it inside the loop repeats that work for every bin.

```python
# [...]

data = []

for i in range(0, len(data1Bins)):
    binned_data = data2[np.asarray(inds == i).nonzero()[0].tolist()]
    count, bin_edges = np.histogram(binned_data, bins=data2Bins)
    count = np.array([(count[0] + count[-1]), *count[1:-1]])
    data.append(count)

df = pd.DataFrame(data)

# [...]
```

```
1.97 ms ± 36.5 µs per loop (mean ± std. dev. of 10 runs, 1,000 loops each)
```

Here are also a few improvements that will not save a significant amount of time, but will make your code cleaner:

```python
for i in range(0, len(data1Bins)):
    # cleaner way to get binned_data
    binned_data = data2[np.where(inds == i)]
    count, bin_edges = np.histogram(binned_data, bins=data2Bins)

    # cleaner way to get the values added to the dataframe
    values = np.concatenate(([count[0] + count[-1]], count[1:-1]))
    data.append(values)
```

```
1.93 ms ± 45.4 µs per loop (mean ± std. dev. of 10 runs, 1,000 loops each)
```
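
Going one step further than the answer, the per-bin loop could in principle be replaced by a single `np.histogram2d` call. The following is only a sketch, not something the answer proposes and not timed here. The function names `grouped_hist_loop` and `grouped_hist_2d` are hypothetical, and the sketch assumes every `data1` value lies inside the `data1Bins` edges, as it does for the question's sample data. The loop version is included only as a reference to compare against.

```python
import numpy as np


def grouped_hist_loop(data1, data2, data1Bins, data2Bins):
    """Reference: the list-append loop from the answer."""
    inds = np.digitize(data1, data1Bins)
    rows = []
    for i in range(len(data1Bins)):
        count, _ = np.histogram(data2[inds == i], bins=data2Bins)
        rows.append(np.concatenate(([count[0] + count[-1]], count[1:-1])))
    return np.array(rows)


def grouped_hist_2d(data1, data2, data1Bins, data2Bins):
    """Hypothetical loop-free variant built on np.histogram2d."""
    counts, _, _ = np.histogram2d(data1, data2, bins=[data1Bins, data2Bins])
    # Merge the first and last data2 bins, as the loop does.
    merged = np.column_stack((counts[:, 0] + counts[:, -1], counts[:, 1:-1]))
    # The loop also emits a row for digitize index 0 (values below the first
    # edge). That row is all zeros when data1 lies inside the edges, which is
    # the assumption made here.
    return np.vstack((np.zeros((1, merged.shape[1])), merged))


rng = np.random.default_rng(0)
data1 = rng.uniform(0, 25, size=50)
data2 = rng.uniform(0, 25, size=50)
data1Bins = [0, *np.arange(1.5, 25, 1), 100]
data2Bins = [0, *np.arange(7.5, 360, 15), 360]

counts = grouped_hist_2d(data1, data2, data1Bins, data2Bins)
normalized = counts / counts.sum()  # each element divided by the total sum
```

Wrapping `normalized` in `pd.DataFrame(...)` and adding the `Name`, `Id`, and `Nr` columns would then proceed as in the question. Whether this is actually faster for a given array size is something to measure with `%timeit` rather than assume.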
