March 4, 2023, 07:12:47

Fastest way to apply histcount on an array grouped by previous bin result

Question

<details>
<summary>English:</summary>
I have 2 large NumPy arrays that I need to bin according to some bin values. The first array needs to be binned with the `data1Bins` values. Then the data in the second array needs to be grouped by the bin results of the first array. Once this grouping is done, the number of values in each bin needs to be counted.
Each counted result needs to be added as a row to a data frame, and at the end the total sum of the data frame needs to be calculated so that each element can be divided by it.
Although my solution works, I'm wondering whether there is a more elegant or faster one. Time is very important, since this function will be executed many times.
So, all that said, I'm always happy to hear possible improvements to this small piece of code. The current timing is `0.009527206420898438 s`.
**Current solution:**

```python
import pandas as pd
import numpy as np
import time

data1 = np.random.uniform(low=0, high=25, size=(50,))
data2 = np.random.uniform(low=0, high=25, size=(50,))

data1Bins = [0, *np.arange(1.5, 25, 1), 100]
data2Bins = [0, *np.arange(7.5, 360, 15), 360]

# Speed up from here ->
start = time.time()
inds = np.digitize(data1, data1Bins)

df = pd.DataFrame()

# 25 bins
for i in range(0, len(data1Bins)):
    binned_data = data2[np.asarray(inds == i).nonzero()[0].tolist()]
    count, bin_edges = np.histogram(binned_data, bins=data2Bins)
    count = np.array([(count[0] + count[-1]), *count[1:-1]])

    df = pd.concat([df, pd.DataFrame(count.reshape(-1, len(count)))])

# Set index
df = df.reset_index(drop=True)

# Get total sum
total_sum = df.sum().sum()

# Divide each element by total sum
df = df / total_sum

df['Name'] = 'abc'
df['Id'] = 'def'
df['Nr'] = np.arange(df.shape[0])

print(time.time() - start)

print(df)
```


**End result:**

```
           0         1         2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24 Name   Id  Nr
0   0.000000  0.000000  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def   0
1   0.020408  0.020408  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def   1
2   0.020408  0.000000  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def   2
3   0.020408  0.000000  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def   3
4   0.000000  0.020408  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def   4
5   0.000000  0.020408  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def   5
6   0.061224  0.040816  0.020408  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def   6
7   0.000000  0.000000  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def   7
8   0.020408  0.081633  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def   8
9   0.000000  0.040816  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def   9
10  0.081633  0.000000  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def  10
11  0.000000  0.040816  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def  11
12  0.000000  0.020408  0.020408  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def  12
13  0.000000  0.040816  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def  13
14  0.000000  0.040816  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def  14
15  0.000000  0.020408  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def  15
16  0.000000  0.000000  0.020408  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def  16
17  0.000000  0.020408  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def  17
18  0.020408  0.000000  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def  18
19  0.040816  0.000000  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def  19
20  0.000000  0.020408  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def  20
21  0.020408  0.000000  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def  21
22  0.020408  0.040816  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def  22
23  0.020408  0.020408  0.020408  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def  23
24  0.000000  0.081633  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  abc  def  24
```


</details>
# Answer 1
**Score**: 1
First, here is the baseline timing on my machine for the code after the `# Speed up from here` comment:

5.02 ms ± 134 µs per loop (mean ± std. dev. of 10 runs, 1,000 loops each)

You will save time by appending the values to a list and creating the dataframe once at the end, rather than calling `pd.concat()` on every iteration.

```python
# [...]
data = []
for i in range(0, len(data1Bins)):
    binned_data = data2[np.asarray(inds == i).nonzero()[0].tolist()]
    count, bin_edges = np.histogram(binned_data, bins=data2Bins)
    count = np.array([(count[0] + count[-1]), *count[1:-1]])
    data.append(count)
df = pd.DataFrame(data)
# [...]
```

1.97 ms ± 36.5 µs per loop (mean ± std. dev. of 10 runs, 1,000 loops each)

Here are also a few improvements that save a negligible amount of time but make your code cleaner:
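To reproduce this comparison on your own machine, something like the following sketch could be used. It is not part of the original benchmark: the helper names `row_counts`, `build_with_concat` and `build_with_list` are made up for illustration, the random data is seeded so both variants see the same input, and the timings it prints will differ from the numbers quoted here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so both variants see the same data
data1 = rng.uniform(low=0, high=25, size=50)
data2 = rng.uniform(low=0, high=25, size=50)
data1Bins = [0, *np.arange(1.5, 25, 1), 100]
data2Bins = [0, *np.arange(7.5, 360, 15), 360]
inds = np.digitize(data1, data1Bins)


def row_counts(i):
    """Histogram of the data2 values whose data1 value fell into bin i."""
    count, _ = np.histogram(data2[inds == i], bins=data2Bins)
    # Fold the last bin into the first one, as in the question.
    return np.concatenate(([count[0] + count[-1]], count[1:-1]))


def build_with_concat():
    # Grow the frame row by row, as in the question.
    df = pd.DataFrame()
    for i in range(len(data1Bins)):
        df = pd.concat([df, pd.DataFrame(row_counts(i).reshape(1, -1))])
    return df.reset_index(drop=True)


def build_with_list():
    # Collect the rows first and build the frame once.
    return pd.DataFrame([row_counts(i) for i in range(len(data1Bins))])


same = np.array_equal(build_with_concat().to_numpy(), build_with_list().to_numpy())
print("identical counts:", same)

for fn in (build_with_concat, build_with_list):
    seconds = timeit.timeit(fn, number=50) / 50
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

Checking that both variants produce the same counts before timing them guards against comparing two functions that do different work.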

```python
for i in range(0, len(data1Bins)):
    # cleaner way to get binned_data
    binned_data = data2[np.where(inds == i)]
    count, bin_edges = np.histogram(binned_data, bins=data2Bins)

    # cleaner way to get the values added to the dataframe
    values = np.concatenate(([count[0] + count[-1]], count[1:-1]))
    data.append(values)
```

1.93 ms ± 45.4 µs per loop (mean ± std. dev. of 10 runs, 1,000 loops each)
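Going one step further than the timings above, the Python loop could in principle be dropped altogether. The sketch below is not something I benchmarked; it swaps the per-bin `np.histogram` calls for a single `np.histogram2d` call and assumes that every `data1` value lies inside `[data1Bins[0], data1Bins[-1])`, which holds for the sample data. The names `loop_counts`, `counts2d`, `folded` and `vector_counts` are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data1 = rng.uniform(low=0, high=25, size=50)
data2 = rng.uniform(low=0, high=25, size=50)
data1Bins = [0, *np.arange(1.5, 25, 1), 100]
data2Bins = [0, *np.arange(7.5, 360, 15), 360]

# Reference: the loop shown above.
inds = np.digitize(data1, data1Bins)
rows = []
for i in range(len(data1Bins)):
    count, _ = np.histogram(data2[np.where(inds == i)], bins=data2Bins)
    rows.append(np.concatenate(([count[0] + count[-1]], count[1:-1])))
loop_counts = np.array(rows)

# Loop-free variant: one 2-D histogram over both arrays at once.
counts2d, _, _ = np.histogram2d(data1, data2, bins=[data1Bins, data2Bins])
counts2d = counts2d.astype(np.int64)
# Fold the last data2 bin into the first one, as the loop does.
folded = np.concatenate(
    (counts2d[:, :1] + counts2d[:, -1:], counts2d[:, 1:-1]), axis=1
)
# np.digitize returns 0 only for values below the first edge, so the loop's
# first row is empty for in-range data; prepend a zero row to match it.
vector_counts = np.vstack((np.zeros((1, folded.shape[1]), dtype=np.int64), folded))

print("same result as loop:", np.array_equal(loop_counts, vector_counts))

# Normalise by the total count, as in the question.
df = pd.DataFrame(vector_counts / vector_counts.sum())
```

Whether this is actually faster for arrays of your size is worth measuring with `timeit` before adopting it. If `data1` can contain values outside the bin range, the two variants may disagree and the zero-row shortcut no longer applies.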
