June 19, 2023, 20:30:58

Can you select a variable by binning another variable in a vectorized way?

Problem

I have several variables x that I want to sort into a variable binned_list using some bins.

As an example, x is a random two-component vector with each component between 0 and 10/sqrt(2), so its modulus lies between 0 and 10. I want to sort x into the list binned_list by its modulus. I have three bins for the modulus: [0, 3.33), [3.33, 6.67) and [6.67, 10). I want to save different iterations of x into binned_list, which is a list of 3 lists, each one holding the values of x whose modulus falls in that bin.

I can do it in the following way:

import numpy as np
N_bins = 3
Bins = np.linspace(0, 10, N_bins+1)
binned_list = [[] for b in range(N_bins)]
N_elements = 5
np.random.seed(1)
for k in range(3):
    x = np.random.random((N_elements,2))/np.sqrt(2)*10
    mod_x = np.sqrt(x[:,0]**2 + x[:,1]**2)
    dig_x = np.digitize( mod_x, bins = Bins ) - 1
    for y in range(len(x)):
        binned_list[dig_x[y]].append(x[y])

Output:

[[array([8.08752089e-04, 2.13781412e+00]),
  array([1.03772086, 0.65293247]),
  array([1.31705859, 2.44348333]),
  array([0.99268556, 1.40078906]),
  array([0.60135339, 0.27615902])],
 [array([2.94879087, 5.09346334]),
  array([2.80556972, 3.81000966]),
  array([2.96415284, 4.84523355]),
  array([1.44569572, 6.20922794]),
  array([0.19365953, 4.74092123]),
  array([2.95079056, 3.95053366]),
  array([2.21624362, 4.89546016]),
  array([1.20088241, 6.20940519])],
 [array([5.66211915, 6.84664326]), array([6.19700713, 6.32582438])]]

Question

Once I have digitized the elements of x, can I avoid looping through them in order to save them in variable binned_list? I would like to do this in a vectorized way in order to make the code more efficient.

I thought of something like:

binned_list[dig_x].append(x)

But I can't index a list with an array, and if I define binned_list as an array I can't use append either.
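For reference, here is a small sketch of what the np.digitize step in the loop above returns. The moduli are hand-picked for illustration and are not taken from the random data. Note that a modulus exactly equal to the last edge (10) maps to index 3, which would be out of range for a list of 3 lists; the random moduli in the example are expected to stay below 10.

```python
import numpy as np

N_bins = 3
Bins = np.linspace(0, 10, N_bins + 1)  # edges: 0, 3.33..., 6.66..., 10

# Hand-picked moduli (illustrative assumption, not from the original data).
mod_x = np.array([0.0, 3.0, 5.0, 9.99, 10.0])

# digitize returns i such that Bins[i-1] <= value < Bins[i];
# subtracting 1 turns that into a zero-based bin index.
dig_x = np.digitize(mod_x, bins=Bins) - 1
print(dig_x)  # [0 0 1 2 3]
```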

Answer 1

Score: 1

You can avoid the nested for loop and loop only over the bins by masking with nonzero():

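# Body of the loop over k from the question; N_elements, Bins, N_bins and binned_list are defined there.
x = np.random.random((N_elements, 2)) / np.sqrt(2) * 10
mod_x = np.sqrt(np.sum(x**2, axis=1))
dig_x = np.digitize(mod_x, bins=Bins) - 1
for i in range(N_bins):
    # extend instead of assigning, so rows from earlier iterations are kept
    binned_list[i].extend(x[(dig_x == i).nonzero()])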
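If even the loop over the bins should go, one possible alternative (not part of the answer above) is to sort the rows by bin index and split the sorted array at the bin boundaries. The sketch below uses a hypothetical helper name, group_rows_by_bin, and assumes every modulus lies in [bin_edges[0], bin_edges[-1]), so that all bin indices are between 0 and the number of bins minus 1.

```python
import numpy as np

def group_rows_by_bin(x, bin_edges):
    """Split the rows of x into one array per bin of the row modulus.

    Hypothetical helper; assumes all moduli fall inside the bin range.
    """
    n_bins = len(bin_edges) - 1
    mod_x = np.linalg.norm(x, axis=1)
    dig_x = np.digitize(mod_x, bins=bin_edges) - 1
    # A stable sort keeps the original row order inside each bin.
    order = np.argsort(dig_x, kind="stable")
    # Rows per bin; the cumulative counts are the split points.
    counts = np.bincount(dig_x, minlength=n_bins)
    return np.split(x[order], np.cumsum(counts)[:-1])

np.random.seed(1)
x = np.random.random((5, 2)) / np.sqrt(2) * 10
groups = group_rows_by_bin(x, np.linspace(0, 10, 4))
print([g.shape for g in groups])
```

This returns a list of arrays rather than a list of lists. To accumulate over several iterations, the arrays for each bin could be collected and joined with np.concatenate at the end.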

Answer 2

Score: 0

I compared @mpw2's answer with my original loop. His answer is indeed faster, by a factor of roughly 1.5 to 5 for the different numbers of iterations and elements that I tried:

import numpy as np
import time
N_bins = 3
Bins = np.linspace(0, 10, N_bins+1)
binned_list1 = [[] for b in range(N_bins)]
binned_list2 = [[] for b in range(N_bins)]
N_elements = 10000
np.random.seed(1)
k_iter = 10
x = np.random.random((k_iter, N_elements,2))/np.sqrt(2)*10
start_time = time.time()
for k in range(k_iter):
    mod_x = np.sqrt(x[k,:,0]**2 + x[k,:,1]**2)
    dig_x = np.digitize( mod_x, bins = Bins ) - 1
    for y in range(len(x[k])):
        binned_list1[dig_x[y]].append(x[k,y])
print("1: %s s" % (time.time() - start_time))
start_time = time.time()
for k in range(k_iter):
    mod_x = np.sqrt(x[k,:,0]**2 + x[k,:,1]**2)
    dig_x = np.digitize( mod_x, bins = Bins ) - 1
    for i in range(N_bins):
        binned_list2[i].extend(x[k,(dig_x==i).nonzero()[0]])
print("2: %s s" % (time.time() - start_time))
for i in range(N_bins):
    print(np.allclose(np.array(binned_list1[i]), np.array(binned_list2[i])))

Output

1: 0.039897918701171875 s
2: 0.01396489143371582 s
True
True
True
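Since all iterations are generated up front in this benchmark, the loop over k could in principle be dropped as well by flattening the iteration axis and digitizing once. This is only a sketch under that assumption (it does not apply if each x is produced one iteration at a time), and I have not timed it against the two versions above. The name binned_arrays is introduced here for illustration.

```python
import numpy as np

N_bins = 3
Bins = np.linspace(0, 10, N_bins + 1)
N_elements = 10000
k_iter = 10
np.random.seed(1)
x = np.random.random((k_iter, N_elements, 2)) / np.sqrt(2) * 10

# Flatten the iteration axis: row k * N_elements + y is x[k, y].
flat_x = x.reshape(-1, 2)
mod_x = np.sqrt(flat_x[:, 0]**2 + flat_x[:, 1]**2)
dig_x = np.digitize(mod_x, bins=Bins) - 1

# One boolean mask per bin; each entry is an (n_i, 2) array
# whose rows keep the same order as in the per-iteration loops.
binned_arrays = [flat_x[dig_x == i] for i in range(N_bins)]
print([b.shape for b in binned_arrays])
```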
