Can you select a variable by binning another variable in a vectorized way?
Problem
I have several values of x that I want to sort into a variable binned_list using some bins.
As an example, x is a random vector with two components, each ranging from 0 to 10/sqrt(2), that I want to sort into the list binned_list by the modulus of x. I have three bins for the modulus: [0, 3.33), [3.33, 6.66) and [6.66, 10). I want to save different iterations of x into binned_list, which is a list of 3 lists, each one holding the values of x whose modulus falls in that bin.
I can do it in the following way:
import numpy as np

N_bins = 3
Bins = np.linspace(0, 10, N_bins+1)        # bin edges for the modulus
binned_list = [[] for b in range(N_bins)]  # one list per bin
N_elements = 5
np.random.seed(1)
for k in range(3):
    x = np.random.random((N_elements,2))/np.sqrt(2)*10
    mod_x = np.sqrt(x[:,0]**2 + x[:,1]**2)
    dig_x = np.digitize( mod_x, bins = Bins ) - 1  # zero-based bin index
    # Append each row of x to the list of its bin
    for y in range(len(x)):
        binned_list[dig_x[y]].append(x[y])
Output:
[[array([8.08752089e-04, 2.13781412e+00]),
array([1.03772086, 0.65293247]),
array([1.31705859, 2.44348333]),
array([0.99268556, 1.40078906]),
array([0.60135339, 0.27615902])],
[array([2.94879087, 5.09346334]),
array([2.80556972, 3.81000966]),
array([2.96415284, 4.84523355]),
array([1.44569572, 6.20922794]),
array([0.19365953, 4.74092123]),
array([2.95079056, 3.95053366]),
array([2.21624362, 4.89546016]),
array([1.20088241, 6.20940519])],
[array([5.66211915, 6.84664326]), array([6.19700713, 6.32582438])]]
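To make the binning step concrete, here is a minimal sketch of how np.digitize maps a modulus to a bin index. The array sample_mod is a hypothetical set of moduli chosen for illustration, one per bin; it is not part of the original code.

```python
import numpy as np

N_bins = 3
Bins = np.linspace(0, 10, N_bins + 1)  # edges: 0, 3.33..., 6.66..., 10

# Hypothetical moduli, chosen so that one falls in each bin
sample_mod = np.array([1.0, 5.0, 9.0])

# np.digitize returns i such that Bins[i-1] <= value < Bins[i],
# so subtracting 1 gives a zero-based bin index
sample_idx = np.digitize(sample_mod, bins=Bins) - 1
print(sample_idx)  # [0 1 2]
```

A modulus of exactly 10 would map to index 3, which is outside binned_list. That case should not occur here, since np.random.random draws from [0, 1).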
Question
Once I have digitized the elements of x, can I avoid looping through them in order to save them in the variable binned_list? I would like to do this in a vectorized way in order to make the code more efficient.
I thought of something like:
    binned_list[dig_x].append(x)
But I can't index a list with an array. Also, if I define binned_list as an array, I can't append to it either.
Answer 1
Score: 1
You can avoid the nested for loop and loop only over the bins by masking with nonzero():
x = np.random.random((N_elements,2))/np.sqrt(2)*10
mod_x = np.sqrt(np.sum(x**2,1))
dig_x = np.digitize(mod_x,bins=Bins)-1
for i in range(N_bins):
    # Select all rows of x whose bin index equals i
    binned_list[i] = x[(dig_x==i).nonzero()]
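If even the loop over bins should go, one possible alternative (not part of the answer above) is to sort the rows by bin index and split them at the bin boundaries. The sketch below assumes a single batch of x and that every modulus lies inside the bin range; the names order, counts and groups are introduced here for illustration only.

```python
import numpy as np

N_bins = 3
N_elements = 5
Bins = np.linspace(0, 10, N_bins + 1)

np.random.seed(1)
x = np.random.random((N_elements, 2)) / np.sqrt(2) * 10
mod_x = np.sqrt(np.sum(x**2, 1))
dig_x = np.digitize(mod_x, bins=Bins) - 1

# A stable sort keeps the original row order inside each bin
order = np.argsort(dig_x, kind="stable")
# Number of rows per bin; minlength gives a count even for empty bins
counts = np.bincount(dig_x, minlength=N_bins)
# Split the sorted rows at the cumulative bin boundaries
groups = np.split(x[order], np.cumsum(counts)[:-1])
```

Here groups is a list of N_bins arrays of shape (n_i, 2). With only three bins, the masked loop is likely just as fast; sorting tends to pay off mainly when the number of bins is large.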
Answer 2
Score: 0
I compared @mpw2's answer with my original loop, and his approach is indeed faster, by a factor of roughly 1.5 to 5 for the different numbers of iterations and elements that I tried:
import numpy as np
import time

N_bins = 3
Bins = np.linspace(0, 10, N_bins+1)
binned_list1 = [[] for b in range(N_bins)]
binned_list2 = [[] for b in range(N_bins)]
N_elements = 10000
np.random.seed(1)
k_iter = 10
x = np.random.random((k_iter, N_elements,2))/np.sqrt(2)*10

# Method 1: append element by element
start_time = time.time()
for k in range(k_iter):
    mod_x = np.sqrt(x[k,:,0]**2 + x[k,:,1]**2)
    dig_x = np.digitize( mod_x, bins = Bins ) - 1
    for y in range(len(x[k])):
        binned_list1[dig_x[y]].append(x[k,y])
print("1: %s s" % (time.time() - start_time))

# Method 2: loop over the bins only and select rows with a mask
start_time = time.time()
for k in range(k_iter):
    mod_x = np.sqrt(x[k,:,0]**2 + x[k,:,1]**2)
    dig_x = np.digitize( mod_x, bins = Bins ) - 1
    for i in range(N_bins):
        binned_list2[i].extend(x[k,(dig_x==i).nonzero()[0]])
print("2: %s s" % (time.time() - start_time))

# Check that both methods give the same result
for i in range(N_bins):
    print(np.allclose(np.array(binned_list1[i]), np.array(binned_list2[i])))
Output
1: 0.039897918701171875 s
2: 0.01396489143371582 s
True
True
True
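Since all iterations are already stored in one array in this benchmark, the loop over k could in principle be dropped as well. The following is a sketch under that assumption (it does not apply if x is produced one iteration at a time); flat_x, dig_flat and binned_arrays are names introduced here for illustration.

```python
import numpy as np

N_bins = 3
Bins = np.linspace(0, 10, N_bins + 1)
k_iter, N_elements = 10, 10000

np.random.seed(1)
x = np.random.random((k_iter, N_elements, 2)) / np.sqrt(2) * 10

# Collapse the iteration axis; rows stay ordered iteration by iteration
flat_x = x.reshape(-1, 2)
mod_flat = np.sqrt(flat_x[:, 0]**2 + flat_x[:, 1]**2)
dig_flat = np.digitize(mod_flat, bins=Bins) - 1

# One boolean mask per bin, applied to all iterations at once
binned_arrays = [flat_x[dig_flat == i] for i in range(N_bins)]
```

Each entry of binned_arrays is a NumPy array rather than a Python list, and the row order within a bin matches the order produced by looping over k.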