Can you select a variable by binning another variable in a vectorized way?

Problem

I have several arrays x whose rows I want to sort into a variable binned_list according to a set of bins.

As an example, x is an array of random two-component vectors, each component ranging from 0 to 10/sqrt(2), so the modulus of each vector lies between 0 and 10. I want to sort these vectors into binned_list by their modulus. I have three bins for the modulus: [0, 3.33), [3.33, 6.66) and [6.66, 10). Over several iterations, each with a new x, I want to collect the vectors in binned_list, which is a list of 3 lists, one per bin, each holding the vectors whose modulus falls in that bin.

I can do it in the following way:

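import numpy as np

N_bins = 3
# Bin edges for the modulus: 0, 3.33..., 6.66..., 10
Bins = np.linspace(0, 10, N_bins+1)

# One empty list per bin
binned_list = [[] for b in range(N_bins)]

N_elements = 5

np.random.seed(1)

for k in range(3):
    # N_elements random 2D vectors with components in [0, 10/sqrt(2))
    x = np.random.random((N_elements,2))/np.sqrt(2)*10
    mod_x = np.sqrt(x[:,0]**2 + x[:,1]**2)
    # digitize returns 1-based bin numbers, so shift to 0-based list indices
    dig_x = np.digitize( mod_x, bins = Bins ) - 1

    # Append each vector to the list of its bin
    for y in range(len(x)):
        binned_list[dig_x[y]].append(x[y])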

Output:

[[array([8.08752089e-04, 2.13781412e+00]),
  array([1.03772086, 0.65293247]),
  array([1.31705859, 2.44348333]),
  array([0.99268556, 1.40078906]),
  array([0.60135339, 0.27615902])],
 [array([2.94879087, 5.09346334]),
  array([2.80556972, 3.81000966]),
  array([2.96415284, 4.84523355]),
  array([1.44569572, 6.20922794]),
  array([0.19365953, 4.74092123]),
  array([2.95079056, 3.95053366]),
  array([2.21624362, 4.89546016]),
  array([1.20088241, 6.20940519])],
 [array([5.66211915, 6.84664326]), array([6.19700713, 6.32582438])]]
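
To make the bin assignment above concrete, here is a minimal sketch of what np.digitize does with the same bin edges. The moduli are hypothetical values picked to fall clearly inside each bin; they are not taken from the output above.

```python
import numpy as np

# Same bin edges as in the question: 0, 3.33..., 6.66..., 10
bins = np.linspace(0, 10, 3 + 1)

# Hypothetical moduli, two per bin
moduli = np.array([0.5, 3.0, 4.2, 6.0, 7.5, 9.9])

# np.digitize returns 1-based bin numbers for values inside the edges,
# so subtracting 1 gives a 0-based index usable with binned_list
bin_index = np.digitize(moduli, bins=bins) - 1
print(bin_index)  # [0 0 1 1 2 2]
```

A modulus of exactly 10 would get index 3 and fall outside the three lists, but the random components here are strictly below 10/sqrt(2), so that case does not arise.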

Question

Once I have digitized the elements of x, can I avoid looping over them one by one to store them in binned_list? I would like to do this in a vectorized way to make the code more efficient.

I thought of something like:

binned_list[dig_x].append(x)

But a list cannot be indexed with an array, and if I define binned_list as a NumPy array instead, I can no longer use append.

Answer 1

Score: 1

You can avoid the nested for loop and loop only over the bins by selecting the rows of each bin with a boolean mask and nonzero():

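# Assumes np, N_elements, N_bins, Bins and binned_list from the question
x = np.random.random((N_elements,2))/np.sqrt(2)*10
mod_x = np.sqrt(np.sum(x**2,1))
dig_x = np.digitize(mod_x,bins=Bins)-1
for i in range(N_bins):
    # Rows of x whose modulus falls in bin i; this assignment replaces
    # whatever binned_list[i] held before
    binned_list[i] = x[(dig_x==i).nonzero()]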
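
If even the loop over bins should go, one possible alternative (not part of this answer, just a sketch) is to sort the rows by bin index and cut the sorted array at the bin boundaries with np.split. The seed and the use of np.random.default_rng are assumptions for illustration, so the data differ from the question's.

```python
import numpy as np

N_bins = 3
Bins = np.linspace(0, 10, N_bins + 1)

# Hypothetical data: a different generator and seed than in the question
rng = np.random.default_rng(1)
x = rng.random((5, 2)) / np.sqrt(2) * 10

mod_x = np.linalg.norm(x, axis=1)
dig_x = np.digitize(mod_x, bins=Bins) - 1

# A stable sort keeps the original order of the rows within each bin
order = np.argsort(dig_x, kind="stable")

# Rows per bin, then the positions at which one bin ends and the next begins
counts = np.bincount(dig_x, minlength=N_bins)
split_points = np.cumsum(counts)[:-1]

# List of N_bins arrays, each of shape (n_i, 2); empty bins give shape (0, 2)
binned = np.split(x[order], split_points)
```

The result is a list of arrays rather than a list of lists, which is usually more convenient for further NumPy work. Whether it is faster than masking per bin depends on the number of bins and elements, so it is worth timing on your own data.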

Answer 2

Score: 0

I compared @mpw2's answer with my original loop. His approach is indeed faster, by a factor of roughly 1.5 to 5 across the different numbers of iterations and elements that I tried:

import numpy as np
import time

N_bins = 3
Bins = np.linspace(0, 10, N_bins+1)

binned_list1 = [[] for b in range(N_bins)]
binned_list2 = [[] for b in range(N_bins)]


N_elements = 10000

np.random.seed(1)

k_iter = 10

x = np.random.random((k_iter, N_elements,2))/np.sqrt(2)*10


# Method 1: append the vectors one at a time (original loop)
start_time = time.time()
for k in range(k_iter):
    mod_x = np.sqrt(x[k,:,0]**2 + x[k,:,1]**2)
    dig_x = np.digitize( mod_x, bins = Bins ) - 1

    for y in range(len(x[k])):
        binned_list1[dig_x[y]].append(x[k,y])

print("1: %s s" % (time.time() - start_time))

# Method 2: loop over the bins only and extend each list with all matching rows
start_time = time.time()
for k in range(k_iter):
    mod_x = np.sqrt(x[k,:,0]**2 + x[k,:,1]**2)
    dig_x = np.digitize( mod_x, bins = Bins ) - 1

    for i in range(N_bins):
        binned_list2[i].extend(x[k,(dig_x==i).nonzero()[0]])

print("2: %s s" % (time.time() - start_time))

# Check that both methods filled every bin with the same vectors
for i in range(N_bins):
    print(np.allclose(np.array(binned_list1[i]), np.array(binned_list2[i])))

Output

1: 0.039897918701171875 s
2: 0.01396489143371582 s
True
True
True
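
When all iterations are available up front, as they are in this benchmark, the loop over k can also be dropped. The following is only a sketch under that assumption: it merges the iteration and element axes, digitizes once, and applies one mask per bin. The names x_flat, dig_flat and binned_arrays are introduced here for illustration.

```python
import numpy as np

N_bins = 3
Bins = np.linspace(0, 10, N_bins + 1)
k_iter, N_elements = 10, 10000

np.random.seed(1)
x = np.random.random((k_iter, N_elements, 2)) / np.sqrt(2) * 10

# Merge the iteration and element axes: shape (k_iter * N_elements, 2)
x_flat = x.reshape(-1, 2)

# Digitize every modulus in a single call
mod_flat = np.sqrt(x_flat[:, 0]**2 + x_flat[:, 1]**2)
dig_flat = np.digitize(mod_flat, bins=Bins) - 1

# One boolean mask per bin; rows keep their (iteration, element) order
binned_arrays = [x_flat[dig_flat == i] for i in range(N_bins)]
```

Each entry of binned_arrays holds the same vectors, in the same order, as the corresponding list built by the loops above. If x arrives one iteration at a time instead, the per-iteration extend shown in method 2 remains the simpler choice.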

huangapple
  • Posted on June 19, 2023 at 20:30:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76506652.html