Convert one column to a specific number of columns
Question
I'm trying to convert a single column of data, whose values are integers from 0 to 5, into 6 columns according to each value. For example, if the value is 0, the first of the six columns becomes 1 and the others become 0, and so on. However, since the shape of my target is (1034892, 1), this takes a long time and sometimes even crashes. This code worked for 500000 rows, but it does not work for this amount.
Is there any way to make it work for this amount of data?
def convert_to_num_class(target):
    for i, value in enumerate(target):
        if i == 0:
            y_new = np.array(np.eye(6)[int(value[0])])
        else:
            y_new = np.vstack((y_new, np.eye(6)[int(value[0])]))
    return y_new
Answer 1
Score: 1
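Use pandas get_dummies:
>>> import numpy as np
>>> import pandas as pd
>>> target = np.random.randint(6, size=(10, 1))  # the original target has shape (1034892, 1)
>>> target = target.flatten()
>>> target
array([0, 1, 0, 0, 4, 3, 1, 5, 4, 5])
>>> pd.get_dummies(target, dtype=int).to_numpy()  # dtype=int gives 0/1; pandas 2.0+ defaults to bool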
array([[1, 0, 0, 0, 0],
[0, 1, 0, 0, 0],
[1, 0, 0, 0, 0],
[1, 0, 0, 0, 0],
[0, 0, 0, 1, 0],
[0, 0, 1, 0, 0],
[0, 1, 0, 0, 0],
[0, 0, 0, 0, 1],
[0, 0, 0, 1, 0],
[0, 0, 0, 0, 1]])
If your target doesn't contain every value in the range you want (as in the example above, where target lacks the value 2), the columns for the missing values will be absent. One workaround is the following:
>>> target = pd.Categorical(target, categories=np.arange(6))
>>> pd.get_dummies(target, dtype=int).to_numpy()
array([[1, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0],
[0, 0, 0, 1, 0, 0],
[0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 1]])
This is very fast even for a target of size (1034892, 1), as in your case.
Answer 2
Score: 0
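I also solved it using Keras's to_categorical, which takes just a second for this amount of data. Recent Keras releases expose it as keras.utils.to_categorical; older ones also provided it under keras.utils.np_utils:
import numpy as np
from keras.utils import to_categorical

def convert_to_num_class(target, num_classes=6):
    # np.int was removed in NumPy 1.24; the built-in int is equivalent here
    target = target.astype(int)
    # pass the class count explicitly so a label absent from the data still gets a column
    return to_categorical(target, num_classes)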
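One point worth checking before choosing the number of output columns: counting the distinct labels that happen to be present can undercount the classes when a label never occurs in the data. The sketch below is an illustration only, using NumPy and the sample labels from Answer 1 (where the value 2 is missing); the variable names are made up for this example.

```python
import numpy as np

# Sample labels from Answer 1; the value 2 never occurs.
labels = np.array([0, 1, 0, 0, 4, 3, 1, 5, 4, 5])

# Counts only the labels that are actually present.
distinct = len(np.unique(labels))

# Covers every label from 0 up to the largest one seen.
span = int(labels.max()) + 1

print(distinct)  # 5
print(span)      # 6
```

With six known classes, passing the class count explicitly (6 here) is the safer choice, since neither derived value is guaranteed to be right if the largest label is also missing from a given batch.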
Answer 3
Score: 0
No need to resort to pandas or keras; just index using a tuple:
import numpy as np
categories = 6
N = 10
target = np.random.randint(categories, size=(N,1)) # this should be your data
y = np.zeros((N, categories), dtype=np.uint8)
mask = (np.arange(N), target.flatten())
y[mask] = 1
Performance check:
def one_hot(target, categories=None):
    target = target.flatten()
    N = target.size
    if categories is None:
        # labels are used directly as column indices, so they must start at 0
        categories = target.max() + 1
    y = np.zeros((N, categories), dtype=np.uint8)
    mask = (np.arange(N), target)  # (row indices, column indices)
    y[mask] = 1
    return y
N = 1034892
cats = 6
r = np.random.randint(cats, size=(N))
%timeit one_hot(r)
# 9.63 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
import pandas as pd
%timeit pd.get_dummies(r).to_numpy()
# 18.2 ms ± 183 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
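For context on why the loop in the question struggles: np.vstack allocates a new array and copies everything accumulated so far on every iteration, so the total work grows quadratically with the number of rows. A minimal sketch of a vectorized version of the same np.eye idea follows, assuming the labels are integers from 0 to 5. The name one_hot_eye and the range check are illustrative additions, not part of the original answers.

```python
import numpy as np

def one_hot_eye(target, categories=6):
    """One-hot encode integer labels by indexing rows of an identity matrix."""
    # Accept shape (N, 1) or (N,) and make the labels usable as indices.
    labels = np.asarray(target).ravel().astype(np.intp)
    if labels.size and (labels.min() < 0 or labels.max() >= categories):
        raise ValueError("labels must lie in [0, categories)")
    # Row k of the identity matrix is the one-hot vector for label k,
    # so a single fancy-indexing step builds the whole result at once.
    return np.eye(categories, dtype=np.uint8)[labels]

target = np.array([[0], [1], [0], [4], [3], [5]])
encoded = one_hot_eye(target)
print(encoded.shape)  # (6, 6)
print(encoded[3])     # [0 0 0 0 1 0]
```

This keeps the question's original approach (selecting rows of np.eye(6)) but performs the selection once for all rows instead of once per row, and uint8 output keeps the result for about a million rows at roughly 6 MB.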