Convert one column to a specific number of columns

Question

I'm trying to convert one column of data, whose values are integers from 0 to 5, into 6 columns according to each value.
For example, if the value is 0, the first of the six columns becomes 1 and the others become 0, and so on. However, since the shape of my target is (1034892, 1), this takes a long time and sometimes even crashes. The code below worked for 500000 rows, but it does not work for the full data set.

Is there any way to make it work for this amount of data?

import numpy as np

def convert_to_num_class(target):
    for i, value in enumerate(target):
        if i == 0:
            y_new = np.array(np.eye(6)[int(value[0])])
        else:
            # np.vstack copies the whole array on every iteration
            y_new = np.vstack((y_new, np.eye(6)[int(value[0])]))
    return y_new

Answer 1

Score: 1

Using pandas get_dummies:

>>> import numpy as np
>>> import pandas as pd
>>> target = np.random.randint(6, size=(10, 1))  # the original target is of shape (1034892, 1)
>>> target = target.flatten()
>>> target
array([0, 1, 0, 0, 4, 3, 1, 5, 4, 5])

>>> pd.get_dummies(target, dtype=int).to_numpy()  # dtype=int: the default is bool from pandas 2.0 on
array([[1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1]])

If your target does not contain every value in the range you want (as in the example above, where target has no value 2), the columns for the missing values will be absent. One workaround is the following:

>>> target = pd.Categorical(target, categories=np.arange(6))

>>> pd.get_dummies(target, dtype=int).to_numpy()  # dtype=int: the default is bool from pandas 2.0 on
array([[1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 1]])

This is very fast even for a target of shape (1034892, 1) like yours.
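
The two steps above can be combined into one helper. The following is only a minimal sketch: the function name one_hot_pandas and the n_classes parameter are hypothetical names chosen for illustration, and the dtype is passed explicitly on the assumption that plain 0/1 integers are wanted.

```python
import numpy as np
import pandas as pd


def one_hot_pandas(target, n_classes=6):
    """One-hot encode integer labels, always returning n_classes columns.

    Hypothetical helper for illustration; not part of pandas.
    """
    labels = np.asarray(target).ravel()  # (N, 1) -> (N,)
    # Declaring the categories up front keeps a column for labels that never occur.
    cat = pd.Categorical(labels, categories=np.arange(n_classes))
    # dtype is explicit because the get_dummies default became bool in pandas 2.0.
    return pd.get_dummies(cat, dtype=np.uint8).to_numpy()


target = np.array([[0], [1], [0], [4], [3], [5]])  # label 2 never occurs
encoded = one_hot_pandas(target)
print(encoded.shape)  # (6, 6)
```

The column for label 2 is present and all zeros, so the output width does not depend on which labels happen to appear in the data.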

Answer 2

Score: 0

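I also solved it using to_categorical from keras.utils.np_utils (importable directly from keras.utils in recent Keras versions), and it takes just a second for this amount of data:

import numpy as np
from keras.utils import to_categorical  # originally: keras.utils.np_utils

def convert_to_num_class(target):
    # np.int was removed in NumPy 1.24; the builtin int is equivalent here
    target = target.astype(int)
    return to_categorical(target, len(np.unique(target)))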
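
One caveat with deriving the class count from the data: if a label never occurs, len(np.unique(target)) is smaller than the number of columns required. The NumPy-only sketch below (it does not call Keras, and the sample array is made up for illustration) shows the mismatch.

```python
import numpy as np

target = np.array([0, 1, 0, 4, 3, 5])  # label 2 never occurs

inferred = len(np.unique(target))  # number of distinct labels actually seen
needed = int(target.max()) + 1     # columns required to index the largest label

print(inferred, needed)  # 5 6
```

When the number of classes is known in advance (6 in the question), passing that fixed value as the num_classes argument of to_categorical is likely the safer choice.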

Answer 3

Score: 0

No need to resort to pandas or keras, just index using a tuple:

import numpy as np

categories = 6
N = 10
target = np.random.randint(categories, size=(N,1)) # this should be your data

y = np.zeros((N, categories), dtype=np.uint8)
mask = (np.arange(N), target.flatten())
y[mask] = 1

Performance check:

def one_hot(target, categories=None):
    target = target.flatten()
    N = target.size
    if categories is None:
        # labels are used directly as column indices, so max + 1 columns are needed
        categories = target.max() + 1
    y = np.zeros((N, categories), dtype=np.uint8)
    mask = (np.arange(N), target)
    y[mask] = 1
    return y

N = 1034892
cats = 6
r = np.random.randint(cats, size=(N))

%timeit one_hot(r)
# 9.63 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

import pandas as pd
%timeit pd.get_dummies(r).to_numpy()
# 18.2 ms ± 183 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
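
As a sanity check, the tuple-indexing approach can be compared against a vectorized form of the np.eye lookup from the question. This is a sketch under assumptions: the function names one_hot_eye and one_hot_index and the fixed n_classes=6 default are illustrative choices, not taken from the answers.

```python
import numpy as np


def one_hot_eye(target, n_classes=6):
    """Vectorized form of the question's np.eye lookup (no Python loop)."""
    labels = np.asarray(target).ravel().astype(np.intp)
    # Fancy indexing selects one identity-matrix row per label in a single call.
    return np.eye(n_classes, dtype=np.uint8)[labels]


def one_hot_index(target, n_classes=6):
    """Tuple-indexing approach with a fixed number of classes."""
    labels = np.asarray(target).ravel().astype(np.intp)
    y = np.zeros((labels.size, n_classes), dtype=np.uint8)
    y[np.arange(labels.size), labels] = 1
    return y


rng = np.random.default_rng(0)
target = rng.integers(0, 6, size=(1000, 1))  # same (N, 1) layout as the question
a = one_hot_eye(target)
b = one_hot_index(target)
print(a.shape, np.array_equal(a, b))  # (1000, 6) True
```

Both functions allocate the result once, which avoids the repeated copying that np.vstack performs inside a loop.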

huangapple
  • Posted on January 6, 2020 at 23:55:04
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59615147.html