January 6, 2020 23:55:04

Convert one column to specific number of columns

Question

I'm trying to convert a single column of integer labels with values 0 through 5 into six columns, one per value. For example, if a value is 0, the first of the six columns becomes 1 and the others become 0, and so on. However, since my target has shape (1034892, 1), this takes a very long time and sometimes even crashes. The code below worked for 500,000 rows, but it does not work for the full data set.

Is there a way to make this work for this amount of data?

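import numpy as np

def convert_to_num_class(target):
    # Builds the result row by row; every np.vstack call copies the whole
    # array built so far, so the total cost grows quadratically with len(target).
    for i, value in enumerate(target):
        if i == 0:
            y_new = np.array(np.eye(6)[int(value[0])])
        else:
            y_new = np.vstack((y_new, np.eye(6)[int(value[0])]))
    return y_new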

Answer 1

Score: 1

Using pandas get_dummies:

>>> import numpy as np
>>> import pandas as pd
>>> target = np.random.randint(6, size=(10, 1))  # the original target is of shape (1034892, 1)
>>> target = target.flatten()
>>> target
array([0, 1, 0, 0, 4, 3, 1, 5, 4, 5])
>>> pd.get_dummies(target).to_numpy()
array([[1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1]])

If your target does not contain every value in the range you want (as in the example above, where target has no 2), the columns for the missing values will be absent. One workaround is the following:

>>> target = pd.Categorical(target, categories=np.arange(6))
>>> pd.get_dummies(target).to_numpy()
array([[1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 1]])

It is very fast, even for a target of shape (1034892, 1) like yours.
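
The two steps above can be folded into one helper. This is only a minimal sketch: the name one_hot_pandas and the n_classes parameter are made up for illustration, and dtype is passed explicitly because pandas 2.x returns boolean columns from get_dummies by default, which would not match the 0/1 output shown above.

```python
import numpy as np
import pandas as pd

def one_hot_pandas(target, n_classes=6):
    """One-hot encode integer labels into a fixed number of columns (hypothetical helper)."""
    labels = np.asarray(target).ravel()
    # Declaring the categories up front keeps a column for classes that never occur.
    cat = pd.Categorical(labels, categories=np.arange(n_classes))
    return pd.get_dummies(cat, dtype=np.uint8).to_numpy()

target = np.array([[0], [1], [0], [0], [4], [3], [1], [5], [4], [5]])
encoded = one_hot_pandas(target)
print(encoded.shape)  # (10, 6)
```

The column for class 2 is present but all zeros, since that value does not occur in this sample.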

Answer 2

Score: 0

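I also solved it using to_categorical from Keras (keras.utils.np_utils at the time of writing), and it takes just a second for this amount of data:

import numpy as np
from keras.utils import to_categorical  # formerly also at keras.utils.np_utils

def convert_to_num_class(target):
    target = target.astype(int)  # np.int was removed in NumPy 1.24; builtin int is equivalent
    # len(np.unique(target)) assumes every class occurs at least once in target
    return to_categorical(target, len(np.unique(target)))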
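
One caveat with inferring the class count from len(np.unique(target)): if a class never appears in the data, the count comes out too small. The sketch below uses only NumPy and a made-up label array to show the mismatch. Passing a fixed count as the second argument of to_categorical (6 here, an assumption based on the question) sidesteps it.

```python
import numpy as np

labels = np.array([0, 1, 0, 0, 4, 3, 1, 5, 4, 5])  # class 2 never occurs

n_unique = len(np.unique(labels))    # number of distinct labels actually present
n_from_max = int(labels.max()) + 1   # smallest width that fits every label as a column index

print(n_unique, n_from_max)  # 5 6
```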

Answer 3

Score: 0

No need to resort to pandas or keras, just index using a tuple:

import numpy as np
categories = 6
N = 10
target = np.random.randint(categories, size=(N,1)) # this should be your data
y = np.zeros((N, categories), dtype=np.uint8)
mask = (np.arange(N), target.flatten())
y[mask] = 1
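
A closely related alternative keeps the np.eye lookup from the question but applies it to all rows at once instead of one row per loop iteration. This is a sketch rather than part of the answer above; the helper name one_hot_eye and its n_classes default are assumptions.

```python
import numpy as np

def one_hot_eye(target, n_classes=6):
    """Vectorized form of the question's np.eye row lookup (hypothetical helper)."""
    labels = np.asarray(target, dtype=np.intp).ravel()
    # Fancy indexing picks one identity-matrix row per label in a single call.
    return np.eye(n_classes, dtype=np.uint8)[labels]

target = np.array([[0], [1], [0], [0], [4], [3], [1], [5], [4], [5]])
y = one_hot_eye(target)
print(y.shape)  # (10, 6)
print(y[4])     # [0 0 0 0 1 0]
```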

Performance check:

def one_hot(target, categories=None):
    target = target.flatten()
    N = target.size
    if categories is None:
        # Labels are used directly as column indices, so max + 1 columns are needed
        categories = target.max() + 1
    y = np.zeros((N, categories), dtype=np.uint8)
    mask = (np.arange(N), target)
    y[mask] = 1
    return y

# The %timeit lines below are IPython magics
N = 1034892
cats = 6
r = np.random.randint(cats, size=(N))
%timeit one_hot(r)
# 9.63 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
import pandas as pd
%timeit pd.get_dummies(r).to_numpy()
# 18.2 ms ± 183 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
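
Outside IPython, where %timeit is not available, a similar measurement can be sketched with the standard timeit module. The example below compares tuple indexing with an np.eye row lookup. The latter is not part of the measurement above, and both function names are made up for illustration. Timings depend on the machine, so none are quoted here.

```python
import timeit
import numpy as np

def one_hot_index(labels, n_classes):
    # Tuple indexing: set one cell per row in a preallocated zero array.
    y = np.zeros((labels.size, n_classes), dtype=np.uint8)
    y[np.arange(labels.size), labels] = 1
    return y

def one_hot_eye(labels, n_classes):
    # Row lookup: pick one identity-matrix row per label.
    return np.eye(n_classes, dtype=np.uint8)[labels]

rng = np.random.default_rng(0)
labels = rng.integers(6, size=1034892)  # same row count as in the question

# Both variants should produce identical encodings.
assert np.array_equal(one_hot_index(labels, 6), one_hot_eye(labels, 6))

results = {}
for fn in (one_hot_index, one_hot_eye):
    # Best of 3 repeats, 5 calls each, reported per call.
    best = min(timeit.repeat(lambda: fn(labels, 6), number=5, repeat=3)) / 5
    results[fn.__name__] = best
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```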
