How to pivot a DataFrame into ML format


Question

My head is spinning trying to figure out if I have to use pivot_table, melt, or some other function.

I have a DF that looks like this:

   month  day  week_day  classname_en  origin  destination
0      1    7         2             1       2            5
1      1    2         6             2       1          167
2      2    1         5             1       2           54
3      2    2         6             4       1            6
4      1    2         6             5       6            1

But I want to turn it into something like:

   month_1  month_2  ...  classname_en_1  classname_en_2  ...  origin_1  origin_2  ...  destination_1
0        1        0  ...               1               0  ...         0         1  ...              0
1        1        0  ...               0               1  ...         1         0  ...              0
2        0        1  ...               1               0  ...         0         1  ...              0
3        0        1  ...               0               0  ...         1         0  ...              0
4        1        0  ...               0               0  ...         0         0  ...              1

Basically, turn every value into its own column and fill the rows with binary flags: 1 if the row has that value, 0 if it does not.

I don't know whether this is possible with a single function, but I would appreciate any help!

Answer 1

Score: 1

Use pd.get_dummies:

out = pd.get_dummies(df, columns=df.columns)
print(out)

# Output
   month_1  month_2  day_1  day_2  day_7  week_day_2  week_day_5  ...  origin_2  origin_6  destination_1  destination_5  destination_6  destination_54  destination_167
0        1        0      0      0      1           1           0  ...         1         0              0              1              0               0                0
1        1        0      0      1      0           0           0  ...         0         0              0              0              0               0                1
2        0        1      1      0      0           0           1  ...         1         0              0              0              0               1                0
3        0        1      0      1      0           0           0  ...         0         0              0              0              1               0                0
4        1        0      0      1      0           0           0  ...         0         1              1              0              0               0                0

[5 rows x 20 columns]
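
One caveat worth noting: in pandas 2.0 and later, get_dummies returns boolean columns by default, so the same call prints True/False instead of 1/0. The following is a minimal sketch, assuming the question's data is rebuilt by hand as df, that asks for integer columns explicitly with dtype=int:

```python
import pandas as pd

# The question's data, rebuilt by hand (an assumption: the original df is not shown as code here).
df = pd.DataFrame({
    "month": [1, 1, 2, 2, 1],
    "day": [7, 2, 1, 2, 2],
    "week_day": [2, 6, 5, 6, 6],
    "classname_en": [1, 2, 1, 4, 5],
    "origin": [2, 1, 2, 1, 6],
    "destination": [5, 167, 54, 6, 1],
})

# dtype=int forces 0/1 output regardless of the pandas version's default dtype.
out = pd.get_dummies(df, columns=df.columns, dtype=int)
print(out.shape)
print(out[["month_1", "month_2"]])
```

Without dtype=int the shape and column names are the same; only the cell values differ (booleans instead of integers).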

Answer 2

Score: 1

To expand on @Corralien's answer:

It is indeed one way to do it, but since you are doing this for ML purposes, it can introduce a bug.
With the code above you get a matrix with 20 features. Now suppose you want to predict on data that contains a month your training data never had: the matrix built from the prediction data would then have 21 features, so you cannot pass it to your fitted model.

To overcome this you can use OneHotEncoder from scikit-learn. It makes sure that "new data" always ends up with the same number of features as your training data.

import pandas as pd

df_train = pd.DataFrame({"color":["red","blue"],"age":[10,15]})
pd.get_dummies(df_train)

# output
   age  color_blue  color_red
0   10           0          1
1   15           1          0


df_new = pd.DataFrame({"color":["red","blue","green"],"age":[10,15,20]})
pd.get_dummies(df_new)

#output

   age  color_blue  color_green  color_red
0   10           0            0          1
1   15           1            0          0
2   20           0            1          0

As you can see, the new data produces an extra column (color_green), and the position of color_red has shifted as well.
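
If you want to stay with get_dummies, one common pandas-only workaround (not part of the original answer) is to reindex the new dummy frame onto the training columns. This is a sketch under that assumption; the variable names train_dummies, new_dummies and aligned are made up for illustration:

```python
import pandas as pd

df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})
df_new = pd.DataFrame({"color": ["red", "blue", "green"], "age": [10, 15, 20]})

train_dummies = pd.get_dummies(df_train, dtype=int)
new_dummies = pd.get_dummies(df_new, dtype=int)

# Keep exactly the training columns, in the training order.
# Columns unseen in training (color_green) are dropped;
# training columns missing from the new data would be filled with 0.
aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(aligned)
```

This requires you to store the training column list yourself, which is the bookkeeping a fitted encoder does for you.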

If, on the other hand, we use OneHotEncoder, we can avoid all of those issues:

from sklearn.preprocessing import OneHotEncoder

df_train = pd.DataFrame({"color":["red","blue"],"age":[10,15]})
ohe = OneHotEncoder(handle_unknown="ignore") 

color_ohe_transformed = ohe.fit_transform(df_train[["color"]])  # creates a sparse matrix

ohe_features = ohe.get_feature_names_out() # [color_blue, color_red]

pd.DataFrame(color_ohe_transformed.todense(),columns = ohe_features, dtype=int)

# output
   color_blue  color_red
0           0          1  
1           1          0      


# now transform new data

df_new = pd.DataFrame({"color":["red","blue","green"],"age":[10,15,20]})

new_data_ohe_transformed = ohe.transform(df_new[["color"]])
pd.DataFrame(new_data_ohe_transformed.todense(), columns=ohe_features, dtype=int)

#output

   color_blue  color_red
0           0          1
1           1          0
2           0          0

Note that in the last row both color_blue and color_red are zero, since that row has color="green", which was not present in the training data.

Note that the todense() function is only used here to illustrate how it works. Usually you would keep the result as a sparse matrix and use e.g. scipy.sparse.hstack to append your other features, such as age, to it.
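
The sparse route mentioned above could look roughly like the following. It is only a sketch: the names X_train and X_new and the choice to wrap age in a csr_matrix before stacking are assumptions for illustration, not part of the original answer.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import OneHotEncoder

df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})
df_new = pd.DataFrame({"color": ["red", "blue", "green"], "age": [10, 15, 20]})

ohe = OneHotEncoder(handle_unknown="ignore")
color_train = ohe.fit_transform(df_train[["color"]])  # sparse, columns: blue, red
color_new = ohe.transform(df_new[["color"]])          # same two columns; "green" becomes all zeros

# Append the numeric "age" column on the right without densifying the encoded part.
X_train = hstack([color_train, csr_matrix(df_train[["age"]].to_numpy())]).tocsr()
X_new = hstack([color_new, csr_matrix(df_new[["age"]].to_numpy())]).tocsr()

print(X_train.shape, X_new.shape)
```

Both matrices end up with three columns (color_blue, color_red, age), so X_new can be passed to a model fitted on X_train.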

Answer 3

Score: 1

You can use the pandas get_dummies function to turn each distinct value in a column into its own indicator column.

For that, your code would be:

import pandas as pd

df = pd.DataFrame({
    'month': [1, 1, 2, 2, 1],
    'day': [7, 2, 1, 2, 2],
    'week_day': [2, 6, 5, 6, 6],
    'classname_en': [1, 2, 1, 4, 5],
    'origin': [2, 1, 2, 1, 6],
    'destination': [5, 167, 54, 6, 1]
})

response = pd.get_dummies(df, columns=df.columns)
print(response)


huangapple
  • Posted on February 6, 2023, 18:20:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75360007.html