How to pivot dataframe into ML format
Question
My head is spinning trying to figure out if I have to use pivot_table, melt, or some other function.
I have a DF that looks like this:
> month day week_day classname_en origin destination
> 0 1 7 2 1 2 5
> 1 1 2 6 2 1 167
> 2 2 1 5 1 2 54
> 3 2 2 6 4 1 6
> 4 1 2 6 5 6 1
But I want to turn it into something like:
> month_1 month_2 ...classname_en_1 classname_en_2 ... origin_1 origin_2 ...destination_1
> 0 1 0 1 0 0 1 0
> 1 1 0 0 1 1 0 0
> 2 0 1 1 0 0 1 0
> 3 0 1 0 0 1 0 0
> 4 1 0 0 0 0 0 1
Basically, turn every value into its own column and fill each row with binary flags: 1 if the row has that value, 0 if it does not.
I don't know whether this is possible with a single function, but I would appreciate any help!
Answer 1
Score: 1
Use pd.get_dummies:
out = pd.get_dummies(df, columns=df.columns)
print(out)
# Output
month_1 month_2 day_1 day_2 day_7 week_day_2 week_day_5 ... origin_2 origin_6 destination_1 destination_5 destination_6 destination_54 destination_167
0 1 0 0 0 1 1 0 ... 1 0 0 1 0 0 0
1 1 0 0 1 0 0 0 ... 0 0 0 0 0 0 1
2 0 1 1 0 0 0 1 ... 1 0 0 0 0 1 0
3 0 1 0 1 0 0 0 ... 0 0 0 0 1 0 0
4 1 0 0 1 0 0 0 ... 0 1 1 0 0 0 0
[5 rows x 20 columns]
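The 0/1 output shown above comes from an older pandas release. Since pandas 2.0, get_dummies returns boolean columns (True/False) by default. A minimal sketch, assuming the question's data and the standard dtype parameter of get_dummies, that forces integer flags on any recent version:

```python
import pandas as pd

# The question's sample data.
df = pd.DataFrame({
    "month": [1, 1, 2, 2, 1],
    "day": [7, 2, 1, 2, 2],
    "week_day": [2, 6, 5, 6, 6],
    "classname_en": [1, 2, 1, 4, 5],
    "origin": [2, 1, 2, 1, 6],
    "destination": [5, 167, 54, 6, 1],
})

# dtype=int yields 0/1 columns instead of the True/False default in pandas >= 2.0.
out = pd.get_dummies(df, columns=df.columns, dtype=int)
print(out.shape)
print(out.head())
```

Without dtype=int the result carries the same information, only as booleans, which most ML libraries also accept.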
Answer 2
Score: 1
To expand on @Corralien's answer:
That is indeed one way to do it, but since you are doing this for ML purposes, it can introduce a bug.
With the code above you get a matrix with 20 features. Now suppose you want to predict on data that contains a month value that never appeared in your training data. The matrix built from the prediction data would then have 21 features, so you cannot pass it to your fitted model.
To overcome this, you can use OneHotEncoder from scikit-learn. Once fitted, it produces the same set of feature columns for "new data" as it did for your training data.
import pandas as pd
df_train = pd.DataFrame({"color":["red","blue"],"age":[10,15]})
pd.get_dummies(df_train)
# output
age color_blue color_red
0 10 0 1
1 15 1 0
df_new = pd.DataFrame({"color":["red","blue","green"],"age":[10,15,20]})
pd.get_dummies(df_new)
# output
age color_blue color_green color_red
0 10 0 0 1
1 15 1 0 0
2 20 0 1 0
As you can see, a new color_green column has appeared, and color_red has moved from the second dummy column to the third, so the columns no longer line up with the training matrix.
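If you want to stay with pandas, one possible workaround (not part of the original answer, just a sketch) is to remember the training columns and reindex the new dummies against them. The names train_cols and X_new are illustrative.

```python
import pandas as pd

df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})
df_new = pd.DataFrame({"color": ["red", "blue", "green"], "age": [10, 15, 20]})

# Encode the training data and remember its column layout.
X_train = pd.get_dummies(df_train, dtype=int)
train_cols = X_train.columns

# Encode the new data, then force it into the training layout:
# unseen categories (color_green) are dropped, missing ones are filled with 0.
X_new = pd.get_dummies(df_new, dtype=int).reindex(columns=train_cols, fill_value=0)
print(X_new)
```

This keeps the feature count and order stable, but you have to carry train_cols around yourself, which is the bookkeeping a fitted encoder does for you.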
If, on the other hand, you use OneHotEncoder, you can avoid all of these issues:
from sklearn.preprocessing import OneHotEncoder
df_train = pd.DataFrame({"color":["red","blue"],"age":[10,15]})
ohe = OneHotEncoder(handle_unknown="ignore")
color_ohe_transformed = ohe.fit_transform(df_train[["color"]])  # returns a sparse matrix
ohe_features = ohe.get_feature_names_out()  # ["color_blue", "color_red"]
pd.DataFrame(color_ohe_transformed.todense(), columns=ohe_features, dtype=int)
# output
color_blue color_red
0 0 1
1 1 0
# now transform new data
df_new = pd.DataFrame({"color":["red","blue","green"],"age":[10,15,20]})
new_data_ohe_transformed = ohe.transform(df_new[["color"]])
pd.DataFrame(new_data_ohe_transformed.todense(), columns=ohe_features, dtype=int)
# output
color_blue color_red
0 0 1
1 1 0
2 0 0
Note that in the last row both color_blue and color_red are zero, because that row has color="green", which was not present in the training data.
Note that todense() is used here only to show the result. Usually you would keep the output as a sparse matrix and use e.g. scipy.sparse.hstack to append your other features, such as age, to it.
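A minimal sketch of that last step, assuming the same toy data as above. The variable names X_train and X_new are illustrative, and wrapping the age column in scipy.sparse.csr_matrix is one way to do it, not the only one.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})
df_new = pd.DataFrame({"color": ["red", "blue", "green"], "age": [10, 15, 20]})

ohe = OneHotEncoder(handle_unknown="ignore")
color_train = ohe.fit_transform(df_train[["color"]])  # sparse, 2 rows x 2 columns
color_new = ohe.transform(df_new[["color"]])          # sparse, 3 rows x 2 columns

# Wrap the numeric column as a sparse matrix and stack it to the right
# of the one-hot columns, without ever building a dense matrix.
X_train = sparse.hstack([color_train, sparse.csr_matrix(df_train[["age"]].to_numpy())]).tocsr()
X_new = sparse.hstack([color_new, sparse.csr_matrix(df_new[["age"]].to_numpy())]).tocsr()

print(X_train.shape, X_new.shape)
```

Both matrices end up with three columns (color_blue, color_red, age), so X_new can be passed to a model fitted on X_train.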
Answer 3
Score: 1
You can use the pandas get_dummies function to turn each distinct value in a column into its own indicator column.
For your data, the code would be:
import pandas as pd
df = pd.DataFrame({
'month': [1, 1, 2, 2, 1],
'day': [7, 2, 1, 2, 2],
'week_day': [2, 6, 5, 6, 6],
'classname_en': [1, 2, 1, 4, 5],
'origin': [2, 1, 2, 1, 6],
'destination': [5, 167, 54, 6, 1]
})
response = pd.get_dummies(df, columns=df.columns)
print(response)