February 6, 2023, 18:20:18

How to pivot dataframe into ML format

Question

My head is spinning trying to figure out if I have to use pivot_table, melt, or some other function.

I have a DF that looks like this:

   month  day  week_day  classname_en  origin  destination
0      1    7         2             1       2            5
1      1    2         6             2       1          167
2      2    1         5             1       2           54
3      2    2         6             4       1            6
4      1    2         6             5       6            1

But I want to turn it into something like:

   month_1  month_2  ...  classname_en_1  classname_en_2  ...  origin_1  origin_2  ...  destination_1
0        1        0  ...               1               0  ...         0         1  ...              0
1        1        0  ...               0               1  ...         1         0  ...              0
2        0        1  ...               1               0  ...         0         1  ...              0
3        0        1  ...               0               0  ...         1         0  ...              0
4        1        0  ...               0               0  ...         0         0  ...              1

Basically, turn every value into its own column, with each row holding 1 if that value is present and 0 otherwise.

I don't know if this is possible with a single function, but I would appreciate any help!

Answer 1

Score: 1

Use pd.get_dummies:

out = pd.get_dummies(df, columns=df.columns)
print(out)
# Output
   month_1  month_2  day_1  day_2  day_7  week_day_2  week_day_5  ...  origin_2  origin_6  destination_1  destination_5  destination_6  destination_54  destination_167
0        1        0      0      0      1           1           0  ...         1         0              0              1              0               0                0
1        1        0      0      1      0           0           0  ...         0         0              0              0              0               0                1
2        0        1      1      0      0           0           1  ...         1         0              0              0              0               1                0
3        0        1      0      1      0           0           0  ...         0         0              0              0              1               0                0
4        1        0      0      1      0           0           0  ...         0         1              1              0              0               0                0
[5 rows x 20 columns]
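
A note on versions, offered as a hedged sketch rather than part of the original answer: in pandas 2.0 and later, get_dummies returns boolean columns by default, so the printed frame shows True/False instead of 1/0. Passing dtype=int should give the integer form shown above. The DataFrame below is rebuilt from the sample data in the question.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "month": [1, 1, 2, 2, 1],
    "day": [7, 2, 1, 2, 2],
    "week_day": [2, 6, 5, 6, 6],
    "classname_en": [1, 2, 1, 4, 5],
    "origin": [2, 1, 2, 1, 6],
    "destination": [5, 167, 54, 6, 1],
})

# dtype=int requests 0/1 integers regardless of the version's default dtype
out = pd.get_dummies(df, columns=df.columns, dtype=int)
print(out.shape)
print(out.columns.tolist()[:5])
```

Every column is encoded here because columns=df.columns lists all of them; pass a shorter list to encode only some columns and leave the rest untouched.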

Answer 2

Score: 1

To expand on @Corralien's answer:

That is indeed one way to do it, but since this is for ML purposes, it can introduce a bug.
With the code above you get a matrix with 20 features. Now, say you want to predict on data that contains a month value not seen in your training data. The matrix built from the prediction data would then have 21 features, so you cannot pass it to your fitted model.

To overcome this, you can use OneHotEncoder from scikit-learn. It makes sure that "new data" always has the same number of features as your training data.

import pandas as pd
df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})
pd.get_dummies(df_train)
# output
   age  color_blue  color_red
0   10           0          1
1   15           1          0
df_new = pd.DataFrame({"color": ["red", "blue", "green"], "age": [10, 15, 20]})
pd.get_dummies(df_new)
# output
   age  color_blue  color_green  color_red
0   10           0            0          1
1   15           1            0          0
2   20           0            1          0

As you can see, the new data produces an extra column (color_green), so both the number of columns and the position of color_red no longer match the training data.
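
If you would rather stay with get_dummies, one common workaround (not part of the original answer) is to remember the training columns and reindex the encoded new data against them. A minimal sketch, where the names X_train, X_new and train_cols are illustrative only:

```python
import pandas as pd

df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})
df_new = pd.DataFrame({"color": ["red", "blue", "green"], "age": [10, 15, 20]})

X_train = pd.get_dummies(df_train, dtype=int)
train_cols = X_train.columns  # the column layout the model was fitted on

# Columns unseen in training (color_green) are dropped;
# training columns missing from the new data would be added and filled with 0
X_new = pd.get_dummies(df_new, dtype=int).reindex(columns=train_cols, fill_value=0)
print(X_new)
```

This keeps the feature count and order stable, but you have to store train_cols alongside the model yourself.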

If, on the other hand, you use OneHotEncoder, you can avoid all of those issues:

from sklearn.preprocessing import OneHotEncoder
df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})
ohe = OneHotEncoder(handle_unknown="ignore")
color_ohe_transformed = ohe.fit_transform(df_train[["color"]])  # creates sparse matrix
ohe_features = ohe.get_feature_names_out()  # [color_blue, color_red]
pd.DataFrame(color_ohe_transformed.todense(), columns=ohe_features, dtype=int)
# output
   color_blue  color_red
0           0          1
1           1          0
# now transform new data
df_new = pd.DataFrame({"color": ["red", "blue", "green"], "age": [10, 15, 20]})
new_data_ohe_transformed = ohe.transform(df_new[["color"]])
pd.DataFrame(new_data_ohe_transformed.todense(), columns=ohe_features, dtype=int)
# output
   color_blue  color_red
0           0          1
1           1          0
2           0          0

Note that in the last row both blue and red are zero, since that row has color="green", which was not present in the training data.

Note that todense() is only used here to illustrate how it works. Usually you would keep the result as a sparse matrix and use e.g. scipy.sparse.hstack to append your other features, such as age, to it.
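
The sparse route mentioned above can be sketched as follows. This is one possible arrangement, not the answer author's code; the names color_train, color_new, X_train and X_new are illustrative, and wrapping age in csr_matrix is just one way to make it stackable.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})
df_new = pd.DataFrame({"color": ["red", "blue", "green"], "age": [10, 15, 20]})

ohe = OneHotEncoder(handle_unknown="ignore")
color_train = ohe.fit_transform(df_train[["color"]])  # sparse, 2 columns
color_new = ohe.transform(df_new[["color"]])          # sparse, still 2 columns

# Wrap the numeric column as a sparse matrix and stack it to the right
age_train = sparse.csr_matrix(df_train[["age"]].to_numpy())
age_new = sparse.csr_matrix(df_new[["age"]].to_numpy())
X_train = sparse.hstack([color_train, age_train]).tocsr()
X_new = sparse.hstack([color_new, age_new]).tocsr()

print(X_train.shape, X_new.shape)
```

Both matrices end up with three columns (color_blue, color_red, age), so the new data can go to a model fitted on X_train even though it contains the unseen color "green".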

Answer 3

Score: 1

You can use the get_dummies function from pandas to convert each value in a column into its own indicator column.

For that, your code will be:

import pandas as pd
df = pd.DataFrame({
    'month': [1, 1, 2, 2, 1],
    'day': [7, 2, 1, 2, 2],
    'week_day': [2, 6, 5, 6, 6],
    'classname_en': [1, 2, 1, 4, 5],
    'origin': [2, 1, 2, 1, 6],
    'destination': [5, 167, 54, 6, 1]
})
response = pd.get_dummies(df, columns=df.columns)
print(response)

Result: the same 5 rows x 20 columns frame of indicator columns as the get_dummies output shown earlier.
