Reduction of model accuracy while using PCA for a regression problem

Question

I am trying to build a prediction model for flight fares. My dataset has several categorical variables such as class, hour, day of week, day of month, and month of year. I am using multiple algorithms such as XGBoost and an ANN to fit the model.

Initially I one-hot encoded these variables, which led to a total of 90 variables. When I tried to fit a model to this data, the training r2_score was high, around 0.90, but the test score was relatively low (0.60).

I then applied sine and cosine transformations to the temporal variables, which brought the total down to only 27 variables. With these, training accuracy dropped to 0.83 but the test score increased to 0.70.

Thinking that my variables might be sparse, I tried PCA, but this drastically reduced performance on both the train set and the test set.

So I have a few questions about this.

  1. Why is PCA not helping, and in turn reducing the performance of my model so badly?
  2. Any suggestions on how to improve my model performance?

Code:


from xgboost import XGBRegressor
import numpy as np
import pandas as pd

# Load the data and drop the serial-number column, which carries no signal
dataset = pd.read_excel('Airline Dataset1.xlsx', sheet_name='Airline Dataset1')
dataset = dataset.drop(columns=['SL. No.'])

# Shift the time column, then map 24 back to 0 so hours fall in the 0-23
# range expected by the cyclic encoding below
dataset['time'] = dataset['time'] - 24
dataset['time'] = np.where(dataset['time'] == 24, 0, dataset['time'])

cat_cols = ['demand', 'from_ind', 'to_ind']          # one-hot encoded below
cyc_cols = ['time', 'weekday', 'month', 'monthday']  # sine/cosine encoded below

def cyclic_encode(data, col, col_max):
    # Project a cyclic feature onto the unit circle so the ends of the
    # cycle land next to each other instead of far apart
    data[col + '_sin'] = np.sin(2 * np.pi * data[col] / col_max)
    data[col + '_cos'] = np.cos(2 * np.pi * data[col] / col_max)
    return data

cyclic_encode(dataset,'time',23)
cyclic_encode(dataset,'weekday',6)
cyclic_encode(dataset,'month',11)
cyclic_encode(dataset,'monthday',31)

dataset = dataset.drop(columns=cyc_cols)


ohe_dataset = pd.get_dummies(dataset, columns=cat_cols, drop_first=True)
X = ohe_dataset.iloc[:, :-1]    # all feature columns
y = ohe_dataset.iloc[:, 27:28]  # the fare column, kept 2-D for the scaler

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train_us, X_test_us, y_train_us, y_test_us = train_test_split(X, y, test_size = 0.2, random_state = 0)


# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_Y = StandardScaler()
X_train = sc_X.fit_transform(X_train_us)
X_test = sc_X.transform(X_test_us)

y_train = sc_Y.fit_transform(y_train_us)
y_test = sc_Y.transform(y_test_us)


# Applying PCA (PCA is unsupervised, so no y is passed to fit_transform)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)

X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_

regressor = XGBRegressor()
model = regressor.fit(X_train, y_train.ravel())  # ravel to the 1-D target XGBoost expects

# Predicting the test & train sets with the fitted regressor; predictions are
# reshaped to 2-D because StandardScaler.inverse_transform expects 2-D input
y_pred = regressor.predict(X_test)
y_pred = sc_Y.inverse_transform(y_pred.reshape(-1, 1))
y_pred_train = regressor.predict(X_train)
y_pred_train = sc_Y.inverse_transform(y_pred_train.reshape(-1, 1))
y_train = sc_Y.inverse_transform(y_train)
y_test = sc_Y.inverse_transform(y_test)


# Calculate r2_score on the original fare scale
from sklearn.metrics import r2_score
score_train = r2_score(y_train,y_pred_train)
score_test = r2_score(y_test,y_pred)

Thanks

Answer 1

Score: 1

You don't really need PCA for such a low-dimensional problem. Decision trees perform very well even with thousands of variables.

Here are a few things you can try:

  1. Pass a watchlist and train until you stop overfitting on the validation set (see the first sketch after this list). https://github.com/dmlc/xgboost/blob/2d95b9a4b6d87e9f630c59995403988dee390c20/demo/guide-python/basic_walkthrough.py#L64
  2. Try all the sine/cosine transformations together with the one-hot encodings and build a model (again with a watchlist).
  3. Look for more causal data. Seasonal patterns alone do not cause air-fare fluctuations. For a start, you can add flags for festivals, holidays, and important dates, and also engineer features for proximity to those days (see the second sketch below). Weather data is also easy to find and add.
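For point 1, here is a minimal sketch of a watchlist with early stopping via XGBoost's scikit-learn API, assuming the scaled X_train/y_train arrays from the question. The n_estimators, learning_rate, and 50-round patience values are illustrative; on older xgboost versions, early_stopping_rounds is passed to fit() rather than to the constructor.

from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Carve a validation set out of the training data to serve as the watchlist
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0)

# Stop adding trees once validation error has not improved for 50 rounds
regressor = XGBRegressor(n_estimators=1000, learning_rate=0.05,
                         early_stopping_rounds=50)
regressor.fit(X_tr, y_tr.ravel(),
              eval_set=[(X_val, y_val.ravel())],
              verbose=False)

print(regressor.best_iteration)  # boosting round where validation error stopped improving

For point 3, a sketch of a holiday flag and a proximity feature. The holidays list and the flight_date series are hypothetical placeholders; the question's dataset only exposes time/weekday/month/monthday, so an actual date column would have to be reconstructed first.

import pandas as pd

# Hypothetical holiday calendar; in practice load a real one for the relevant market
holidays = pd.to_datetime(['2020-01-01', '2020-01-26', '2020-12-25'])

# flight_date is assumed to be a pandas Series of datetimes
is_holiday = flight_date.isin(holidays).astype(int)
days_to_holiday = flight_date.map(
    lambda d: min(abs((d - h).days) for h in holidays))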

PCA usually helps in cases of extreme dimensionality, such as genome data, or when the algorithm involved does not handle high-dimensional data well, such as kNN.
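Relatedly, a quick way to see how much signal a fixed n_components=2 throws away is to inspect the cumulative explained variance first; a minimal sketch against the scaled X_train from the question:

import numpy as np
from sklearn.decomposition import PCA

# Fit PCA with all components and examine the cumulative explained variance;
# whatever lies beyond the first two ratios is what n_components=2 discards
pca_full = PCA().fit(X_train)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
print(cum_var[:5])  # variance captured by the first few components
print(int(np.argmax(cum_var >= 0.95)) + 1, 'components retain 95% of the variance')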
