Linear regression predictions are way off; what should I change?

Question

I'm using Python with the sklearn and pandas libraries for an ML project.
The dataset contains 71k games from Steam with their scores and playtime. I took it from Kaggle; it's called "SteamGames (71k games)" by "MEXWELL", if you want to look at the data yourself.
My goal with this project is to predict the average playtime. I tried different feature combinations in my training set, such as score, price, and recommendations.
My predictions for the playtime are way off, and sometimes they are even negative.

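Here I create the dataframe and split it into train and test:

import pandas as pd
from sklearn.model_selection import train_test_split

games_data = pd.read_csv("games.csv")
games_data = games_data.dropna(axis=0)  # drops every row with a missing value in any column
train, test = train_test_split(games_data, test_size=0.2)
games_data.columns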

Here I select the target and the features for train and test:

y_train = train['Average playtime forever']

game_features = ['Score rank', 'Recommendations']
X_train = train[game_features]

X_train.describe()

y_test = test['Average playtime forever']
X_test = test[game_features]

Here's the training:

from sklearn import linear_model

games_model = linear_model.LinearRegression()
games_model.fit(X_train, y_train)

And these are the predictions:

print("Making predictions for the following 5 games")
print(X_test.head())
print("the predictions are: .\n")
print(games_model.predict(X_test.head()))
print("the values are: ")
print(y_test.head())

I'm a beginner in ML and just tried making something small, hoping to learn from this project. I'd be happy with any pointers or suggested changes to my approach as a whole.

Answer 1

Score: 1

There is a famous saying in the field of machine learning: "A bad model can perform better than a great model if it is given high-quality data."

You have performed neither data cleaning nor data preprocessing, so there may be several reasons why your model is not working. There may be outliers, and you have used very few features, so the model may be underfitting.
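As a rough illustration of the checks mentioned above, the sketch below measures how much of each column is missing, shows how many rows survive `dropna(axis=0)`, and flags outliers with the 1.5×IQR rule. The data is a tiny made-up frame that only borrows the question's column names, and `missing_share` and `iqr_outlier_mask` are hypothetical helper names, not pandas functions. Whether the real dataset looks like this is something you would have to check yourself.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values in each column."""
    return df.isna().mean()


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up stand-in for games.csv (assumption: values invented for illustration).
toy = pd.DataFrame({
    "Score rank": [np.nan, np.nan, 98.0, np.nan, np.nan, np.nan],
    "Recommendations": [0, 12, 150, 3, 40, 250000],
    "Average playtime forever": [0, 35, 120, 10, 60, 9000],
})

# How much of each column is missing?
print(missing_share(toy))

# dropna(axis=0) keeps only rows with no missing value in ANY column,
# so one sparse column can remove most of the data.
print(len(toy), "rows before dropna,", len(toy.dropna(axis=0)), "after")

# Which playtime values look like outliers under the IQR rule?
print(toy[iqr_outlier_mask(toy["Average playtime forever"])])
```

If a column turns out to be mostly empty, it is usually safer to drop that column, or to call `dropna(subset=[...])` on only the columns you actually use, than to drop every incomplete row.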

Perform some data cleaning, feature extraction, and data preprocessing, use more features, and try polynomial regression this time, since complex relationships between many features cannot be represented by simple linear regression.
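One possible way to put this together in scikit-learn is sketched below: polynomial regression is built from `PolynomialFeatures` followed by `LinearRegression`, with `StandardScaler` as a preprocessing step. The `log1p`/`expm1` target transform is an extra assumption of this sketch, not something the answer prescribes; it is included because `expm1` can never return a value below -1, which rules out the large negative playtime predictions described in the question. The data is synthetic, and the `Price` column and all numeric constants are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the real dataset (assumption: invented relationship).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "Recommendations": rng.integers(0, 1000, size=n),
    "Price": rng.uniform(0.0, 60.0, size=n),
})
toy["Average playtime forever"] = np.exp(
    0.5 * np.log1p(toy["Recommendations"])
    + 0.02 * toy["Price"]
    + rng.normal(0.0, 0.3, size=n)
)

X = toy[["Recommendations", "Price"]]
y = toy["Average playtime forever"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale the features, expand them to degree-2 polynomial terms, then fit a
# linear model on the expanded features (this is polynomial regression).
regressor = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)

# Fit on log1p(y) and map predictions back with expm1.
model = TransformedTargetRegressor(
    regressor=regressor, func=np.log1p, inverse_func=np.expm1
)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("MAE on the held-out set:", mean_absolute_error(y_test, pred))
print("Smallest prediction:", pred.min())
```

Comparing the held-out error of this pipeline against the plain `LinearRegression` baseline on the same split is a simple way to see whether the extra terms actually help on your data.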

Good luck!

huangapple
  • Posted on July 3, 2023, 04:05:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76600582.html