why is model.predict(...) always returning the same answer?

Question
I'm trying to use scikit-learn to make predictions based on some client data, to determine a financial benefit estimate based on some answers they give us and based on our historical client projects.
My dataset looks like this:
```
# Data (1-15 of 470)
array(
[[8662824, 34],
[ 7978337, 25],
[ 902219, 28],
[29890885, 64],
[14357494, 60],
[ 6403602, 43],
[96538844, 372],
[ 7675132, 67],
[34807493, 78],
[46215428, 75],
[ 5437889, 20],
[16674835, 50],
[17382472, 20],
[ 5437889, 20],
[ 313111, 0]])
# Targets (1-15 of 470)
array([2739267, 20539, 18304, 16052, 25391, 19444, 61550,
94392, 75934, 52997, 67485, 92263, 37672, 6748523,
    20710])
```
There are 470 rows each in the actual data and targets arrays.
I'm using:
```python
x_train, x_test, y_train, y_test = train_test_split(
data,
targets,
test_size=.25,
random_state=42
)
model = LogisticRegression(max_iter=5000) # 5000 until I learn how to scale
model.fit(x_train, y_train)
# If I run model.predict(...), I get 30000, no matter what
model.predict([[50000, 50]])
```
Here's some actual shell output (note the score, too):
```
In [134]: model.predict([[16000000, 5]])
Out[134]: array([30000])
In [135]: model.predict([[150000, 20]])
Out[135]: array([30000])
In [138]: model.predict(np.array([[21500000000000, 2]]))
Out[138]: array([30000])
In [139]: model.predict(np.array([[21500000000000, -444444]]))
Out[139]: array([30000])
In [140]: model.predict([[2150000, 250]])
Out[140]: array([30000])
In [141]: model.score(x_test, y_test)
Out[141]: 0.009345794392523364
In [144]: model.n_iter_
Out[144]: array([4652], dtype=int32)
```
Here's some metadata from the model (via .__dict__):
```
{'penalty': 'l2',
'dual': False,
'tol': 0.0001,
'C': 1.0,
'fit_intercept': True,
'intercept_scaling': 1,
'class_weight': None,
'random_state': None,
'solver': 'lbfgs',
'max_iter': 5000,
'multi_class': 'auto',
'verbose': 0,
'warm_start': False,
'n_jobs': None,
'l1_ratio': None,
'n_features_in_': 2,
 ...
```
There's definitely more of a relationship between the two data points than a score of .0093 would seem to indicate. After all, we currently use the same data to make these predictions in our heads. Do you know what I'm doing wrong, or in what circumstances it would be normal for a trained model to always return the same answer?
Answer 1
Score: 1
LogisticRegression is for predicting a multi-class discrete target: it treats every distinct value in y_train as a separate class and then predicts whichever class it scores as most probable, which is why it can end up returning the same answer (here 30000) for every input. Since your target is a continuous variable, you should use LinearRegression instead:
```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
```
More info on this post.
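As a minimal sketch of how that would slot into the code from the question, assuming data and targets are the NumPy arrays shown above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# data and targets are the arrays from the question (470 rows each)
x_train, x_test, y_train, y_test = train_test_split(
    data, targets, test_size=0.25, random_state=42
)

model = LinearRegression()
model.fit(x_train, y_train)

# Predictions now vary with the input instead of collapsing to one value
print(model.predict(np.array([[16000000, 5]])))
print(model.predict(np.array([[150000, 20]])))

# For a regressor, score() returns R^2 on the held-out set, not accuracy
print(model.score(x_test, y_test))
```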
Answer 2
Score: 1
Your target value is a continuous variable, so you need to use a regression model. For a simple regression model you can use linear regression or a decision tree; if you want a more complex model, you can use a random forest or gradient boosting. If you use a linear regression model, don't forget to scale your features with a standard scaler or a robust scaler.
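A minimal sketch of that advice, again assuming data and targets are the arrays from the question (the variable names and hyperparameters here are illustrative, not prescribed by the answer):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x_train, x_test, y_train, y_test = train_test_split(
    data, targets, test_size=0.25, random_state=42
)

# Linear regression with scaled features
linear_model = make_pipeline(StandardScaler(), LinearRegression())
linear_model.fit(x_train, y_train)
print(linear_model.score(x_test, y_test))  # R^2 on the held-out set

# Tree ensembles such as a random forest don't require feature scaling
forest = RandomForestRegressor(n_estimators=200, random_state=42)
forest.fit(x_train, y_train)
print(forest.score(x_test, y_test))  # R^2 on the held-out set
```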