Why is model.predict(...) always returning the same answer?

Question

I'm trying to use scikit-learn to make predictions based on some client data, to determine a financial benefit estimate based on some answers they give us and based on our historical client projects.

My dataset looks like this:

 # Data (1-15 of 470)
 array(
    [[8662824,       34],
    [ 7978337,       25],
    [  902219,       28],
    [29890885,       64],
    [14357494,       60],
    [ 6403602,       43],
    [96538844,      372],
    [ 7675132,       67],
    [34807493,       78],
    [46215428,       75],
    [ 5437889,       20],
    [16674835,       50],
    [17382472,       20],
    [ 5437889,       20],
    [  313111,        0]])

 # Targets (1-15 of 470)
 array([2739267,   20539,   18304,   16052,   25391,   19444,   61550,
      94392,   75934,   52997,   67485,   92263,   37672, 6748523,
      20710])

There are 470 rows each in the actual data.

I'm using:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    data,
    targets,
    test_size=.25,
    random_state=42
)
model = LogisticRegression(max_iter=5000)  # 5000 until I learn how to scale
model.fit(x_train, y_train)

# If I run model.predict(...), I get 30000, no matter what
model.predict([[50000, 50]])

Here's some actual shell output (see the score, also):

In [134]: model.predict([[16000000, 5]])
Out[134]: array([30000])

In [135]: model.predict([[150000, 20]])
Out[135]: array([30000])

In [138]: model.predict(np.array([[21500000000000, 2]]))
Out[138]: array([30000])

In [139]: model.predict(np.array([[21500000000000, -444444]]))
Out[139]: array([30000])

In [140]: model.predict([[2150000, 250]])
Out[140]: array([30000])

In [141]: model.score(x_test, y_test)
Out[141]: 0.009345794392523364

In [144]: model.n_iter_
Out[144]: array([4652], dtype=int32)

Here's some metadata from the model (via .__dict__):

{'penalty': 'l2',
 'dual': False,
 'tol': 0.0001,
 'C': 1.0,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'class_weight': None,
 'random_state': None,
 'solver': 'lbfgs',
 'max_iter': 5000,
 'multi_class': 'auto',
 'verbose': 0,
 'warm_start': False,
 'n_jobs': None,
 'l1_ratio': None,
 'n_features_in_': 2,
 ...

There's definitely more of a relationship between the two data points than a score of .0093 would indicate. After all, we currently use the same data to make these predictions by hand. Do you know what I'm doing wrong, or under what circumstances it would be normal for a trained model to always return the same answer?

Answer 1

Score: 1

LogisticRegression is a classifier: it predicts a discrete (binary or multi-class) target, so it treats every distinct value in your targets as a separate class.

Since your target seems to be a continuous variable, you should use LinearRegression instead:

from sklearn.linear_model import LinearRegression

model = LinearRegression() 

More info on this post.
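
To make the difference concrete, here is a minimal sketch on synthetic data. The arrays `X` and `y` are made up to resemble the question's two features and are not the asker's real dataset; the coefficients and noise level are arbitrary assumptions. It shows that integer targets like these look like a multi-class problem with almost one class per row, while a regressor returns predictions that actually vary with the input.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.utils.multiclass import type_of_target

rng = np.random.default_rng(0)

# Synthetic stand-in for the question's data (assumption, not the real dataset):
# one large-valued feature and one small-valued feature.
X = np.column_stack([
    rng.integers(100_000, 50_000_000, size=200),
    rng.integers(0, 400, size=200),
]).astype(float)

# Continuous target with an arbitrary linear relationship plus noise.
y = 0.002 * X[:, 0] + 150.0 * X[:, 1] + rng.normal(0, 1_000, size=200)

# Integer-valued targets, as in the question, are seen as class labels.
y_int = y.astype(int)
n_classes = np.unique(y_int).size
print(type_of_target(y_int), "with", n_classes, "distinct classes")
print(type_of_target(y))  # float targets are recognised as continuous

# A regressor models the target as a number, so predictions vary with the input.
reg = LinearRegression().fit(X, y)
preds = reg.predict([[16_000_000, 5], [150_000, 20], [2_150_000, 250]])
print(preds)
print("R^2 on the training data:", reg.score(X, y))
```

With hundreds of classes and only one or two rows per class, a classifier has very little to learn from, which is consistent with the near-zero accuracy reported in the question.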

Answer 2

Score: 1

Your target value is a continuous variable, so you need to use a regression model. For a simple regression model, you can use linear regression or a decision tree. If you want a more complex model, you can use a random forest or gradient boosting. If you use the linear regression model, don't forget to scale your features with a standard scaler or a robust scaler.
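
The options above can be compared side by side. The following is only a sketch under stated assumptions: the data is synthetic (470 rows to match the question's size, with an arbitrary linear relationship), and the hyperparameters such as `max_depth=5` and `n_estimators=50` are illustrative choices, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Synthetic stand-in for the 470-row dataset (assumption, not the real data).
X = np.column_stack([
    rng.integers(100_000, 50_000_000, size=470),
    rng.integers(0, 400, size=470),
]).astype(float)
y = 0.002 * X[:, 0] + 150.0 * X[:, 1] + rng.normal(0, 1_000, size=470)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Scalers live inside a pipeline so they are fitted on the training split only.
models = {
    "linear + StandardScaler": make_pipeline(StandardScaler(), LinearRegression()),
    "linear + RobustScaler": make_pipeline(RobustScaler(), LinearRegression()),
    "decision tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}

# For regressors, score() returns R^2 on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = model.score(x_test, y_test)
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

Note that for plain least-squares linear regression, scaling does not change the predictions; it matters more for regularized or gradient-based models. Tree-based models are insensitive to feature scale, which is why they are left unscaled here.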

huangapple
  • Posted on February 24, 2023, 11:36:41
  • Please keep the link to this article when reposting: https://go.coder-hub.com/75552397.html