why is model.predict(...) always returning the same answer?

Question
I'm trying to use scikit-learn to make predictions based on some client data, to determine a financial benefit estimate based on some answers they give us and based on our historical client projects.
My dataset looks like this:
```
# Data (1-15 of 470)
array(
[[8662824, 34],
[ 7978337, 25],
[ 902219, 28],
[29890885, 64],
[14357494, 60],
[ 6403602, 43],
[96538844, 372],
[ 7675132, 67],
[34807493, 78],
[46215428, 75],
[ 5437889, 20],
[16674835, 50],
[17382472, 20],
[ 5437889, 20],
[ 313111, 0]])
# Targets (1-15 of 470)
array([2739267, 20539, 18304, 16052, 25391, 19444, 61550,
94392, 75934, 52997, 67485, 92263, 37672, 6748523,
    20710])
```
There are 470 rows each in the actual data and targets arrays.
I'm using:
```python
x_train, x_test, y_train, y_test = train_test_split(
data,
targets,
test_size=.25,
random_state=42
)
model = LogisticRegression(max_iter=5000) # 5000 until I learn how to scale
model.fit(x_train, y_train)
# If I run model.predict(...), I get 30000, no matter what
model.predict([[50000, 50]])
```
Here's some actual shell output (note the score, too):
```
In [134]: model.predict([[16000000, 5]])
Out[134]: array([30000])
In [135]: model.predict([[150000, 20]])
Out[135]: array([30000])
In [138]: model.predict(np.array([[21500000000000, 2]]))
Out[138]: array([30000])
In [139]: model.predict(np.array([[21500000000000, -444444]]))
Out[139]: array([30000])
In [140]: model.predict([[2150000, 250]])
Out[140]: array([30000])
In [141]: model.score(x_test, y_test)
Out[141]: 0.009345794392523364
In [144]: model.n_iter_
Out[144]: array([4652], dtype=int32)
```
Here's some metadata from the model (via .__dict__):
```
{'penalty': 'l2',
'dual': False,
'tol': 0.0001,
'C': 1.0,
'fit_intercept': True,
'intercept_scaling': 1,
'class_weight': None,
'random_state': None,
'solver': 'lbfgs',
'max_iter': 5000,
'multi_class': 'auto',
'verbose': 0,
'warm_start': False,
'n_jobs': None,
'l1_ratio': None,
'n_features_in_': 2,
 ...
```
There's definitely more of a relationship between the two data points than a score of .0093 would seem to indicate. After all, we currently use the same data to make these predictions in our heads. Do you know what I'm doing wrong, or in what circumstances it would be normal for a trained model to always return the same answer?
Answer 1
Score: 1
LogisticRegression is for predicting a multi-class discrete target: it treats every distinct value in y_train as a separate class and then predicts whichever class it scores as most probable, which is why it can end up returning the same answer (here 30000) for every input. Since your target is a continuous variable, you should use LinearRegression instead:
```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
```
More info on this post.
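As a minimal sketch of how that would slot into the code from the question, assuming data and targets are the NumPy arrays shown above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# data and targets are the arrays from the question (470 rows each)
x_train, x_test, y_train, y_test = train_test_split(
    data, targets, test_size=0.25, random_state=42
)

model = LinearRegression()
model.fit(x_train, y_train)

# Predictions now vary with the input instead of collapsing to one value
print(model.predict(np.array([[16000000, 5]])))
print(model.predict(np.array([[150000, 20]])))

# For a regressor, score() returns R^2 on the held-out set, not accuracy
print(model.score(x_test, y_test))
```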
Answer 2
Score: 1
Your target value is a continuous variable, so you need to use a regression model. For a simple regression model you can use linear regression or a decision tree; if you want a more complex model, you can use a random forest or gradient boosting. If you use a linear regression model, don't forget to scale your features with a standard scaler or a robust scaler.
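A minimal sketch of that advice, again assuming data and targets are the arrays from the question (the variable names and hyperparameters here are illustrative, not prescribed by the answer):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x_train, x_test, y_train, y_test = train_test_split(
    data, targets, test_size=0.25, random_state=42
)

# Linear regression with scaled features
linear_model = make_pipeline(StandardScaler(), LinearRegression())
linear_model.fit(x_train, y_train)
print(linear_model.score(x_test, y_test))  # R^2 on the held-out set

# Tree ensembles such as a random forest don't require feature scaling
forest = RandomForestRegressor(n_estimators=200, random_state=42)
forest.fit(x_train, y_train)
print(forest.score(x_test, y_test))  # R^2 on the held-out set
```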