Getting a negative prediction after min-max scaling the price in a linear regression


Question

I understand your concern about scaling and descaling your data. It looks like there are some issues with the scaling process in your code, so here are corrected code snippets for scaling and descaling the data, followed by a fix for the input prediction.

Scaling and Descaling Code:

import numpy as np
import matplotlib.pyplot as plt

# Assuming you have x_train, y_train (original data), w_final, and b_final

# Min-max scaling for y_train
min_y_train = np.min(y_train)
max_y_train = np.max(y_train)
y_train_scaled = (y_train - min_y_train) / (max_y_train - min_y_train)

# Descale the predictions using the min-max scaling parameters (min and max)
predictions_scaled = w_final * x_train + b_final
predictions_descaled = predictions_scaled * (max_y_train - min_y_train) + min_y_train

# Plot the original x_train and descaled y_train
plt.scatter(x_train, y_train, label='Original Data')
# Plot the predicted values (descaled)
plt.plot(x_train, predictions_descaled, color='red', label='Predicted Values (Descaled)')
plt.xlabel('x_train')
plt.ylabel('y_train')
plt.title('Original y_train vs. Predicted Values (Descaled)')
plt.legend()
plt.show()

This code correctly scales and descales your data for plotting.
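
If you prefer not to hand-roll the transform, scikit-learn's MinMaxScaler does the same min-max scaling and remembers the parameters for you. This is just an optional sketch, assuming the same y_train and predictions_scaled as above:

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the original target; it stores data_min_ and data_max_ internally
scaler = MinMaxScaler()
y_train_scaled = scaler.fit_transform(y_train.reshape(-1, 1)).ravel()

# Invert the transform on the scaled predictions to get back to price units
predictions_descaled = scaler.inverse_transform(predictions_scaled.reshape(-1, 1)).ravel()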

Input Prediction Code:

# Assuming x_train was scaled with xmin = np.min(x_train) and xmax = np.max(x_train)

# Input your own x for prediction
input_x = 1  # Replace with your desired input

# Scale the input
scaled_input = (input_x - xmin) / (xmax - xmin)

# Make the prediction
prediction_scaled = w_final * scaled_input + b_final

# Descale the prediction using the min-max scaling parameters (min and max)
prediction_descaled = prediction_scaled * (max_y_train - min_y_train) + min_y_train

print("Descaled Prediction:", prediction_descaled)

This code allows you to input your own value for x and obtain a descaled prediction.

Make sure you have correctly scaled your data before using these code snippets. This should help you avoid issues with excessively large or small predictions.
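
As a quick sanity check before plotting or predicting, the scaled target should lie in [0, 1] and the round trip should recover the original prices. A minimal sketch, assuming y_train, y_train_scaled, min_y_train, and max_y_train from the snippet above:

# Every scaled value should fall inside [0, 1]
assert y_train_scaled.min() >= 0 and y_train_scaled.max() <= 1

# Descaling should reproduce the original target (up to floating-point error)
y_roundtrip = y_train_scaled * (max_y_train - min_y_train) + min_y_train
assert np.allclose(y_roundtrip, y_train)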

English:

I was trying to make my mean squared error cost lower by scaling the target feature, primarily because it reaches values on the order of 1e10.

I use this dataset from Kaggle to calculate land price (X = LT, Y = Harga): https://www.kaggle.com/datasets/wisnuanggara/daftar-harga-rumah

The code I used to load the data into NumPy arrays:

import os
import openpyxl
from openpyxl import Workbook
import numpy as np

wb = openpyxl.load_workbook('DATA RUMAH.xlsx')
ws = wb.active

y_train_data = np.array([])
x_train_data = np.array([])

def get_x_train():
    x_train = np.array([])  # Initialize x_train as a local variable
    for x in range(2, 1011):
        data = ws.cell(row=x, column=5).value
        x_train = np.append(x_train, data)
    return x_train

def get_y_train():
    y_train = np.array([])  # Initialize y_train as a local variable
    for y in range(2, 1011):
        data = ws.cell(row=y, column=3).value
        y_train = np.append(y_train, data)
    return y_train
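
(For reference, the same two columns could also be read in one call with pandas.read_excel; the column names 'LT' and 'HARGA' below are assumptions, so adjust them to whatever the header row of DATA RUMAH.xlsx actually contains.)

import pandas as pd

# read_excel uses openpyxl under the hood for .xlsx files
df = pd.read_excel('DATA RUMAH.xlsx')
x_train = df['LT'].to_numpy(dtype=float)     # land area (assumed column name)
y_train = df['HARGA'].to_numpy(dtype=float)  # price (assumed column name)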

Linear regression & Gradient Descent Code:

import math, copy
import numpy as np
import matplotlib.pyplot as plt
from excltool import *
import pandas as pd
import seaborn as sns
%matplotlib inline

# Load our data set
x_train = get_x_train()#features
y_train = get_y_train()  #target value

mean = np.mean(y_train)
min = np.min(y_train)
max = np.max(y_train)
y_train = np.array([(i - min) / (max - min) for i in y_train])

#Function to calculate the cost
def compute_cost(x, y, w, b):
   
    m = x.shape[0] 
    cost = np.float64(0)
    
    for i in range(m):
        f_wb = (w * x[i] + b)
        cost = (cost + (f_wb - y[i])**2)
    total_cost = np.float64(1 / (2 * m) * cost)

    return total_cost

def compute_gradient(x, y, w, b): 
    """
    Computes the gradient for linear regression 
    Args:
      x (ndarray (m,)): Data, m examples 
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters  
    Returns
      dj_dw (scalar): The gradient of the cost w.r.t. the parameters w
      dj_db (scalar): The gradient of the cost w.r.t. the parameter b     
     """
    
    # Number of training examples
    m = x.shape[0]    
    dj_dw = np.float64(0)
    dj_db = np.float64(0)
    
    for i in range(m):  
        f_wb = (w * x[i] + b) 
        dj_dw_i = ((f_wb - y[i]) * x[i])
        dj_db_i = (f_wb - y[i]) 
        dj_db += (dj_db_i)
        dj_dw += (dj_dw_i) 
    dj_dw = dj_dw / m 
    dj_db = dj_db / m 
        
    return dj_dw, dj_db

def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function):
    """
    Performs gradient descent to fit w,b. Updates w,b by taking 
    num_iters gradient steps with learning rate alpha
    
    Args:
        x (ndarray (m,))  : Data, m examples 
        y (ndarray (m,))  : target values
        w_in,b_in (scalar): initial values of model parameters  
        alpha (float):     Learning rate
        num_iters (int):   number of iterations to run gradient descent
        cost_function:     function to call to produce cost
        gradient_function: function to call to produce gradient
      
    Returns:
        w (scalar): Updated value of parameter after running gradient descent
        b (scalar): Updated value of parameter after running gradient descent
        J_history (List): History of cost values
        p_history (list): History of parameters [w,b] 
    """
    
    # Specify data type as np.float64 for w, b
    w = np.float64(w_in)
    b = np.float64(b_in)
    
    # An array to store cost J and w's at each iteration primarily for graphing later
    J_history = []
    p_history = []
    
    for i in range(num_iters):
        # Calculate the gradient and update the parameters using gradient_function
        dj_dw, dj_db = gradient_function(x, y, w , b)     

        # Update Parameters using equation (3) above
        b = b - alpha * dj_db                            
        w = np.float64(w - alpha * dj_dw)                            

        # Save cost J at each iteration
        J_history.append(cost_function(x, y, w, b))
        p_history.append([w, b])

        # Print the cost at regular intervals (about 100 times over the run)
        if i % math.ceil(num_iters/100) == 0:
            print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e}  ",
                  f"w: {w: 0.3e}, b: {b: 0.5e}")
 
    return w, b, J_history, p_history  # Return w and J,w history for graphing

# Initialize parameters with np.float64 data type
w_init = np.float64(0)
b_init = np.float64(0)

# Some gradient descent settings
iterations = 1000000
tmp_alpha = np.float64(1.0e-10)

# Run gradient descent
w_final, b_final, J_hist, p_hist = (gradient_descent(x_train, y_train, w_init, b_init, tmp_alpha,
                                                    iterations, compute_cost, compute_gradient))

# Print the result
print(f"(w, b) found by gradient descent: ({w_final}, {b_final})")

I got the last result as:

Iteration 950000: Cost 2.24e-03  dj_dw: -9.486e-03, dj_db:  3.615e-03   w:  4.850e-04, b:  9.52354e-07
Iteration 960000: Cost 2.24e-03  dj_dw: -8.682e-03, dj_db:  3.617e-03   w:  4.850e-04, b:  9.48737e-07
Iteration 970000: Cost 2.24e-03  dj_dw: -7.946e-03, dj_db:  3.619e-03   w:  4.850e-04, b:  9.45119e-07
Iteration 980000: Cost 2.24e-03  dj_dw: -7.273e-03, dj_db:  3.621e-03   w:  4.850e-04, b:  9.41499e-07
Iteration 990000: Cost 2.24e-03  dj_dw: -6.657e-03, dj_db:  3.623e-03   w:  4.850e-04, b:  9.37877e-07
(w, b) found by gradient descent: (0.00048503387319465645, 9.34254408473887e-07)
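
As a sanity check on these numbers (a sketch, not part of the original code), the ordinary least-squares line for the same data (unscaled x, scaled y) can be computed in closed form with np.polyfit and compared with the gradient-descent result; with alpha = 1e-10, a million iterations may still be far from converged, and a large gap between the two fits would show that:

# np.polyfit returns [slope, intercept] for degree 1
w_ls, b_ls = np.polyfit(x_train, y_train, 1)
print("least squares:   ", w_ls, b_ls)
print("gradient descent:", w_final, b_final)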

descaled the y_train:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Assuming you have x_train, y_train (already scaled), w_final, and b_final

# Descale the y_train using the min-max scaling parameters (min and max)
min_y_train = np.min(y_train)
max_y_train = np.max(y_train)
y_train_descaled = y_train * (max - min) + min

# Compute the predicted values based on the descaled y_train
predictions = w_final * x_train + b_final

# Descale the predictions using the min-max scaling parameters (min and max)
predictions_descaled = predictions * (max - min) + min

# Plot the original x_train and descaled y_train
plt.scatter(x_train, y_train_descaled, label='Original Data')
# Plot the predicted values
plt.plot(x_train, predictions_descaled, color='red', label='Predicted Values')
plt.xlabel('x_train')
plt.ylabel('y_train')
plt.title('Descaled y_train vs. Predicted Values')
plt.legend()
plt.show()

[Predicted plot]

made my input prediction:

prediction = w_final*1 + b_final
prediction_descaled = prediction * (max - min) + min
print(prediction_descaled)

which results in -0.11096116854066342, which shouldn't even be negative. If I don't scale, everything works fine, but my cost reaches 9e18; I wanted to make it lower so that I can present it better.

I think I messed up the descaling process.


EDIT:
I also tried scaling my x:

mean = np.mean(x_train)
xmin = np.min(x_train)
xmax = np.max(x_train)
x_train = np.array([(i - xmin) / (xmax - xmin) for i in x_train])

and then plotted the descaled version of both:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Assuming you have x_train, y_train (already scaled), w_final, and b_final

# Descale the y_train using the min-max scaling parameters (min and max)
y_train_descaled = y_train * (ymax - ymin) + ymin
x_train_descaled = x_train * (xmax - xmin) + xmin

# Compute the predicted values based on the descaled y_train
predictions = w_final * x_train + b_final

# Descale the predictions using the min-max scaling parameters (min and max)
predictions_descaled = predictions * (ymax - ymin) + ymin

# Plot the original x_train and descaled y_train
plt.scatter(x_train_descaled, y_train_descaled, label='Original Data')
# Plot the predicted values
plt.plot(x_train_descaled, predictions_descaled, color='red', label='Predicted Values')
plt.xlabel('x_train')
plt.ylabel('y_train')
plt.title('Descaled y_train vs. Predicted Values')
plt.legend()
plt.show()

[Descaled plot]

and tried to make a prediction with my own input x = 1:

scaled_input = (1 - xmin) / (xmax - xmin)

prediction = w_final*scaled_input + b_final
prediction_descaled = prediction * (ymax - ymin) + ymin
print(prediction_descaled)

but got a result of 92720747, which is too big: the non-scaled code produced a more reasonable 33047038.

Answer 1

Score: 0

Looking at the plot you provided, it seems like any x value below ~220 will result in a y value below 0. The negative value you're getting matches with what the red line shows.

There may be an issue with scaling. You're scaling y, but not x. I think it's usually more important to scale x. Assuming x is a matrix of shape (samples, features), you can scale x as follows:

# Scaling training features
# x_train.ptp() calculates x.max() - x.min()
x_train_scaled = (x_train - x_train.min(axis=0)) / x_train.ptp(axis=0)

# fit model using x_train_scaled...

This will scale all the columns (features) in one go. Then, when you want to make a prediction with a new x, scale the new x using the training x values:

x_pred_scaled = (x_pred - x_train.min(axis=0)) / x_train.ptp(axis=0)

# y_pred = w X x_pred_scaled + intercept

What I described above is just for x. In the code you provided, you were scaling y as well, and the way you scaled/unscaled y looks correct.
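
Putting the two halves together, a minimal end-to-end sketch (assuming 1-D x_train and y_train, and that w_final, b_final were fit on the scaled data):

# Scaling parameters, taken from the training data only
xmin, xptp = x_train.min(), x_train.ptp()
ymin, yptp = y_train.min(), y_train.ptp()

x_scaled = (x_train - xmin) / xptp
y_scaled = (y_train - ymin) / yptp

# ... fit w_final, b_final on (x_scaled, y_scaled) ...

# Predict for a new raw x, then map the prediction back to price units
x_new = 100.0  # example input in the original units
x_new_scaled = (x_new - xmin) / xptp
y_new = (w_final * x_new_scaled + b_final) * yptp + ymin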

If your cost is reaching 9e18 something may be going wrong, like almost dividing by zero. At the last iteration, your cost was down to 2.24e-03 - please clarify where the 9e18 is from.

