Getting a negative prediction after min-max scaling the price in a linear regression


Question

I understand your concern about scaling and descaling your data. It looks like there are some issues with the scaling process in your code, so here are corrected code snippets for scaling and descaling the data, followed by a fix for the input prediction.

Scaling and Descaling Code:

import numpy as np
import matplotlib.pyplot as plt

# Assuming you have x_train, y_train (original data), w_final, and b_final

# Min-max scaling for y_train
min_y_train = np.min(y_train)
max_y_train = np.max(y_train)
y_train_scaled = (y_train - min_y_train) / (max_y_train - min_y_train)

# Descale the predictions using the min-max scaling parameters (min and max)
predictions_scaled = w_final * x_train + b_final
predictions_descaled = predictions_scaled * (max_y_train - min_y_train) + min_y_train

# Plot the original x_train and descaled y_train
plt.scatter(x_train, y_train, label='Original Data')
# Plot the predicted values (descaled)
plt.plot(x_train, predictions_descaled, color='red', label='Predicted Values (Descaled)')
plt.xlabel('x_train')
plt.ylabel('y_train')
plt.title('Original y_train vs. Predicted Values (Descaled)')
plt.legend()
plt.show()

This code correctly scales and descales your data for plotting.
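
If you prefer not to hand-roll the transform, scikit-learn's MinMaxScaler does the same min-max scaling and remembers the parameters for you. This is just an optional sketch, assuming the same y_train and predictions_scaled as above:

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the original target; it stores data_min_ and data_max_ internally
scaler = MinMaxScaler()
y_train_scaled = scaler.fit_transform(y_train.reshape(-1, 1)).ravel()

# Invert the transform on the scaled predictions to get back to price units
predictions_descaled = scaler.inverse_transform(predictions_scaled.reshape(-1, 1)).ravel()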

Input Prediction Code:

# Assuming x_train was scaled with xmin = np.min(x_train) and xmax = np.max(x_train)

# Input your own x for prediction
input_x = 1  # Replace with your desired input

# Scale the input
scaled_input = (input_x - xmin) / (xmax - xmin)

# Make the prediction
prediction_scaled = w_final * scaled_input + b_final

# Descale the prediction using the min-max scaling parameters (min and max)
prediction_descaled = prediction_scaled * (max_y_train - min_y_train) + min_y_train

print("Descaled Prediction:", prediction_descaled)

This code allows you to input your own value for x and obtain a descaled prediction.

Make sure you have correctly scaled your data before using these code snippets. This should help you avoid issues with excessively large or small predictions.
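
As a quick sanity check before plotting or predicting, the scaled target should lie in [0, 1] and the round trip should recover the original prices. A minimal sketch, assuming y_train, y_train_scaled, min_y_train, and max_y_train from the snippet above:

# Every scaled value should fall inside [0, 1]
assert y_train_scaled.min() >= 0 and y_train_scaled.max() <= 1

# Descaling should reproduce the original target (up to floating-point error)
y_roundtrip = y_train_scaled * (max_y_train - min_y_train) + min_y_train
assert np.allclose(y_roundtrip, y_train)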

English:

I was trying to make my mean squared error cost lower by scaling the target feature, primarily because it reaches values on the order of 1e10.

I use this dataset from Kaggle to calculate land price (X = LT, Y = Harga): https://www.kaggle.com/datasets/wisnuanggara/daftar-harga-rumah

The code I used to load the data into NumPy arrays:

import os
import openpyxl
from openpyxl import Workbook
import numpy as np

wb = openpyxl.load_workbook('DATA RUMAH.xlsx')
ws = wb.active

y_train_data = np.array([])
x_train_data = np.array([])

def get_x_train():
    x_train = np.array([])  # Initialize x_train as a local variable
    for x in range(2, 1011):
        data = ws.cell(row=x, column=5).value
        x_train = np.append(x_train, data)
    return x_train

def get_y_train():
    y_train = np.array([])  # Initialize y_train as a local variable
    for y in range(2, 1011):
        data = ws.cell(row=y, column=3).value
        y_train = np.append(y_train, data)
    return y_train
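
(For reference, the same two columns could also be read in one call with pandas.read_excel; the column names 'LT' and 'HARGA' below are assumptions, so adjust them to whatever the header row of DATA RUMAH.xlsx actually contains.)

import pandas as pd

# read_excel uses openpyxl under the hood for .xlsx files
df = pd.read_excel('DATA RUMAH.xlsx')
x_train = df['LT'].to_numpy(dtype=float)     # land area (assumed column name)
y_train = df['HARGA'].to_numpy(dtype=float)  # price (assumed column name)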

Linear regression & Gradient Descent Code:

import math, copy
import numpy as np
import matplotlib.pyplot as plt
from excltool import *
import pandas as pd
import seaborn as sns
%matplotlib inline

# Load our data set
x_train = get_x_train()#features
y_train = get_y_train()  #target value

mean = np.mean(y_train)
min = np.min(y_train)
max = np.max(y_train)
y_train = np.array([(i - min) / (max - min) for i in y_train])

#Function to calculate the cost
def compute_cost(x, y, w, b):
   
    m = x.shape[0] 
    cost = np.float64(0)
    
    for i in range(m):
        f_wb = (w * x[i] + b)
        cost = (cost + (f_wb - y[i])**2)
    total_cost = np.float64(1 / (2 * m) * cost)

    return total_cost

def compute_gradient(x, y, w, b): 
    """
    Computes the gradient for linear regression 
    Args:
      x (ndarray (m,)): Data, m examples 
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters  
    Returns
      dj_dw (scalar): The gradient of the cost w.r.t. the parameters w
      dj_db (scalar): The gradient of the cost w.r.t. the parameter b     
     """
    
    # Number of training examples
    m = x.shape[0]    
    dj_dw = np.float64(0)
    dj_db = np.float64(0)
    
    for i in range(m):  
        f_wb = (w * x[i] + b) 
        dj_dw_i = ((f_wb - y[i]) * x[i])
        dj_db_i = (f_wb - y[i]) 
        dj_db += (dj_db_i)
        dj_dw += (dj_dw_i) 
    dj_dw = dj_dw / m 
    dj_db = dj_db / m 
        
    return dj_dw, dj_db

def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function):
    """
    Performs gradient descent to fit w,b. Updates w,b by taking 
    num_iters gradient steps with learning rate alpha
    
    Args:
        x (ndarray (m,))  : Data, m examples 
        y (ndarray (m,))  : target values
        w_in,b_in (scalar): initial values of model parameters  
        alpha (float):     Learning rate
        num_iters (int):   number of iterations to run gradient descent
        cost_function:     function to call to produce cost
        gradient_function: function to call to produce gradient
      
    Returns:
        w (scalar): Updated value of parameter after running gradient descent
        b (scalar): Updated value of parameter after running gradient descent
        J_history (List): History of cost values
        p_history (list): History of parameters [w,b] 
    """
    
    # Specify data type as np.float64 for w, b
    w = np.float64(w_in)
    b = np.float64(b_in)
    
    # An array to store cost J and w's at each iteration primarily for graphing later
    J_history = []
    p_history = []
    
    for i in range(num_iters):
        # Calculate the gradient and update the parameters using gradient_function
        dj_dw, dj_db = gradient_function(x, y, w , b)     

        # Update Parameters using equation (3) above
        b = b - alpha * dj_db                            
        w = np.float64(w - alpha * dj_dw)                            

        # Save cost J at each iteration
        J_history.append(cost_function(x, y, w, b))
        p_history.append([w, b])

        # Print the cost at regular intervals (about 100 times over the run)
        if i % math.ceil(num_iters/100) == 0:
            print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e}  ",
                  f"w: {w: 0.3e}, b: {b: 0.5e}")
 
    return w, b, J_history, p_history  # Return w and J,w history for graphing

# Initialize parameters with np.float64 data type
w_init = np.float64(0)
b_init = np.float64(0)

# Some gradient descent settings
iterations = 1000000
tmp_alpha = np.float64(1.0e-10)

# Run gradient descent
w_final, b_final, J_hist, p_hist = (gradient_descent(x_train, y_train, w_init, b_init, tmp_alpha,
                                                    iterations, compute_cost, compute_gradient))

# Print the result
print(f"(w, b) found by gradient descent: ({w_final}, {b_final})")

I got the last result as:

Iteration 950000: Cost 2.24e-03  dj_dw: -9.486e-03, dj_db:  3.615e-03   w:  4.850e-04, b:  9.52354e-07
Iteration 960000: Cost 2.24e-03  dj_dw: -8.682e-03, dj_db:  3.617e-03   w:  4.850e-04, b:  9.48737e-07
Iteration 970000: Cost 2.24e-03  dj_dw: -7.946e-03, dj_db:  3.619e-03   w:  4.850e-04, b:  9.45119e-07
Iteration 980000: Cost 2.24e-03  dj_dw: -7.273e-03, dj_db:  3.621e-03   w:  4.850e-04, b:  9.41499e-07
Iteration 990000: Cost 2.24e-03  dj_dw: -6.657e-03, dj_db:  3.623e-03   w:  4.850e-04, b:  9.37877e-07
(w, b) found by gradient descent: (0.00048503387319465645, 9.34254408473887e-07)
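
As a sanity check on these numbers (a sketch, not part of the original code), the ordinary least-squares line for the same data (unscaled x, scaled y) can be computed in closed form with np.polyfit and compared with the gradient-descent result; with alpha = 1e-10, a million iterations may still be far from converged, and a large gap between the two fits would show that:

# np.polyfit returns [slope, intercept] for degree 1
w_ls, b_ls = np.polyfit(x_train, y_train, 1)
print("least squares:   ", w_ls, b_ls)
print("gradient descent:", w_final, b_final)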

descaled the y_train:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Assuming you have x_train, y_train (already scaled), w_final, and b_final

# Descale the y_train using the min-max scaling parameters (min and max)
min_y_train = np.min(y_train)
max_y_train = np.max(y_train)
y_train_descaled = y_train * (max - min) + min

# Compute the predicted values based on the descaled y_train
predictions = w_final * x_train + b_final

# Descale the predictions using the min-max scaling parameters (min and max)
predictions_descaled = predictions * (max - min) + min

# Plot the original x_train and descaled y_train
plt.scatter(x_train, y_train_descaled, label='Original Data')
# Plot the predicted values
plt.plot(x_train, predictions_descaled, color='red', label='Predicted Values')
plt.xlabel('x_train')
plt.ylabel('y_train')
plt.title('Descaled y_train vs. Predicted Values')
plt.legend()
plt.show()

[Predicted plot]

made my input prediction:

prediction = w_final*1 + b_final
prediction_descaled = prediction * (max - min) + min
print(prediction_descaled)

which results in -0.11096116854066342, which shouldn't even be negative. If I don't scale, everything works fine, but my cost reaches 9e18; I wanted to make it lower so that I can present it better.

I think I messed up the descaling process.


EDIT:
I also tried scaling my x:

mean = np.mean(x_train)
xmin = np.min(x_train)
xmax = np.max(x_train)
x_train = np.array([(i - xmin) / (xmax - xmin) for i in x_train])

and then plotted the descaled version of both:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Assuming you have x_train, y_train (already scaled), w_final, and b_final

# Descale the y_train using the min-max scaling parameters (min and max)
y_train_descaled = y_train * (ymax - ymin) + ymin
x_train_descaled = x_train * (xmax - xmin) + xmin

# Compute the predicted values based on the descaled y_train
predictions = w_final * x_train + b_final

# Descale the predictions using the min-max scaling parameters (min and max)
predictions_descaled = predictions * (ymax - ymin) + ymin

# Plot the original x_train and descaled y_train
plt.scatter(x_train_descaled, y_train_descaled, label='Original Data')
# Plot the predicted values
plt.plot(x_train_descaled, predictions_descaled, color='red', label='Predicted Values')
plt.xlabel('x_train')
plt.ylabel('y_train')
plt.title('Descaled y_train vs. Predicted Values')
plt.legend()
plt.show()

[Descaled plot]

and tried to make a prediction with my own input x = 1:

scaled_input = (1 - xmin) / (xmax - xmin)

prediction = w_final*scaled_input + b_final
prediction_descaled = prediction * (ymax - ymin) + ymin
print(prediction_descaled)

but got a result of 92720747, which is too big: the non-scaled code produced a more reasonable 33047038.

Answer 1

Score: 0

Looking at the plot you provided, it seems like any x value below ~220 will result in a y value below 0. The negative value you're getting matches with what the red line shows.

There may be an issue with scaling. You're scaling y, but not x. I think it's usually more important to scale x. Assuming x is a matrix of shape (samples, features), you can scale x as follows:

# Scaling training features
# x_train.ptp() calculates x.max() - x.min()
x_train_scaled = (x_train - x_train.min(axis=0)) / x_train.ptp(axis=0)

# fit model using x_train_scaled...

This will scale all the columns (features) in one go. Then, when you want to make a prediction with a new x, scale the new x using the training x values:

x_pred_scaled = (x_pred - x_train.min(axis=0)) / x_train.ptp(axis=0)

# y_pred = w X x_pred_scaled + intercept

What I described above is just for x. In the code you provided, you were scaling y as well, and the way you scaled/unscaled y looks correct.
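
Putting the two halves together, a minimal end-to-end sketch (assuming 1-D x_train and y_train, and that w_final, b_final were fit on the scaled data):

# Scaling parameters, taken from the training data only
xmin, xptp = x_train.min(), x_train.ptp()
ymin, yptp = y_train.min(), y_train.ptp()

x_scaled = (x_train - xmin) / xptp
y_scaled = (y_train - ymin) / yptp

# ... fit w_final, b_final on (x_scaled, y_scaled) ...

# Predict for a new raw x, then map the prediction back to price units
x_new = 100.0  # example input in the original units
x_new_scaled = (x_new - xmin) / xptp
y_new = (w_final * x_new_scaled + b_final) * yptp + ymin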

If your cost is reaching 9e18 something may be going wrong, like almost dividing by zero. At the last iteration, your cost was down to 2.24e-03 - please clarify where the 9e18 is from.

