
Getting a negative prediction after min-max scaling the price in a linear regression

Question

I understand your concerns about scaling and descaling your data. It appears there are some issues with the scaling process in your code. To address them, I'll provide corrected code snippets for both scaling and descaling, and I'll also address the input-prediction issue.

Scaling and Descaling Code:

    import numpy as np
    import matplotlib.pyplot as plt

    # Assuming you have x_train, y_train (original data), w_final, and b_final
    # Min-max scaling for y_train
    min_y_train = np.min(y_train)
    max_y_train = np.max(y_train)
    y_train_scaled = (y_train - min_y_train) / (max_y_train - min_y_train)

    # Compute the scaled predictions, then descale them using the
    # min-max scaling parameters (min and max)
    predictions_scaled = w_final * x_train + b_final
    predictions_descaled = predictions_scaled * (max_y_train - min_y_train) + min_y_train

    # Plot the original x_train and y_train
    plt.scatter(x_train, y_train, label='Original Data')
    # Plot the predicted values (descaled)
    plt.plot(x_train, predictions_descaled, color='red', label='Predicted Values (Descaled)')
    plt.xlabel('x_train')
    plt.ylabel('y_train')
    plt.title('Original y_train vs. Predicted Values (Descaled)')
    plt.legend()
    plt.show()

This code correctly scales and descales your data for plotting.

Input Prediction Code:

    # Assuming you have already scaled your x_train
    # Input your own x for prediction
    input_x = 1  # Replace with your desired input

    # Scale the input
    scaled_input = (input_x - xmin) / (xmax - xmin)

    # Make the prediction
    prediction_scaled = w_final * scaled_input + b_final

    # Descale the prediction using the min-max scaling parameters (min and max)
    prediction_descaled = prediction_scaled * (max_y_train - min_y_train) + min_y_train
    print("Descaled Prediction:", prediction_descaled)

This code allows you to input your own value for x and obtain a descaled prediction.

Make sure you have correctly scaled your data before using these code snippets. This should help you avoid issues with excessively large or small predictions.
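If you'd rather not track the min and max values by hand, scikit-learn's MinMaxScaler performs the same min-max transform and stores the parameters for you. This is only a sketch using the variable names from above; it assumes x_train and y_train are 1-D NumPy arrays and that w_final, b_final come from your fit on the scaled data:

    from sklearn.preprocessing import MinMaxScaler
    import numpy as np

    # MinMaxScaler expects 2-D input, so reshape the 1-D vectors to single columns
    x_scaler = MinMaxScaler()
    y_scaler = MinMaxScaler()
    x_scaled = x_scaler.fit_transform(x_train.reshape(-1, 1))
    y_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1))

    # ... fit w_final and b_final on x_scaled / y_scaled ...

    # Invert the y transform to get predictions back in the original price units
    predictions_scaled = w_final * x_scaled + b_final
    predictions = y_scaler.inverse_transform(predictions_scaled)

    # A new input must be scaled with the *training* scaler before predicting
    new_x_scaled = x_scaler.transform(np.array([[1.0]]))
    new_prediction = y_scaler.inverse_transform(w_final * new_x_scaled + b_final)
    print(new_prediction)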


I was trying to lower my mean squared error cost by scaling the target feature, primarily because it reaches values on the order of 1e10.

I use this dataset from Kaggle to calculate land price, with X = LT and Y = Harga: https://www.kaggle.com/datasets/wisnuanggara/daftar-harga-rumah

The code I used to read the data into NumPy arrays:

    import os
    import openpyxl
    from openpyxl import Workbook
    import numpy as np

    wb = openpyxl.load_workbook('DATA RUMAH.xlsx')
    ws = wb.active

    y_train_data = np.array([])
    x_train_data = np.array([])

    def get_x_train():
        x_train = np.array([])  # Initialize x_train as a local variable
        for x in range(2, 1011):
            data = ws.cell(row=x, column=5).value
            x_train = np.append(x_train, data)
        return x_train

    def get_y_train():
        y_train = np.array([])  # Initialize y_train as a local variable
        for y in range(2, 1011):
            data = ws.cell(row=y, column=3).value
            y_train = np.append(y_train, data)
        return y_train

Linear regression & Gradient Descent Code:

    import math, copy
    import numpy as np
    import matplotlib.pyplot as plt
    from excltool import *
    import pandas as pd
    import seaborn as sns
    %matplotlib inline

    # Load our data set
    x_train = get_x_train()  # features
    y_train = get_y_train()  # target value

    mean = np.mean(y_train)
    min = np.min(y_train)
    max = np.max(y_train)
    y_train = np.array([(i - min) / (max - min) for i in y_train])

    # Function to calculate the cost
    def compute_cost(x, y, w, b):
        m = x.shape[0]
        cost = np.float64(0)
        for i in range(m):
            f_wb = (w * x[i] + b)
            cost = (cost + (f_wb - y[i])**2)
        total_cost = np.float64(1 / (2 * m) * cost)
        return total_cost

    def compute_gradient(x, y, w, b):
        """
        Computes the gradient for linear regression
        Args:
          x (ndarray (m,)): Data, m examples
          y (ndarray (m,)): target values
          w,b (scalar)    : model parameters
        Returns:
          dj_dw (scalar): The gradient of the cost w.r.t. the parameter w
          dj_db (scalar): The gradient of the cost w.r.t. the parameter b
        """
        # Number of training examples
        m = x.shape[0]
        dj_dw = np.float64(0)
        dj_db = np.float64(0)
        for i in range(m):
            f_wb = (w * x[i] + b)
            dj_dw_i = ((f_wb - y[i]) * x[i])
            dj_db_i = (f_wb - y[i])
            dj_db += (dj_db_i)
            dj_dw += (dj_dw_i)
        dj_dw = dj_dw / m
        dj_db = dj_db / m
        return dj_dw, dj_db

    def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function):
        """
        Performs gradient descent to fit w,b. Updates w,b by taking
        num_iters gradient steps with learning rate alpha
        Args:
          x (ndarray (m,))  : Data, m examples
          y (ndarray (m,))  : target values
          w_in,b_in (scalar): initial values of model parameters
          alpha (float)     : Learning rate
          num_iters (int)   : number of iterations to run gradient descent
          cost_function     : function to call to produce cost
          gradient_function : function to call to produce gradient
        Returns:
          w (scalar)       : Updated value of parameter after running gradient descent
          b (scalar)       : Updated value of parameter after running gradient descent
          J_history (list) : History of cost values
          p_history (list) : History of parameters [w,b]
        """
        # Specify data type as np.float64 for w, b
        w = np.float64(w_in)
        b = np.float64(b_in)
        # Arrays to store cost J and parameters at each iteration, primarily for graphing later
        J_history = []
        p_history = []
        for i in range(num_iters):
            # Calculate the gradient and update the parameters using gradient_function
            dj_dw, dj_db = gradient_function(x, y, w, b)
            # Update parameters
            b = b - alpha * dj_db
            w = np.float64(w - alpha * dj_dw)
            # Save cost J and parameters at each iteration
            J_history.append(cost_function(x, y, w, b))
            p_history.append([w, b])
            # Print the cost at 100 evenly spaced intervals over the run
            if i % math.ceil(num_iters / 100) == 0:
                print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                      f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e} ",
                      f"w: {w: 0.3e}, b: {b: 0.5e}")
        return w, b, J_history, p_history  # Return w, b and the J, p history for graphing

    # Initialize parameters with np.float64 data type
    w_init = np.float64(0)
    b_init = np.float64(0)
    # Some gradient descent settings
    iterations = 1000000
    tmp_alpha = np.float64(1.0e-10)
    # Run gradient descent
    w_final, b_final, J_hist, p_hist = gradient_descent(x_train, y_train, w_init, b_init, tmp_alpha,
                                                        iterations, compute_cost, compute_gradient)
    # Print the result
    print(f"(w, b) found by gradient descent: ({w_final}, {b_final})")

The last iterations produced:

    Iteration 950000: Cost 2.24e-03 dj_dw: -9.486e-03, dj_db: 3.615e-03 w: 4.850e-04, b: 9.52354e-07
    Iteration 960000: Cost 2.24e-03 dj_dw: -8.682e-03, dj_db: 3.617e-03 w: 4.850e-04, b: 9.48737e-07
    Iteration 970000: Cost 2.24e-03 dj_dw: -7.946e-03, dj_db: 3.619e-03 w: 4.850e-04, b: 9.45119e-07
    Iteration 980000: Cost 2.24e-03 dj_dw: -7.273e-03, dj_db: 3.621e-03 w: 4.850e-04, b: 9.41499e-07
    Iteration 990000: Cost 2.24e-03 dj_dw: -6.657e-03, dj_db: 3.623e-03 w: 4.850e-04, b: 9.37877e-07
    (w, b) found by gradient descent: (0.00048503387319465645, 9.34254408473887e-07)

Then I descaled y_train:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression

    # Assuming you have x_train, y_train (already scaled), w_final, and b_final
    # Descale the y_train using the min-max scaling parameters (min and max)
    min_y_train = np.min(y_train)
    max_y_train = np.max(y_train)
    y_train_descaled = y_train * (max - min) + min

    # Compute the predicted values based on the descaled y_train
    predictions = w_final * x_train + b_final
    # Descale the predictions using the min-max scaling parameters (min and max)
    predictions_descaled = predictions * (max - min) + min

    # Plot the original x_train and descaled y_train
    plt.scatter(x_train, y_train_descaled, label='Original Data')
    # Plot the predicted values
    plt.plot(x_train, predictions_descaled, color='red', label='Predicted Values')
    plt.xlabel('x_train')
    plt.ylabel('y_train')
    plt.title('Descaled y_train vs. Predicted Values')
    plt.legend()
    plt.show()

Predicted Plot

Then I made my input prediction:

    prediction = w_final*1 + b_final
    prediction_descaled = prediction * (max - min) + min
    print(prediction_descaled)

which results in -0.11096116854066342, where it shouldn't even be negative. If I don't scale, everything works fine, but my cost reaches 9e18; I wanted to make it lower so that I can present it better.

I think I messed up in the descaling process.
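One way to test the descaling step on its own (just a sketch, reusing the min and max that were computed before y_train was overwritten with the scaled values) is to round-trip the scaled targets and compare them with the originals:

    # Sanity-check the y scaling/descaling round trip.
    # min and max here are the values taken from the original y_train,
    # before it was replaced by the scaled version.
    y_roundtrip = y_train * (max - min) + min        # y_train is the scaled array at this point
    print(np.allclose(y_roundtrip, get_y_train()))   # should print True if the descaling is right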


EDIT: I also tried scaling my x:

    mean = np.mean(x_train)
    xmin = np.min(x_train)
    xmax = np.max(x_train)
    x_train = np.array([(i - xmin) / (xmax - xmin) for i in x_train])

and then plotted the descaled version of both:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression

    # Assuming you have x_train, y_train (already scaled), w_final, and b_final
    # Descale the y_train using the min-max scaling parameters (min and max)
    y_train_descaled = y_train * (ymax - ymin) + ymin
    x_train_descaled = x_train * (xmax - xmin) + xmin

    # Compute the predicted values based on the descaled y_train
    predictions = w_final * x_train + b_final
    # Descale the predictions using the min-max scaling parameters (min and max)
    predictions_descaled = predictions * (ymax - ymin) + ymin

    # Plot the original x_train and descaled y_train
    plt.scatter(x_train_descaled, y_train_descaled, label='Original Data')
    # Plot the predicted values
    plt.plot(x_train_descaled, predictions_descaled, color='red', label='Predicted Values')
    plt.xlabel('x_train')
    plt.ylabel('y_train')
    plt.title('Descaled y_train vs. Predicted Values')
    plt.legend()
    plt.show()

Descaled_Plot

and tried to input a prediction for my own x (1):

    scaled_input = (1 - xmin) / (xmax - xmin)
    prediction = w_final*scaled_input + b_final
    prediction_descaled = prediction * (ymax - ymin) + ymin
    print(prediction_descaled)

but got a result of 92720747, which is too big, because the non-scaled code output a more reasonable 33047038.

Answer 1

Score: 0

Looking at the plot you provided, it seems like any x value below ~220 will result in a y value below 0. The negative value you're getting matches what the red line shows.

There may be an issue with scaling. You're scaling y, but not x. I think it's usually more important to scale x. Assuming x is a matrix of shape (samples, features), you can scale x as follows:

    # Scaling training features
    # x_train.ptp() calculates x.max() - x.min()
    x_train_scaled = (x_train - x_train.min(axis=0)) / x_train.ptp(axis=0)
    # fit model using x_train_scaled...

This will scale all the columns (features) in one go. Then, when you want to make a prediction with a new x, scale the new x using the training x values:

    x_pred_scaled = (x_pred - x_train.min(axis=0)) / x_train.ptp(axis=0)
    # y_pred = w X x_pred_scaled + intercept

What I described above is just for x. In the code you provided, you were scaling y as well, and the way you scaled/unscaled y looks correct.
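Putting the two halves together, here is a minimal end-to-end sketch. It uses synthetic stand-in data and an ordinary least-squares fit via np.polyfit instead of your gradient-descent loop, so the numbers are illustrative only; the point is that x and y are scaled the same way going in, and only the y-scaling is inverted coming out:

    import numpy as np

    # Synthetic stand-in data (hypothetical values, just for illustration)
    rng = np.random.default_rng(0)
    x_train = rng.uniform(50, 500, size=200)                   # e.g. land area
    y_train = 3.0e7 * x_train + rng.normal(0, 5e8, size=200)   # e.g. price, order 1e9-1e10

    # Min-max scale both x and y, keeping the training parameters
    xmin, xspan = x_train.min(), x_train.max() - x_train.min()
    ymin, yspan = y_train.min(), y_train.max() - y_train.min()
    x_s = (x_train - xmin) / xspan
    y_s = (y_train - ymin) / yspan

    # Fit y_s ~ w * x_s + b (plain least squares instead of gradient descent)
    w, b = np.polyfit(x_s, y_s, 1)

    # Predict for a new x: scale it with the *training* x parameters,
    # then invert the y scaling on the result
    x_new = 220.0
    y_pred = (w * (x_new - xmin) / xspan + b) * yspan + ymin
    print(f"Predicted price for x = {x_new}: {y_pred:,.0f}")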

If your cost is reaching 9e18 something may be going wrong, like almost dividing by zero. At the last iteration, your cost was down to 2.24e-03 - please clarify where the 9e18 is from.
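One quick way to see whether the cost ever spikes during training is to plot the history that gradient_descent already returns (J_hist in your code); just a sketch:

    import matplotlib.pyplot as plt

    # Plot the recorded cost per iteration on a log scale so a jump toward
    # 1e18-scale values is easy to spot
    plt.plot(J_hist)
    plt.yscale('log')
    plt.xlabel('Iteration')
    plt.ylabel('Cost')
    plt.show()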

