Logistic Regression with mini-batch gradient descent
Question
I developed a gradient descent algorithm, but when I try it on some sklearn examples the results are incorrect, and I do not know how to fix it.
This is the full algorithm:
First, I have an exception class and a function called rendimiento() that computes the score:
class ClasificadorNoEntrenado(Exception): pass

def rendimiento(clasificador, X, y):
    aciertos = 0
    total_ejemplos = len(X)
    for i in range(total_ejemplos):
        ejemplo = X[i]
        clasificacion_esperada = y[i]
        clasificacion_obtenida = clasificador.clasifica(ejemplo)
        if clasificacion_obtenida == clasificacion_esperada:
            aciertos += 1
    accuracy = aciertos / total_ejemplos
    return accuracy
Second, I have a function that calculates the sigmoid:
import numpy as np
from scipy.special import expit

def sigmoide(x):
    return expit(x)
Third, I have the main algorithm:
class RegresionLogisticaMiniBatch():

    def __init__(self, clases=[0, 1], normalizacion=False,
                 rate=0.1, rate_decay=False, batch_tam=64):
        self.clases = clases
        self.rate = rate
        self.normalizacion = normalizacion
        self.rate_decay = rate_decay
        self.batch_tam = batch_tam
        self.pesos = None
        self.media = None
        self.desviacion = None

    def entrena(self, X, y, n_epochs, reiniciar_pesos=False, pesos_iniciales=None):
        self.X = X
        self.y = y
        self.n_epochs = n_epochs
        if reiniciar_pesos or self.pesos is None:
            self.pesos = pesos_iniciales if pesos_iniciales is not None else np.random.uniform(-1, 1, size=X.shape[1])
        if self.normalizacion:
            self.media = np.mean(X, axis=0)
            self.desviacion = np.std(X, axis=0)
        indices = np.random.permutation(len(X))
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for i in range(0, len(X), self.batch_tam):
            batch_X = X_shuffled[i:i + self.batch_tam]
            batch_y = y_shuffled[i:i + self.batch_tam]
            # Compute logistic function (sigmoid)
            z = np.dot(batch_X, self.pesos)
            y_pred = sigmoide(z)
            # Compute gradient
            error = batch_y - y_pred
            gradiente = np.dot(batch_X.T, error) / len(batch_X)
            # Update weights
            self.pesos += self.rate * gradiente

    def clasifica_prob(self, ejemplo):
        if self.pesos is None:
            raise ClasificadorNoEntrenado("El clasificador no ha sido entrenado")
        if self.normalizacion:
            ejemplo = (ejemplo - self.media) / self.desviacion
        probabilidad = sigmoide(np.dot(ejemplo, self.pesos))
        if probabilidad >= 0.5:
            return 1
        else:
            return 0
        #return {'no': 1 - probabilidad, 'si': probabilidad}

    def clasifica(self, ejemplo):
        probabilidad = self.clasifica_prob(ejemplo)
        return probabilidad
Finally, I check whether it works on an sklearn dataset:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_cancer, y_cancer = cancer.data, cancer.target
lr_cancer = RegresionLogisticaMiniBatch(rate=0.1, rate_decay=True, normalizacion=True)
Xe_cancer, Xt_cancer, ye_cancer, yt_cancer = train_test_split(X_cancer, y_cancer)
lr_cancer.entrena(Xe_cancer, ye_cancer, 10000)
print(rendimiento(lr_cancer, Xe_cancer, ye_cancer))
print(rendimiento(lr_cancer, Xt_cancer, yt_cancer))
But the results are very random and low.
I tried to develop a logistic regression with mini-batch gradient descent, but it does not predict correctly. I hope someone can help me fix this.
Answer 1
Score: 1
There are two problems that I see here.
The first problem is normalization. The self.normalizacion variable controls whether normalization is used. However, when it is set, it affects prediction but not training: entrena computes the mean and standard deviation but never applies them to X.
If the weights of your model are learned on a non-normalized dataset, they will perform very poorly on a normalized dataset.
I suggest changing your code like this to normalize during training:
# within entrena
if self.normalizacion:
    self.media = np.mean(X, axis=0)
    self.desviacion = np.std(X, axis=0)
    X = (X - self.media) / self.desviacion
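As a rough illustration of the same idea outside the class, the sketch below computes the statistics from the training data only and reuses them for any later data. The helper names fit_scaler and apply_scaler, the synthetic data, and the guard against zero standard deviation are assumptions made for this example, not part of the original code.

```python
import numpy as np

def fit_scaler(X_train):
    """Compute per-feature mean and standard deviation from training data only."""
    media = np.mean(X_train, axis=0)
    desviacion = np.std(X_train, axis=0)
    # Assumption: replace zero deviations with 1 so constant columns
    # do not cause a division by zero.
    desviacion = np.where(desviacion == 0, 1.0, desviacion)
    return media, desviacion

def apply_scaler(X, media, desviacion):
    """Normalize X with statistics that were computed on the training set."""
    return (X - media) / desviacion

# Synthetic data with features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[10.0, 500.0], scale=[2.0, 100.0], size=(200, 2))
X_test = rng.normal(loc=[10.0, 500.0], scale=[2.0, 100.0], size=(50, 2))

media, desviacion = fit_scaler(X_train)
X_train_norm = apply_scaler(X_train, media, desviacion)  # used for training
X_test_norm = apply_scaler(X_test, media, desviacion)    # same statistics at prediction time
```

The point is that the weights only ever see data on one scale, both while they are learned and when they are used to classify.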
The second problem is epochs. An epoch is one full pass over the training set; training for many epochs lets the randomly initialized weights move closer to their ideal values. You have an n_epochs variable, but it does not do anything, so your code makes only a single pass over the data.
Therefore, I suggest having two loops. The outer loop iterates over epochs. The inner one iterates over mini-batches.
# within entrena
for j in range(n_epochs):
    for i in range(0, len(X), self.batch_tam):
        # same loop as before
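Written out in full, the two nested loops might look like the standalone sketch below. It is not the original class: the function name entrena_minibatch, the seed parameter, the plain NumPy sigmoid (standing in for scipy's expit), the synthetic data, and the choice to reshuffle at the start of every epoch are all assumptions made for illustration.

```python
import numpy as np

def sigmoide(z):
    # Plain NumPy sigmoid, used here instead of scipy.special.expit
    # so the sketch depends only on NumPy.
    return 1.0 / (1.0 + np.exp(-z))

def entrena_minibatch(X, y, n_epochs, rate=0.1, batch_tam=64, seed=0):
    """Mini-batch training for logistic regression (no bias term, as in the question)."""
    rng = np.random.default_rng(seed)
    pesos = rng.uniform(-1, 1, size=X.shape[1])
    for _ in range(n_epochs):                      # outer loop: epochs
        indices = rng.permutation(len(X))          # assumption: reshuffle every epoch
        X_shuffled, y_shuffled = X[indices], y[indices]
        for i in range(0, len(X), batch_tam):      # inner loop: mini-batches
            batch_X = X_shuffled[i:i + batch_tam]
            batch_y = y_shuffled[i:i + batch_tam]
            y_pred = sigmoide(batch_X @ pesos)
            # Same update as in the question: step along X^T (y - y_pred) / batch size.
            gradiente = batch_X.T @ (batch_y - y_pred) / len(batch_X)
            pesos += rate * gradiente
    return pesos

# Synthetic, already-normalized data whose classes are separated by a line through the origin.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

pesos = entrena_minibatch(X, y, n_epochs=100)
predicciones = (sigmoide(X @ pesos) >= 0.5).astype(float)
accuracy = np.mean(predicciones == y)
```

With a single pass, the weights receive only a handful of updates (one per mini-batch), which is rarely enough to move far from their random starting values.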
With these changes, I get 99% accuracy on the training set and 95% accuracy on the test set.