Logistic Regression with Mini-Batch Gradient Descent

Question

I developed a gradient descent algorithm, but when I try it on some sklearn examples the results are incorrect, and I do not know how to fix it.
Here is the full algorithm:

First, I have a class for an exception and a function called rendimiento() that computes the accuracy score:

class ClasificadorNoEntrenado(Exception): pass

def rendimiento(clasificador, X, y):
    # Accuracy: fraction of examples whose predicted class matches its label
    aciertos = 0
    total_ejemplos = len(X)

    for i in range(total_ejemplos):
        ejemplo = X[i]
        clasificacion_esperada = y[i]
        clasificacion_obtenida = clasificador.clasifica(ejemplo)

        if clasificacion_obtenida == clasificacion_esperada:
            aciertos += 1

    accuracy = aciertos / total_ejemplos
    return accuracy
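
As an aside, the same score can be computed more compactly; here is a minimal equivalent sketch (the helper name rendimiento_vectorizado is just for illustration, not part of the original code):

import numpy as np

def rendimiento_vectorizado(clasificador, X, y):
    # Classify each row, then compare all predictions to the labels at once;
    # the mean of a boolean array is exactly the accuracy.
    predicciones = np.array([clasificador.clasifica(ejemplo) for ejemplo in X])
    return np.mean(predicciones == np.asarray(y))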

Second, I have a function that computes the sigmoid:

from scipy.special import expit

def sigmoide(x):
    # expit is a numerically stable sigmoid: expit(x) = 1 / (1 + exp(-x))
    return expit(x)

Third, I have the main algorithm:

import numpy as np

class RegresionLogisticaMiniBatch():

    def __init__(self, clases=[0,1], normalizacion=False,
                  rate=0.1, rate_decay=False, batch_tam=64):
    
        self.clases = clases
        self.rate = rate
        self.normalizacion = normalizacion
        self.rate_decay = rate_decay
        self.batch_tam = batch_tam
        self.pesos = None
        self.media = None
        self.desviacion = None
    
    def entrena(self, X, y, n_epochs, reiniciar_pesos=False, pesos_iniciales=None):
        self.X = X
        self.y = y
        self.n_epochs = n_epochs
        
        if reiniciar_pesos or self.pesos is None:
            self.pesos = pesos_iniciales if pesos_iniciales is not None else np.random.uniform(-1, 1, size=X.shape[1])
    
        if self.normalizacion:
            self.media = np.mean(X, axis=0)
            self.desviacion = np.std(X, axis=0)
    
        indices = np.random.permutation(len(X))
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for i in range(0, len(X), self.batch_tam):
            batch_X = X_shuffled[i:i + self.batch_tam]
            batch_y = y_shuffled[i:i + self.batch_tam]
        
            # Compute the logistic function (sigmoid)
            z = np.dot(batch_X, self.pesos)
            y_pred = sigmoide(z)
    
            # Compute the gradient
            error = batch_y - y_pred
            gradiente = np.dot(batch_X.T, error) / len(batch_X)
    
            # Update the weights
            self.pesos += self.rate * gradiente
    
    def clasifica_prob(self, ejemplo):
        if self.pesos is None:
            raise ClasificadorNoEntrenado("El clasificador no ha sido entrenado")
    
        if self.normalizacion:
            ejemplo = (ejemplo - self.media) / self.desviacion
    
        probabilidad = sigmoide(np.dot(ejemplo, self.pesos))
        if probabilidad >= 0.5:
            return 1
        else:
            return 0
        # return {'no': 1 - probabilidad, 'si': probabilidad}
    
    def clasifica(self, ejemplo):
        probabilidad = self.clasifica_prob(ejemplo)
        return probabilidad

Finally, I try it on an sklearn dataset to see whether it is correct:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
cancer = load_breast_cancer()

X_cancer, y_cancer = cancer.data, cancer.target

lr_cancer = RegresionLogisticaMiniBatch(rate=0.1, rate_decay=True, normalizacion=True)

Xe_cancer, Xt_cancer, ye_cancer, yt_cancer = train_test_split(X_cancer, y_cancer)

lr_cancer.entrena(Xe_cancer, ye_cancer, 10000)

print(rendimiento(lr_cancer, Xe_cancer, ye_cancer))

print(rendimiento(lr_cancer, Xt_cancer, yt_cancer))

But the results are very random and low.
I tried to develop a logistic regression with gradient descent and mini-batches, but it does not predict correctly. I hope someone can help me fix this.


Answer 1

Score: 1

There are two problems here.

The first problem is normalization. The self.normalizacion variable controls whether normalization is used. However, when it is set, it affects prediction but not training.

If the weights of your model are learned on a non-normalized dataset, they will perform very poorly on a normalized dataset.

I suggest changing your code as follows so that it normalizes during training:

        # inside entrena
        if self.normalizacion:
            self.media = np.mean(X, axis=0)
            self.desviacion = np.std(X, axis=0)
            X = (X - self.media) / self.desviacion
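
Note that with this change, the per-example normalization in clasifica_prob now uses the same statistics the weights were trained with, so training and prediction are consistent.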

The second problem is epochs. Training for several epochs means the model re-trains on the same dataset many times, so that the randomly initialized weights move closer to their ideal values. You have an n_epochs variable, but it does not do anything.

Therefore, I suggest using two loops: the outer loop iterates over epochs, and the inner loop iterates over the mini-batches.

        # inside entrena
        for j in range(n_epochs):
            for i in range(0, len(X), self.batch_tam):
                # same loop body as before

With these changes I can get 99% accuracy on the training set and 95% accuracy on the test set.
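
Putting both fixes together, a minimal sketch of the corrected entrena could look like this (same class attributes as above; reshuffling inside each epoch is an extra refinement I am assuming, not something the original code did):

    def entrena(self, X, y, n_epochs, reiniciar_pesos=False, pesos_iniciales=None):
        if reiniciar_pesos or self.pesos is None:
            self.pesos = pesos_iniciales if pesos_iniciales is not None else np.random.uniform(-1, 1, size=X.shape[1])

        if self.normalizacion:
            self.media = np.mean(X, axis=0)
            self.desviacion = np.std(X, axis=0)
            # Fix 1: train on the normalized data, not just predict on it
            # (a guard such as np.where(self.desviacion == 0, 1, self.desviacion)
            # may be needed if a feature is constant)
            X = (X - self.media) / self.desviacion

        # Fix 2: outer loop over epochs so the data is seen n_epochs times
        for _ in range(n_epochs):
            # Reshuffle each epoch so the mini-batches differ (optional refinement)
            indices = np.random.permutation(len(X))
            X_shuffled, y_shuffled = X[indices], y[indices]
            for i in range(0, len(X), self.batch_tam):
                batch_X = X_shuffled[i:i + self.batch_tam]
                batch_y = y_shuffled[i:i + self.batch_tam]
                y_pred = sigmoide(np.dot(batch_X, self.pesos))
                # Gradient of the log-likelihood; ascent step on the weights
                gradiente = np.dot(batch_X.T, batch_y - y_pred) / len(batch_X)
                self.pesos += self.rate * gradiente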

