Autoencoder for dimensionality reduction of binary dataset for clustering

Question

Given a binary dataset (derived from yes/no questionnaire responses) intended for subsequent unsupervised cluster analysis, with significant multicollinearity and a total of 31 features by ~50,000 observations (subjects), it seemed sensible to reduce the dimensionality of the input data before clustering. I attempted this with an autoencoder, but surprisingly, the clusters (derived through k-medoids, chosen because of the nominal nature of the underlying data and its greater stability against outliers/noise compared to, e.g., k-means) were actually more distinct and clearly separated when using MCA, with a clear maximum silhouette coefficient at k = 5.

Given that MCA with the first 5 PCs (explaining just ~75% of the variance, chosen via a scree plot) was used before I attempted the autoencoder approach, it surprises me that the autoencoder did a worse job of extracting meaningful features at the same bottleneck dimension. The problem with the current autoencoder appears to be that the data in the bottleneck layer, which is used for clustering, is distorted.

Below is the code I used to construct the autoencoder. Could it be that the hyperparameters are off, or that some detail of the overall architecture is wrong? Random search over the number of layers, learning rate, batch size, layer dimensions, etc. has not yielded anything substantial. The loss is similar between the training and validation datasets and levels out at around 0.15 after ~40 epochs.

I've also tried to identify studies where such data has been used, but have not found anything useful.

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import matplotlib.pyplot as plt

input_dim = 31
layer_1_dim = 18
layer_2_dim = 10
bottleneck_dim = 5
learning_rate = 0.001
epochs = 100
batch_size = 300

# `data` is assumed to already be a NumPy array of shape (~50000, 31)
# split data into training and validation
training_n = int(data.shape[0] * 0.8)
train_data = data[:training_n, :]
val_data = data[training_n:, :]

# define autoencoder initializer
initializer = tf.keras.initializers.GlorotUniform()

# autoencoder layers
input_layer = Input(shape=(input_dim,))
layer = Dense(layer_1_dim, activation='relu')(input_layer)
layer = Dense(layer_2_dim, activation='relu', kernel_initializer=initializer)(layer)
layer = Dense(bottleneck_dim, activation='relu', kernel_initializer=initializer, name="bottleneck-output")(layer)
layer = Dense(layer_2_dim, activation='relu', kernel_initializer=initializer)(layer)
layer = Dense(layer_1_dim, activation='relu', kernel_initializer=initializer)(layer)
output_layer = Dense(input_dim, activation='sigmoid', kernel_initializer=initializer)(layer)

# define and compile autoencoder model
autoencoder = Model(inputs=input_layer, outputs=output_layer)
optimizer = Adam(learning_rate=learning_rate)
autoencoder.compile(optimizer=optimizer, loss='binary_crossentropy')

# train the autoencoder model
history = autoencoder.fit(train_data, train_data, epochs=epochs, batch_size=batch_size, validation_data=(val_data, val_data))

# get bottleneck output
bottleneck_autoencoder = Model(inputs=autoencoder.input, outputs=autoencoder.get_layer('bottleneck-output').output)
bottleneck_output = bottleneck_autoencoder.predict(data)

# plot loss on the training and validation sets
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Autoencoder loss (binary cross-entropy)')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper right')
plt.savefig('output/embedding.png')
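
For reference, the downstream clustering step described above looks roughly like the following. This is a minimal sketch only, assuming KMedoids from scikit-learn-extra and silhouette_score from scikit-learn; the distance metric and the range of k are illustrative placeholders, not what was actually run.

from sklearn_extra.cluster import KMedoids
from sklearn.metrics import silhouette_score

# Cluster the 5-dimensional bottleneck embedding for a range of k and
# pick the k with the highest silhouette coefficient.
silhouettes = {}
for k in range(2, 11):
    kmedoids = KMedoids(n_clusters=k, metric='euclidean', random_state=42)
    labels = kmedoids.fit_predict(bottleneck_output)
    silhouettes[k] = silhouette_score(bottleneck_output, labels)

best_k = max(silhouettes, key=silhouettes.get)
print(f"Best k by silhouette: {best_k} ({silhouettes[best_k]:.3f})")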

Answer 1

Score: 3

So the problem with multicollinearity is that, on top of not allowing for meaningful inference, it hampers your convergence rate. This implies that, everything else being equal, if your data is highly multicollinear you need significantly more data to achieve convergence.

Depending on what your objective is (pure prediction vs. inference), you might want to either orthogonalize your input (see here and here) and then feed the residuals to your autoencoder, or choose a more sensible activation function, or go with a sparse autoencoder.
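
As an illustration of the orthogonalization idea (my own reading of it, not necessarily what the linked material describes), one simple option is to replace the centered feature matrix with an orthogonal basis of its column space via a reduced QR decomposition and feed that to the autoencoder. Note that the orthogonalized inputs are no longer binary, so the reconstruction loss would then have to change from binary cross-entropy to something like mean squared error.

import numpy as np

# Hypothetical sketch: orthogonalize the centered feature matrix with a
# reduced QR decomposition; the resulting columns are mutually uncorrelated
# (equivalent to Gram-Schmidt residuals up to scaling).
X_centered = data - data.mean(axis=0)
Q, R = np.linalg.qr(X_centered)        # Q: (n_samples, 31) with orthonormal columns
X_orth = Q * np.sqrt(data.shape[0])    # rescale columns to roughly unit variance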

That said, if you are aiming for good predictions and do not necessarily care about inference, I would suggest trying the last two options first (a sketch of both follows the list):

  1. Change the activation function in the bottleneck (use a sigmoid?).
  2. Use a sparse autoencoder (see here, here and here).
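
A minimal sketch of what those two changes could look like in the question's Keras model; the L1 activity-regularization strength of 1e-5 is an arbitrary placeholder, not a tuned recommendation.

from tensorflow.keras import regularizers

# Replace the bottleneck layer definition from the question: sigmoid
# activation plus an L1 activity penalty on the bottleneck outputs, which is
# one common way to build a sparse autoencoder in Keras.
layer = Dense(bottleneck_dim,
              activation='sigmoid',
              kernel_initializer=initializer,
              activity_regularizer=regularizers.l1(1e-5),
              name="bottleneck-output")(layer)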

Answer 2

Score: 1

Since the main problem here is multicollinearity, you can try using a dimensionality reduction method that factors in non-linear relationships within the data. One common example is kernelPCA. You can then explore different kernel functions; I've usually had interesting results with just a linear kernel.
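
A minimal sketch of that suggestion, assuming scikit-learn's KernelPCA, the question's binary matrix data, and 5 components to mirror the autoencoder bottleneck; the kernel choice is illustrative.

from sklearn.decomposition import KernelPCA

# Project the binary data onto 5 components; swap the kernel ('linear',
# 'rbf', 'cosine', ...) to explore non-linear variants.
kpca = KernelPCA(n_components=5, kernel='linear', random_state=42)
embedding = kpca.fit_transform(data)

Keep in mind that kernel PCA forms an n-by-n kernel matrix, so with ~50,000 observations you may need to subsample or use an approximation such as scikit-learn's Nystroem transformer.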
