Autoencoder for dimensionality reduction of binary dataset for clustering
Question
Given a binary dataset (derived from yes/no questionnaire responses) intended for subsequent unsupervised cluster analysis, with significant multicollinearity and a total of 31 features by ~50,000 observations (subjects), it appeared sensible to reduce the dimensionality of the input data before performing cluster analysis. I attempted this with an autoencoder, but surprisingly, the clusters (derived through k-medoids, chosen for the nominal nature of the underlying data and its greater stability against outliers/noise compared to e.g. k-means) were actually more distinct and clearly distinguished when using MCA, with a clear maximum Silhouette coefficient at k = 5.
Given that MCA with the first 5 PCs (explaining just ~75% of the variance, chosen through a scree plot) was used before I attempted the autoencoder approach, it surprises me that an autoencoder did a worse job at extracting meaningful features at the same bottleneck dimension. The problem with the current autoencoder appears to be that the data in the bottleneck layer, which is used in the clustering, is distorted...
Below is the code I used to construct the autoencoder. Could it be that the hyperparameters are off, or some detail of the overall architecture? Random search over the number of layers, learning rate, batch size, layer dimensions, etc. has not yielded anything substantial. Loss is similar between the training and validation sets, and levels out at around 0.15 after ~40 epochs.
I've also tried to identify studies where such data has been used, but have not found anything useful.
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import matplotlib.pyplot as plt
input_dim = 31
layer_1_dim = 18
layer_2_dim = 10
bottleneck_dim = 5
learning_rate = 0.001
epochs = 100
batch_size = 300
# `data` is assumed to be an (n_observations, 31) binary NumPy array defined earlier
# split data into training and validation sets
training_n = int(data.shape[0] * 0.8)
train_data = data[:training_n, :]
val_data = data[training_n:, :]
# define autoencoder initializer
initializer = tf.keras.initializers.GlorotUniform()
# autoencoder layers
input_layer = Input(shape=(input_dim,))
layer = Dense(layer_1_dim, activation='relu')(input_layer)
layer = Dense(layer_2_dim, activation='relu', kernel_initializer=initializer)(layer)
layer = Dense(bottleneck_dim, activation='relu', kernel_initializer=initializer, name="bottleneck-output")(layer)
layer = Dense(layer_2_dim, activation='relu', kernel_initializer=initializer)(layer)
layer = Dense(layer_1_dim, activation='relu', kernel_initializer=initializer)(layer)
output_layer = Dense(input_dim, activation='sigmoid', kernel_initializer=initializer)(layer)
# define and compile autoencoder model
autoencoder = Model(inputs=input_layer, outputs=output_layer)
optimizer = Adam(learning_rate=learning_rate)
autoencoder.compile(optimizer=optimizer, loss='binary_crossentropy')
# train the autoencoder model
history = autoencoder.fit(train_data, train_data, epochs=epochs, batch_size=batch_size, validation_data=(val_data, val_data))
# get bottleneck output
bottleneck_autoencoder = Model(inputs=autoencoder.input, outputs=autoencoder.get_layer('bottleneck-output').output)
bottleneck_output = bottleneck_autoencoder.predict(data)
# plot training and validation loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Autoencoder loss (binary cross-entropy)')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper right')
plt.savefig('output/embedding.png')
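For context, the downstream clustering described above might look like the following sketch. It assumes KMedoids from the scikit-learn-extra package and the bottleneck_output array produced by the code; the actual clustering code was not included in the question, so the k range and random_state are illustrative.
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import silhouette_score
# Hypothetical reconstruction of the k-medoids + Silhouette evaluation the
# question describes; bottleneck_output is the (n_observations, 5) embedding.
scores = {}
for k in range(2, 11):
    labels = KMedoids(n_clusters=k, random_state=0).fit_predict(bottleneck_output)
    scores[k] = silhouette_score(bottleneck_output, labels)
best_k = max(scores, key=scores.get)
print(f"Best k by Silhouette coefficient: {best_k} ({scores[best_k]:.3f})")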
Answer 1
Score: 3
So the problem with multicollinearity is that, on top of not allowing for meaningful inference, it hampers your convergence rate.
This implies that, all else being equal, if your data is highly multicollinear you need significantly more data to achieve convergence.
Depending on your objective (pure prediction vs. inference), you might want to orthogonalize your input (see here and here) and then feed the residuals to your autoencoder, or you might want to choose a more sensible activation function or go with a sparse autoencoder.
That said, if you are aiming for good predictions and do not necessarily care about inference, I would suggest trying the last two steps (activation function and sparse autoencoder) first; a sketch of the sparse-autoencoder option follows below.
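A minimal sketch of the sparse-autoencoder suggestion, reusing the question's architecture; the L1 activity penalty on the bottleneck and its weight (1e-4) are illustrative assumptions, not part of the original answer.
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
# Same encoder/decoder shape as the question, but with an L1 activity penalty
# on the bottleneck so its activations are pushed toward sparse codes.
# The penalty weight 1e-4 is a starting point to tune, not a recommendation.
input_layer = Input(shape=(31,))
layer = Dense(18, activation='relu')(input_layer)
layer = Dense(10, activation='relu')(layer)
layer = Dense(5, activation='relu',
              activity_regularizer=regularizers.l1(1e-4),
              name='bottleneck-output')(layer)
layer = Dense(10, activation='relu')(layer)
layer = Dense(18, activation='relu')(layer)
output_layer = Dense(31, activation='sigmoid')(layer)
sparse_autoencoder = Model(inputs=input_layer, outputs=output_layer)
sparse_autoencoder.compile(optimizer='adam', loss='binary_crossentropy')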
Answer 2
Score: 1
Since the main problem here is multicollinearity, you can try a dimensionality reduction method that accounts for non-linear relationships within the data. One common example is kernel PCA (kernelPCA).
You can then explore different kernel functions; I've usually had interesting results with just the linear kernel. A sketch follows below.
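A minimal sketch of that suggestion using scikit-learn's KernelPCA; choosing 5 components to mirror the question's bottleneck dimension is an assumption, and data is the binary array from the question.
from sklearn.decomposition import KernelPCA
# Kernel PCA embedding of the binary data; n_components=5 mirrors the
# autoencoder bottleneck. Swap kernel for 'rbf', 'poly', 'cosine', etc.
kpca = KernelPCA(n_components=5, kernel='linear')
embedding = kpca.fit_transform(data)  # (n_observations, 5) array for clustering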