Dimensionality reduction - Pyspark
Question
My objective is to find the visual similarity between double-byte characters when they are rendered in a particular font. For instance, I want to determine whether 伊 looks more similar to 達 or to 市. This has to be done for 13,108 characters.
To do this, we converted all the characters into grayscale images using the draw library in Python, then passed each image through the convolutional layers of VGG-16 to obtain a feature set.
The VGG-16 convolutional output has 512 x 7 x 7 = 25,088 elements per character. We collated these into one file, so we now have around 13,108 rows and 25,088 columns. My aim is to run clustering on them to find the visual similarity among all the characters, and to do that I need to reduce the number of variables (columns).
What would be the best way to do this, and roughly how many variables (columns) should I expect to retain for the final model?
Answer 1
Score: 1
I suggest using an autoencoder neural network, whose objective is to reconstruct its input at its output. The network has encoder layers that reduce the dimensionality, a bottleneck layer, and decoder layers that reconstruct the input from the bottleneck layer. [Figure: diagram of an autoencoder neural network]
You can use the bottleneck layer activations as your new variables (columns) and then run clustering on them to find the visual similarity among all the characters. A big advantage of this approach is that, unlike linear dimensionality reduction methods such as PCA, an autoencoder performs nonlinear operations, which can lead to better results.
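To make the idea concrete, here is a minimal sketch of a one-hidden-layer autoencoder written in plain NumPy. It is only an illustration under several assumptions that are not in the original answer: the function names (`train_autoencoder`, `encode`, `most_similar`), the bottleneck size of 8, the tanh activation, the learning rate, and the small synthetic matrix standing in for the real 13,108 x 25,088 feature file are all hypothetical choices. For data of the real size you would more likely use a deep learning framework with mini-batches, and the bottleneck width would need to be tuned rather than taken from this sketch.

```python
import numpy as np


def train_autoencoder(X, n_code=8, n_epochs=1000, lr=0.05, seed=0):
    """Train a one-hidden-layer autoencoder with full-batch gradient descent.

    Returns the learned parameters and the reconstruction loss per epoch.
    All hyperparameter defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # Small random weights; biases start at zero.
    W_enc = rng.normal(0.0, 0.1, size=(n_features, n_code))
    b_enc = np.zeros(n_code)
    W_dec = rng.normal(0.0, 0.1, size=(n_code, n_features))
    b_dec = np.zeros(n_features)
    losses = []
    for _ in range(n_epochs):
        # Forward pass: nonlinear encoder (bottleneck), linear decoder.
        code = np.tanh(X @ W_enc + b_enc)
        recon = code @ W_dec + b_dec
        err = recon - X
        losses.append(float(np.mean(err ** 2)))
        # Backward pass for the mean squared reconstruction error.
        grad_recon = 2.0 * err / err.size
        grad_W_dec = code.T @ grad_recon
        grad_b_dec = grad_recon.sum(axis=0)
        # Chain rule through tanh: d tanh(z) / dz = 1 - tanh(z)^2.
        grad_pre = (grad_recon @ W_dec.T) * (1.0 - code ** 2)
        grad_W_enc = X.T @ grad_pre
        grad_b_enc = grad_pre.sum(axis=0)
        # Plain gradient descent step.
        W_enc -= lr * grad_W_enc
        b_enc -= lr * grad_b_enc
        W_dec -= lr * grad_W_dec
        b_dec -= lr * grad_b_dec
    return (W_enc, b_enc, W_dec, b_dec), losses


def encode(X, params):
    """Map rows of X to their bottleneck activations (the reduced columns)."""
    W_enc, b_enc, _, _ = params
    return np.tanh(X @ W_enc + b_enc)


def most_similar(codes, query_index, top_k=3):
    """Indices of the top_k rows closest to the query by cosine similarity."""
    norms = np.maximum(np.linalg.norm(codes, axis=1, keepdims=True), 1e-12)
    unit = codes / norms
    sims = unit @ unit[query_index]
    sims[query_index] = -np.inf  # exclude the query itself
    return np.argsort(-sims)[:top_k]


if __name__ == "__main__":
    # Synthetic stand-in for the VGG-16 feature matrix (assumption):
    # 200 "characters" with 64 features driven by 4 hidden factors.
    rng = np.random.default_rng(42)
    latent = rng.normal(size=(200, 4))
    X = np.tanh(latent @ rng.normal(size=(4, 64)))
    X += 0.05 * rng.normal(size=X.shape)
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each column

    params, losses = train_autoencoder(X)
    codes = encode(X, params)
    print("loss at first and last epoch:", losses[0], losses[-1])
    print("reduced shape:", codes.shape)
    print("rows most similar to row 0:", most_similar(codes, 0))
```

The rows of `codes` play the role of the reduced columns: they can be passed to any clustering routine, or compared directly as in `most_similar` to ask which characters sit closest to a given one. Comparing the reconstruction loss across a few bottleneck sizes is one way to judge how many columns are worth keeping.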