Queries regarding the working of "Word2Vec" vectorizer to convert text to numeric representation
Question
I worked on a predictive (classification) model. I used Word2Vec to convert the data in the textual columns to numeric form, after which I ran the machine learning algorithms.
I have the following doubts regarding the working of Word2Vec:
- When I check the vector representation of each word of a sentence, I get an array of 100 numbers/vectors. What do all these numbers mean? I know that each number corresponds to a dimension, but what is a dimension in this context (with regard to the vector space)?
- When training the Word2Vec model on a 'neural network', each word in a sentence is fed as input to the input layer, and the words are one-hot encoded. So the vector representations of the words being fed would look something like [1 0 0 0 0 0 0] and [0 0 1 0 0 0 0]. These vectors are initialized with random weights, and the weighted sum of the inputs is transmitted to the next layer (the hidden layer). My doubt is: what is the point of assigning random weights to these word vectors when the weights that are multiplied by the 0s will remain 0 anyway? How is the neural network transmitting information with such sparse data? (See the sketch right after this list.)
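To make the second doubt concrete, here is a minimal NumPy sketch of that setup (the vocabulary size of 7 and the hidden width of 5 are made up for illustration): multiplying a one-hot vector by the randomly initialized weight matrix simply selects one row of that matrix, so the "weighted sum" reaching the hidden layer is that word's current embedding.

```python
import numpy as np

V, D = 7, 5                      # toy vocabulary size and hidden-layer width
rng = np.random.default_rng(42)
W = rng.normal(size=(V, D))      # input-to-hidden weights, randomly initialized

x = np.zeros(V)
x[2] = 1.0                       # one-hot encoding of the 3rd vocabulary word

h = x @ W                        # weighted sum fed to the hidden layer
print(np.allclose(h, W[2]))     # True: the product just picks out row 2 of W
```

The weights multiplied by the zeros are not wasted: they are the rows belonging to the other vocabulary words, and each of those rows is selected whenever its word appears as input.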
Note: I referred to many sources, and these questions are asked based on my understanding. Do let me know if I have interpreted any concept incorrectly. Thank you.
Answer 1
Score: 1
Regarding the dimensions of the word2vec embeddings, I don't know if you can say anything about what a dimension represents. These are vectors in the embedding space, and I would assume that each dimension can correspond to some semantic concept (or a combination of semantic concepts), but it is very hard, or even impossible, to say what each dimension represents. Maybe one dimension indicates whether the word is a verb, another whether it is a noun, etc., but these are just speculations.
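As a hedged illustration of this point, here is a sketch using gensim 4.x (the toy sentences and the word "dog" are invented for the example): you can print a word's 100-dimensional vector, but no single position in it has a standalone, human-readable meaning; the vector only becomes meaningful when compared with other vectors, e.g. via cosine similarity.

```python
from gensim.models import Word2Vec

# Tiny made-up corpus; a real model would need far more text.
sentences = [
    ["the", "dog", "barks"],
    ["the", "cat", "meows"],
    ["a", "dog", "chases", "a", "cat"],
]

model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, seed=1)

vec = model.wv["dog"]
print(vec.shape)                          # (100,) -- one float per dimension
print(vec[:5])                            # individual numbers carry no meaning
print(model.wv.similarity("dog", "cat"))  # comparisons between vectors do
```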
Regarding the second question, for each word the weights that are multiplied by 0 will be slightly different, since the position of the 1 in the one-hot encoding changes from word to word. Thus, at some point in time every weight will have been multiplied by 1. If you have a big training dataset that contains your whole vocabulary, this will happen during training; otherwise it may happen during testing. Moreover, each neuron has a bias term, so even if the input is sparse, the first hidden layer will not be sparse anymore. You could therefore interpret the first layer (the hidden layer in your question) as a mapping from the one-hot encoding space to a dense embedding space, where the sparsity is removed.
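To see why the zero entries are not a problem in practice, consider this small NumPy sketch (the random `delta` is an illustrative stand-in for the backpropagated error signal): for a one-hot input, the gradient of the loss with respect to the input weights is nonzero only in the row of the active word, so as training iterates over a corpus that covers the vocabulary, every row of the matrix eventually gets updated.

```python
import numpy as np

V, D = 7, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(V, D))          # input-to-hidden weights
updated = np.zeros(V, dtype=bool)

for i in range(V):                   # pretend each vocabulary word occurs once
    x = np.zeros(V)
    x[i] = 1.0
    delta = rng.normal(size=D)       # stand-in for the error signal dL/dh
    grad_W = np.outer(x, delta)      # nonzero only in row i: the active word
    W -= 0.01 * grad_W               # the SGD step touches just that word's row
    updated |= grad_W.any(axis=1)

print(updated.all())                 # True: every row was trained at least once
```

Real word2vec implementations exploit exactly this structure: instead of multiplying by a sparse one-hot vector, they index directly into the weight matrix (an embedding lookup), so the sparsity never costs anything.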