How to interpret word2vec train output?
Question
Running the code snippet below reports an output of (3, 60). I wonder what exactly it is reporting?
The code is reproducible – just copy it into a notebook cell and run.
from gensim.models import Word2Vec
sent = [['I', 'love', 'cats'], ['Dogs', 'are', 'friendly']]
w2v_model = Word2Vec(sentences=sent, vector_size=100, window=7, min_count=1, sg=1)
w2v_model.train(sent, total_examples=len(sent), epochs=10)
(3, 60)
Answer 1
Score: 1
You seem to be using the Gensim Python library for your Word2Vec, & for internal reasons, the .train() method returns just the tuple (trained_word_count, raw_word_count).
The 1st number happens to be the number of words actually trained on – more on why this is only 3 for you below – & the 2nd is the total raw words passed to the training routines: just your 6 words times 10 epochs. But most users never need to consult these values.
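As a minimal sketch – reusing the question's own code & assuming the Gensim 4.x API it already implies – here's the same toy run with the returned tuple unpacked, so the two counts are visible by name:

from gensim.models import Word2Vec

sent = [['I', 'love', 'cats'], ['Dogs', 'are', 'friendly']]
w2v_model = Word2Vec(sentences=sent, vector_size=100, window=7, min_count=1, sg=1)

# .train() returns the tuple (trained_word_count, raw_word_count)
trained_word_count, raw_word_count = w2v_model.train(
    sent, total_examples=len(sent), epochs=10)

print(raw_word_count)      # 60: 6 raw words times 10 epochs
print(trained_word_count)  # small & varies run-to-run (3 in the question's run),
                           # because frequent-word downsampling is probabilistic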
A better way to monitor progress is to turn on logging to the INFO level – at which point you'll see many log lines of the model's steps & progress. By reading these, & over time, you'll start to recognize signs of a good run, or common errors (as when the totals or elapsed times don't seem consistent with what you thought you were doing).
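For example, a minimal sketch of the standard Python logging setup (the format string is just one common choice, not anything Gensim-specific):

import logging

# Route INFO-level log lines to the console; Gensim logs its vocabulary
# scan, training progress, & end-of-training totals at this level.
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO)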
Your 3 lines are already a bit off, though:

- If you pass your training corpus into the constructor, you don't have to also call .train() – that's already done for you, automatically. So, you're training twice here. (And, if you want epochs=10 for that automatic training, you can specify it in the constructor – see the sketch after this list.)
- With a tiny toy-sized corpus, word2vec learns no useful vectors – and even the reporting is more likely to reveal oddnesses that are irrelevant to more realistic-sized training runs. I recommend never training on anything less than hundreds-of-thousands of words, so that all your experiments reveal useful things about its usual operation, with minimal distractions from artifacts of unrealistic runs.
- In particular, here, since you only have 6 words total, each has a word frequency of ~17% of all words. In any real corpus, such a word would be unrealistically super-frequent – and thus all your words fall victim to what is (in real corpora) a very useful optimization: probabilistic highly-frequent-word dropping (tuned by the sample parameter). This is why, out of 60 words (6 words times 10 epochs), only 3 word occurrences were actually trained at all. (With truly frequent words in an adequately-sized corpus, dropping 19-out-of-20 appearances leaves plenty, & the overall model gets improved by spending relatively more effort on rarer words.)
- min_count=1 is essentially always a bad idea with real word2vec workloads, as words that only appear once can't get good vectors, but do waste model time/state. Ignoring such rare words completely is a standard practice. (If you need vectors for such words, you should find more training material, sufficient to demonstrate their varied uses, in context, repeatedly.)
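As promised above, a minimal sketch of the corrected setup – still on the toy corpus, purely for illustration. epochs=10 moves into the constructor so training happens exactly once, & sample=0 is an assumption added only to demonstrate the downsampling effect (it disables frequent-word dropping, so all 60 raw words would be trained) – not a recommendation for real corpora:

from gensim.models import Word2Vec

sent = [['I', 'love', 'cats'], ['Dogs', 'are', 'friendly']]

# Training happens once, inside the constructor; no separate .train() call.
w2v_model = Word2Vec(
    sentences=sent,
    vector_size=100,
    window=7,
    min_count=1,   # kept only so the toy corpus survives; unwise on real data
    sg=1,
    epochs=10,     # the epochs for the constructor's automatic training
    sample=0,      # disable frequent-word downsampling (illustration only)
)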
Comments