How to interpret word2vec train output?

Question

Running the code snippet below reports an output of (3, 60). I wonder what exactly it is reporting?

The code is reproducible: just copy it into a notebook cell and run.

from gensim.models import Word2Vec

sent = [['I', 'love', 'cats'], ['Dogs', 'are', 'friendly']]
w2v_model = Word2Vec(sentences=sent, vector_size=100, window=7, min_count=1, sg=1)
w2v_model.train(sent, total_examples=len(sent), epochs=10)

(3, 60)

Answer 1

Score: 1

You seem to be using the Gensim Python library for your Word2Vec, & for internal reasons, the .train() method returns just the tuple (trained_word_count, raw_word_count).

The 1st number happens to be the number of words actually trained on – more on why this is only 3 for you below – & the 2nd the total raw words passed to training routines – just your 6 words times 10 epochs. But, most users never need to consult these values.
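
For instance, you can capture that tuple directly. A minimal sketch reusing the question's setup (gensim 4.x assumed; the 1st value will vary run to run, for reasons explained below):

from gensim.models import Word2Vec

sent = [['I', 'love', 'cats'], ['Dogs', 'are', 'friendly']]
w2v_model = Word2Vec(sentences=sent, vector_size=100, window=7, min_count=1, sg=1)

# .train() returns (trained_word_count, raw_word_count); the 1st value is
# not stable here, because frequent-word downsampling is probabilistic.
trained_word_count, raw_word_count = w2v_model.train(
    sent, total_examples=len(sent), epochs=10)
print((trained_word_count, raw_word_count))  # e.g. (3, 60)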

A better way to monitor progress is to turn on logging to the INFO level - at which point you'll see many log lines of the model's steps & progress. By reading these, & over time, you'll start to recognize signs of a good run, or common errors (as when the totals or elapsed times don't seem consistent with what you thought you were doing).
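
One common setup uses only the standard-library logging module; run it before constructing the model:

import logging

# INFO-level output surfaces gensim's vocabulary-scan and per-epoch
# training-progress messages.
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO,
)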

Your 3 lines are already a bit off, though:

  • If you pass your training corpus into the constructor, you don't have to also call .train() - that's already done for you, automatically. So, you're training twice here. (And, if you want epochs=10 for that automatic training, you can specify it in the constructor - see the corrected sketch after this list.)
  • With a tiny toy-sized corpus, word2vec learns no useful vectors – and even the reporting is more likely to reveal oddnesses that are irrelevant to more realistic-sized training runs. I recommend never training on anything less than hundreds-of-thousands of words, so that all your experiments reveal useful things about its usual operation, with minimal distractions from artifacts of unrealistic runs.
  • In particular, here, since you only have 6 words total, each has a word frequency of ~17% of all words. In any real corpus, such a word would be unrealistically super-frequent – and thus all your words fall victim to what is (in real corpora) a very useful optimization: probabilistic highly-frequent-word dropping (tuned by the sample parameter; the arithmetic is worked through just after this list). This is why out of 60 words (6 words times 10 epochs), only 3 word occurrences were actually trained at all. (With truly frequent words in an adequately-sized corpus, dropping 19-out-of-20 appearances leaves plenty, & the overall model gets improved by spending relatively more effort on rarer words.)
  • min_count=1 is essentially always a bad idea with real word2vec workloads, as words that only appear once can't get good vectors, but do waste model time/state. Ignoring such rare words completely is a standard practice. (If you need vectors for such words, you should find more training material sufficient to demonstrate their varied uses, in context, repeatedly.)
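
To make the downsampling arithmetic concrete, here is a small worked sketch. The keep-probability formula mirrors the usual word2vec subsampling rule that the sample parameter controls; treat it as illustrative of the idea, not as gensim's exact code path:

from math import sqrt

# Toy corpus from the question: 6 total words, each appearing once,
# so every word has a frequency of 1/6 (~17%), absurdly high for a real corpus.
sample = 0.001   # gensim's default downsampling threshold
total_words = 6
word_count = 1   # each word appears exactly once

threshold = sample * total_words
keep_prob = (sqrt(word_count / threshold) + 1) * (threshold / word_count)

print(f"per-occurrence keep probability: {keep_prob:.3f}")             # ~0.083
print(f"expected trained words over 10 epochs: {60 * keep_prob:.1f}")  # ~5
# A report of 3 is consistent with that expectation, since each of the
# 60 word-occurrences is kept or dropped at random.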
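
Putting the fixes together, a corrected version of the question's three lines might look like this (parameter values kept from the question for comparability, not as recommendations for real corpora):

from gensim.models import Word2Vec

sent = [['I', 'love', 'cats'], ['Dogs', 'are', 'friendly']]

# Passing `sentences` triggers one automatic training pass, so there is no
# separate .train() call; epochs is given to the constructor instead.
# min_count=1 is kept only so this toy corpus isn't filtered away entirely.
w2v_model = Word2Vec(
    sentences=sent,
    vector_size=100,
    window=7,
    min_count=1,
    sg=1,
    epochs=10,
)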
