How to interpret word2vec train output?
Question
Running the code snippet below reports an output of (3, 60). I wonder what exactly it is reporting?
The code is reproducible – just copy it into a notebook cell and run.
from gensim.models import Word2Vec
sent = [['I', 'love', 'cats'], ['Dogs', 'are', 'friendly']]
w2v_model = Word2Vec(sentences=sent, vector_size=100, window=7, min_count=1, sg=1)
w2v_model.train(sent, total_examples=len(sent), epochs=10)
(3, 60)
Answer 1
Score: 1
You seem to be using the Gensim Python library for your Word2Vec, & for internal reasons, the .train() method returns just the tuple (trained_word_count, raw_word_count).
The 1st number happens to be the number of words actually trained on – more on why this is only 3 for you below – & the 2nd is the total raw words passed to the training routines: just your 6 words times 10 epochs. But most users never need to consult these values.
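As a minimal sketch – reusing the question's own code & assuming the Gensim 4.x API it already implies – here's the same toy run with the returned tuple unpacked, so the two counts are visible by name:

from gensim.models import Word2Vec

sent = [['I', 'love', 'cats'], ['Dogs', 'are', 'friendly']]
w2v_model = Word2Vec(sentences=sent, vector_size=100, window=7, min_count=1, sg=1)

# .train() returns the tuple (trained_word_count, raw_word_count)
trained_word_count, raw_word_count = w2v_model.train(
    sent, total_examples=len(sent), epochs=10)

print(raw_word_count)      # 60: 6 raw words times 10 epochs
print(trained_word_count)  # small & varies run-to-run (3 in the question's run),
                           # because frequent-word downsampling is probabilistic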
A better way to monitor progress is to turn on logging to the INFO level – at which point you'll see many log lines of the model's steps & progress. By reading these, & over time, you'll start to recognize signs of a good run, or common errors (as when the totals or elapsed times don't seem consistent with what you thought you were doing).
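For example, a minimal sketch of the standard Python logging setup (the format string is just one common choice, not anything Gensim-specific):

import logging

# Route INFO-level log lines to the console; Gensim logs its vocabulary
# scan, training progress, & end-of-training totals at this level.
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO)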
Your 3 lines are already a bit off, though:

- If you pass your training corpus into the constructor, you don't have to also call .train() – that's already done for you, automatically. So, you're training twice here. (And, if you want epochs=10 for that automatic training, you can specify it in the constructor – see the sketch after this list.)
- With a tiny toy-sized corpus, word2vec learns no useful vectors – and even the reporting is more likely to reveal oddnesses that are irrelevant to more realistic-sized training runs. I recommend never training on anything less than hundreds-of-thousands of words, so that all your experiments reveal useful things about its usual operation, with minimal distractions from artifacts of unrealistic runs.
- In particular, here, since you only have 6 words total, each has a word frequency of ~17% of all words. In any real corpus, such a word would be unrealistically super-frequent – and thus all your words fall victim to what is (in real corpora) a very useful optimization: probabilistic highly-frequent-word dropping (tuned by the sample parameter). This is why, out of 60 words (6 words times 10 epochs), only 3 word occurrences were actually trained at all. (With truly frequent words in an adequately-sized corpus, dropping 19-out-of-20 appearances leaves plenty, & the overall model gets improved by spending relatively more effort on rarer words.)
- min_count=1 is essentially always a bad idea with real word2vec workloads, as words that only appear once can't get good vectors, but do waste model time/state. Ignoring such rare words completely is a standard practice. (If you need vectors for such words, you should find more training material, sufficient to demonstrate their varied uses, in context, repeatedly.)
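As promised above, a minimal sketch of the corrected setup – still on the toy corpus, purely for illustration. epochs=10 moves into the constructor so training happens exactly once, & sample=0 is an assumption added only to demonstrate the downsampling effect (it disables frequent-word dropping, so all 60 raw words would be trained) – not a recommendation for real corpora:

from gensim.models import Word2Vec

sent = [['I', 'love', 'cats'], ['Dogs', 'are', 'friendly']]

# Training happens once, inside the constructor; no separate .train() call.
w2v_model = Word2Vec(
    sentences=sent,
    vector_size=100,
    window=7,
    min_count=1,   # kept only so the toy corpus survives; unwise on real data
    sg=1,
    epochs=10,     # the epochs for the constructor's automatic training
    sample=0,      # disable frequent-word downsampling (illustration only)
)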
Comments