Gensim Pickle Error: Unable to Load the Saved Topic Model

Question

I am working on topic inference, which requires loading a previously saved model.

However, the load fails with a pickle error:

Traceback (most recent call last):
  File "topic_inference.py", line 35, in <module>
    model_for_inference = gensim.models.LdaModel.load(model_name, mmap = 'r')
  File "topic_modeling/env/lib/python3.8/site-packages/gensim/models/ldamodel.py", line 1663, in load
    result = super(LdaModel, cls).load(fname, *args, **kwargs)
  File "topic_modeling/env/lib/python3.8/site-packages/gensim/utils.py", line 486, in load
    obj = unpickle(fname)
  File "topic_modeling/env/lib/python3.8/site-packages/gensim/utils.py", line 1461, in unpickle
    return _pickle.load(f, encoding='latin1')  # needed because loading from S3 doesn't support readline()
TypeError: __randomstate_ctor() takes from 0 to 1 positional arguments but 2 were given

The code I use to load the model is simply:

gensim.models.LdaModel.load(model_name, mmap='r')

Here is the code that I use to create and save the model:

import gensim

model = gensim.models.ldamulticore.LdaMulticore(
    corpus=comment_corpus,
    # Now a gensim.corpora.Dictionary object; previously its .id2token attribute
    id2word=key_word_dict,
    chunksize=chunksize,
    alpha='symmetric',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=epochs,
    eval_every=eval_every,
    workers=15,
    minimum_probability=0.0,
)

model.save(output_model)

Here, output_model has no file extension such as .model or .pkl.

In the past, I used a similar approach, except that I passed the .id2token attribute of the gensim.corpora.Dictionary object to the id2word parameter instead of the full gensim.corpora.Dictionary object, and the model loaded fine back then. Does passing a corpora.Dictionary make a difference when loading the saved model? Back then I was using regular Python, whereas now I am using Anaconda. However, all package versions are the same.

Answer 1

Score: 2

Another report of an error about __randomstate_ctor (at <https://github.com/numpy/numpy/issues/14210>) suggests the problem may be related to numpy object pickling.

Is there a chance that the environment where the load fails uses a later version of NumPy than the one used when the model was saved? Could you try, at least temporarily, rolling back to an older NumPy (one that's still sufficient for the Gensim version you're using) to see if it helps?
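One way to compare the two environments is to print the installed package versions in each and look for differences. The sketch below uses only the standard library; the helper name installed_versions and the default package list are assumptions made for illustration, not anything Gensim provides.

```python
from importlib import metadata


def installed_versions(packages=("numpy", "gensim", "scipy")):
    """Return a {package: version} dict; version is None if the package is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Run this in both the saving and the loading environment and compare the output.
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

If the NumPy versions differ between the two environments, that mismatch is the first thing worth ruling out.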

If you find any configuration in which the load works, even a suboptimal one, you might be able to null out whatever random-related objects are causing the problem and re-save, giving you a saved version that loads in the configuration you actually want. Then, if the random-related objects are truly needed after reload, it may be possible to reconstitute them manually. (I haven't looked into this yet, but if you find a workaround that allows a load and then aren't sure what to null out or rebuild manually, I could take a closer look.)
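A minimal sketch of that null-out-and-rebuild idea follows. It assumes, without having been verified against every Gensim version, that the only problematic object is the model's random_state attribute (a numpy.random.RandomState). The helper names and the file names in the usage comments are hypothetical.

```python
def strip_random_state(model):
    """Set model.random_state to None so no NumPy RandomState is pickled; return the old value."""
    previous = getattr(model, "random_state", None)
    model.random_state = None
    return previous


def restore_random_state(model, seed=None):
    """Attach a fresh RandomState to a reloaded model (the original RNG state is not recovered)."""
    import numpy as np  # imported lazily so only this step needs NumPy

    model.random_state = np.random.RandomState(seed)
    return model


# Hypothetical usage; the file names are placeholders.
# In an environment where the load works:
#     model = gensim.models.LdaModel.load("saved_model")
#     strip_random_state(model)
#     model.save("saved_model_no_rng")
# In the target environment:
#     model = gensim.models.LdaModel.load("saved_model_no_rng")
#     restore_random_state(model, seed=42)
```

Restoring the attribute before running inference matters here, since an LDA model draws random numbers during inference and would presumably fail with random_state left as None.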

huangapple
  • Posted on 2023-02-18 08:02:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75490275.html