What is the best way to save fastText word vectors in a dataframe as numeric values?

Question

How can I save fastText word vectors in a DataFrame so that I can use them for further calculations?

Hello everyone!

I have a question about fastText word vectors: I'd like to know how to save them in my DataFrame as vectors rather than as objects. I want the column with the word vectors to be numeric, because my next step is to calculate the average of different word forms.

Right now I use the following line to save word vectors into my dataframe:

full_forms["word_vec"] = full_forms.apply(lambda row: ft.get_word_vector(row["word_sg"]), axis=1)

After getting the word vectors I try to calculate the average, but it does not work:

full_forms["average"] = full_forms.apply(lambda row: row["word_vec":"word_vec_pl"].mean(axis=0))

One idea is to save the word vectors to a list, which would then become a numpy.ndarray. But I am not sure whether that is a good choice. I expect this array to have 300 dimensions, as fastText word vectors have 300 dimensions, but when I check the number of dimensions with the arr.ndim attribute, I get 1. Shouldn't it be 300?

This is my first time asking for help here, so sorry if it is too messy.
Thank you in advance for your help!
Have a nice day!
Ana

Answer 1

Score: 0

For further calculations, the best approach is usually not to move the vectors into a DataFrame at all. Doing so brings up exactly these sorts of type/size issues, and adds indirection and data-structure overhead from the DataFrame's table/cell model.

Rather, leave them as the numpy.ndarray objects they already are: either the individual 300-dimensional arrays or, in some cases, the giant (number_of_words, vector_size) matrix that the FastText model itself uses to store all the words.
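
On the arr.ndim point from the question: ndim counts the axes of an array, not the number of values in it, so a single 300-dimensional word vector is a 1-D array whose shape is (300,). Here is a minimal sketch of that. Note that get_word_vector below is a hypothetical stand-in that returns random numbers so the snippet runs without a model file; with a loaded model you would call ft.get_word_vector instead.

```python
import zlib
import numpy as np

VECTOR_SIZE = 300  # assumption: matches the 300-dimensional model in the question

def get_word_vector(word):
    """Hypothetical stand-in for ft.get_word_vector (random values, not real embeddings)."""
    rng = np.random.default_rng(zlib.crc32(word.encode("utf-8")))
    return rng.standard_normal(VECTOR_SIZE).astype(np.float32)

vec = get_word_vector("house")
print(vec.ndim, vec.shape)  # one axis that holds 300 numbers

# Stacking several vectors gives a 2-D matrix with one row per word.
words = ["house", "houses", "mouse"]
matrix = np.stack([get_word_vector(w) for w in words])
print(matrix.ndim, matrix.shape)
```

So the value to check is arr.shape (or arr.shape[-1] for the vector size), not arr.ndim.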

Using numpy functions directly on these arrays will generally lead to the most concise and efficient code, with the least memory overhead.

For example, if word_list is a Python list of the words whose vectors you want to average:

import numpy as np

# ft is the loaded fastText model from the question
average_vector = np.mean([ft.get_word_vector(word) for word in word_list], axis=0)
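
If the words themselves live in a DataFrame, as in the question, the same idea can be applied row by row while keeping the vectors in plain numpy matrices. The following is only a sketch under stated assumptions: the word_pl column name is a guess at how the plural forms are stored, and get_word_vector is again a hypothetical stand-in for ft.get_word_vector that returns random values.

```python
import zlib
import numpy as np
import pandas as pd

def get_word_vector(word, size=300):
    """Hypothetical stand-in for ft.get_word_vector (random values, not real embeddings)."""
    rng = np.random.default_rng(zlib.crc32(word.encode("utf-8")))
    return rng.standard_normal(size).astype(np.float32)

# Toy data; "word_pl" is an assumed column name for the plural forms.
full_forms = pd.DataFrame({"word_sg": ["house", "mouse"],
                           "word_pl": ["houses", "mice"]})

# One (n_rows, 300) float matrix per word form, row-aligned with the DataFrame.
sg = np.stack([get_word_vector(w) for w in full_forms["word_sg"]])
pl = np.stack([get_word_vector(w) for w in full_forms["word_pl"]])

# Average the two forms row by row; the result is again (n_rows, 300).
average = np.mean([sg, pl], axis=0)

# Optional: a purely numeric DataFrame with one column per vector component.
average_df = pd.DataFrame(average, index=full_forms.index)
```

Row i of average corresponds to row i of full_forms, so the matrix can be indexed with the same positions without ever storing arrays inside DataFrame cells.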

huangapple
  • Published on May 21, 2023, 20:41:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76299956.html