What is the best way to save fastText word vectors in a dataframe as numeric values?
Question
How can I save fastText word vectors in a dataframe so that I can use them for further calculations?
Hello everyone!
I have a question about fastText word vectors: I'd like to know how to save them in my dataframe as vectors rather than as objects. I want the column with word vectors to be numeric, because my next step is to calculate the average between different word forms.
Right now I use the following line to save word vectors into my dataframe:
full_forms["word_vec"] = full_forms.apply(lambda row: ft.get_word_vector(row["word_sg"]), axis=1)
After getting the word vectors, I try to calculate the average, but it does not work:
full_forms["average"] = full_forms.apply(lambda row: row["word_vec":"word_vec_pl"].mean(axis=0))
One idea is to save the word vectors to a list, which would then become a numpy.ndarray, but I am not sure whether that is a good choice. I expect this array to have 300 dimensions, since fastText word vectors have 300 dimensions, but when I check the number of dimensions with the arr.ndim attribute, I get 1. Shouldn't it be 300?
This is my first time asking for help here, so sorry if it is too messy.
Thank you in advance for your help!
Have a nice day!
Ana
Answer 1
Score: 0
For further calculations, usually the best approach is to not move the vectors into a DataFrame at all, since doing so brings up these sorts of type/size issues and adds more indirection & data-structure overhead from the DataFrame's table/cells model.
Rather, leave them as the numpy.ndarray objects they are – either the individual 300-dimension arrays, or in some cases the giant (number_of_words, vector_size) matrix used by the FastText model itself to store all the words.
Using numpy functions directly on those will generally lead to the most concise & efficient code, with the least memory overhead.
For example, if word_list is a Python list of the words whose vectors you want to average:
average_vector = np.mean([ft.get_word_vector(word) for word in word_list], axis=0)
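On the arr.ndim question: ndim counts the number of axes of an array, not the length of the vector, so a single 300-element word vector has ndim 1 and shape (300,). Below is a minimal sketch of the numpy-only approach under some assumptions: get_word_vector here is a hypothetical stand-in for ft.get_word_vector (so the code runs without a model file), and the singular/plural word lists are made up for illustration.

```python
import numpy as np

VECTOR_SIZE = 300  # vector length mentioned in the question

# Hypothetical stand-in for ft.get_word_vector: returns a fixed random
# 1-D float32 vector per word so the sketch runs without a fastText model.
_rng = np.random.default_rng(0)
_fake_vectors = {}

def get_word_vector(word):
    if word not in _fake_vectors:
        _fake_vectors[word] = _rng.standard_normal(VECTOR_SIZE).astype(np.float32)
    return _fake_vectors[word]

# Illustrative word forms (assumed, not from the original post).
singular = ["cat", "dog", "house"]
plural = ["cats", "dogs", "houses"]

# Stack per-word vectors into 2-D matrices: one row per word.
sg_matrix = np.stack([get_word_vector(w) for w in singular])
pl_matrix = np.stack([get_word_vector(w) for w in plural])

one_vector = get_word_vector("cat")
print(one_vector.ndim, one_vector.shape)  # one axis; its length is in shape
print(sg_matrix.ndim, sg_matrix.shape)    # two axes: (words, vector size)

# Element-wise average of each singular/plural pair, one row per pair.
pair_average = (sg_matrix + pl_matrix) / 2

# Single average vector over all singular words (mean across rows).
average_vector = sg_matrix.mean(axis=0)
```

With a real model, replacing the stand-in with ft.get_word_vector should leave the rest unchanged. If the results need to go back into a DataFrame, keeping the matrix alongside it and sharing the row order is usually simpler than storing arrays in cells.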