May 21, 2023, 20:41:57

What is the best way to save fastText word vectors in a dataframe as numeric values?

Question

How can I best save fastText word vectors in a dataframe so that I can use them for further calculations?

Hello everyone!

I have a question about fastText word vectors: I'd like to know how to save them in my dataframe as vectors rather than objects. I want the column with word vectors to be numeric, as my next step is to calculate the average between different word forms.

Right now I use the following line to save word vectors into my dataframe:

full_forms["word_vec"] = full_forms.apply(lambda row: ft.get_word_vector(row["word_sg"]), axis=1)

After getting the word vectors I try to calculate the average, but it does not work:

full_forms["average"] = full_forms.apply(lambda row: row["word_vec":"word_vec_pl"].mean(axis=0))

One idea is to save the word vectors to a list, which would then become a numpy.ndarray. But I am not sure whether that is a good choice. I expect this array to have 300 dimensions, as fastText word vectors have 300 dimensions, but when I check the number of dimensions with the arr.ndim attribute, I get 1. Shouldn't it be 300?

This is my first time asking for help here, so sorry if it is too messy.
Thank you in advance for your help!
Have a nice day!
Ana

Answer 1

Score: 0

For further calculations, the best approach is usually not to move the vectors into a DataFrame at all. Doing so brings up these sorts of type/size issues, and adds indirection and data-structure overhead from the DataFrame's table/cell model.

Rather, leave them as the numpy.ndarray objects they are: either the individual 300-dimension arrays, or in some cases the giant (number_of_words, vector_size) matrix used by the FastText model itself to store all the words.

Using numpy functions directly on those will generally lead to the most concise & efficient code, with the least memory overhead.

For example, if word_list is a Python list of the words whose vectors you want to average:

average_vector = np.mean([ft.get_word_vector(word) for word in word_list], axis=0)
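
As a rough sketch of the same idea applied to paired word forms, the snippet below stacks the vectors into plain 2-D arrays and averages them with numpy. It also shows why arr.ndim reports 1: ndim counts the number of axes, while the length of 300 appears in arr.shape. The get_word_vector function here is a hypothetical stand-in for ft.get_word_vector so that the example runs without a trained model, and the word lists and the 300-dimension size are assumptions taken from the question rather than real data.

```python
import numpy as np

DIM = 300  # assumed vector size, as stated in the question


def get_word_vector(word: str) -> np.ndarray:
    """Hypothetical stand-in for ft.get_word_vector(word).

    Returns a deterministic pseudo-random vector so the sketch runs
    without a trained fastText model.
    """
    seed = sum(ord(ch) for ch in word)
    return np.random.default_rng(seed).standard_normal(DIM).astype(np.float32)


# Illustrative word lists (assumed; not taken from the asker's data).
singular = ["cat", "dog", "house"]
plural = ["cats", "dogs", "houses"]

vec = get_word_vector("cat")
# ndim is the number of axes (1 for a flat vector); the 300 is in shape.
print(vec.ndim, vec.shape)  # prints: 1 (300,)

# Stack the per-word vectors into (number_of_words, DIM) matrices.
sg_matrix = np.stack([get_word_vector(w) for w in singular])
pl_matrix = np.stack([get_word_vector(w) for w in plural])

# Average of each singular/plural pair, one row per pair: shape (3, 300).
pair_average = (sg_matrix + pl_matrix) / 2

# Average over all singular words, as in the one-liner above: shape (300,).
overall_average = sg_matrix.mean(axis=0)
```

Keeping the vectors in arrays like these means every entry is already a float, so none of the object-column issues from the question arise. If a DataFrame is still needed for the word labels, the row order of the matrices can simply be kept aligned with it.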
