April 13, 2023, 20:00:41

BERTopic model: Should I remove names?

Question

I am trying to create a movie recommender system and want to apply topic modelling to the movie descriptions. Currently, I am exploring the BERTopic model.

However, because I am working with movie descriptions, the output topics contain a lot of names. I want to remove the names to increase interpretability, but I was wondering whether this will affect the performance of the BERT model.

Sorry, the output is in Dutch, but as you can see there are a lot of names in the topic outputs.

Answer 1

Score: 2

Removing names generally should not hurt the quality of the resulting topic representations unless the people themselves are important to those topics. In your case, I would argue that removing names would help create a more intuitive representation of your topics.

It does matter, however, when you remove the names. If you remove them before the embedding step, that is, before passing the documents to BERTopic, it might hurt the contextual representation captured by the embeddings. Instead, I would advise using the CountVectorizer to remove these names, for example by passing them as stop words. In practice, this means that the names are removed after the clusters are created but before the topic representations are extracted.

Another step you can take is to use KeyBERTInspired or PartOfSpeech as your representation model, as they typically remove name-like information.
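A rough sketch of how a representation model might be plugged in follows. The helper `build_topic_model`, its `kind` argument, and the choice of the Dutch spaCy model `nl_core_news_sm` are assumptions made for this example; the original answer only names the two representation classes.

```python
def build_topic_model(kind="keybert", spacy_model="nl_core_news_sm"):
    """Return an unfitted BERTopic model that refines topic words.

    kind="keybert" uses KeyBERTInspired; kind="pos" uses PartOfSpeech with
    the given spaCy model (assumed here to be a Dutch one).
    """
    if kind not in ("keybert", "pos"):
        raise ValueError(f"unknown representation kind: {kind!r}")

    # Imported lazily because bertopic pulls in heavy dependencies.
    from bertopic import BERTopic
    from bertopic.representation import KeyBERTInspired, PartOfSpeech

    if kind == "keybert":
        representation_model = KeyBERTInspired()
    else:
        # Requires the spaCy model to be installed beforehand.
        representation_model = PartOfSpeech(spacy_model)

    return BERTopic(representation_model=representation_model)
```

For the `"pos"` variant, the spaCy model has to be downloaded first, for example with `python -m spacy download nl_core_news_sm`. Either variant can be combined with a `vectorizer_model` that lists names as stop words if some names still appear in the topics.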

Source: I am the author of BERTopic.
