BERTopic model: Should I remove names?

Question

I am trying to create a movie recommender system and want to use topic modelling on the movie descriptions. Currently, I am exploring the BERTopic model.

However, because I am working with movie descriptions, the output topics contain a lot of names. I want to remove the names to increase interpretability, but I was wondering whether this will affect the performance of the BERT model.

Sorry, the output is in Dutch, but as you can see there are a lot of names in the topic outputs.

[Image: BERTopic topic output]

Answer 1

Score: 2

Removing names generally should not hurt the quality of the resulting topic representations, unless the persons themselves are considered important to the topics. In your case, I would argue that removing names would help create a more intuitive representation of your topics.

It does, however, matter when you remove the names. If you remove them before the embedding step, that is, before passing the documents to BERTopic, it might hurt the contextual representation captured by the embeddings. Instead, I would advise using the CountVectorizer to remove these names, for example by passing them as stop words. In practice, this means that the names are removed after the clusters are created but before the topic representations are extracted.
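As a minimal sketch of the stop-word approach, assuming you already have a list of names to exclude (the `names` list and both helper functions below are hypothetical, made up for illustration): the names are split into lowercase tokens, because CountVectorizer lowercases and tokenizes the text by default, and the result is passed as `stop_words`. The raw descriptions, names included, still go into BERTopic, so the embeddings are unaffected.

```python
import re


def names_to_stopwords(names):
    """Split full names into lowercase tokens, mirroring CountVectorizer's defaults."""
    tokens = set()
    for name in names:
        # Default CountVectorizer behaviour: lowercase, tokens of 2+ word characters.
        tokens.update(re.findall(r"\b\w\w+\b", name.lower()))
    return sorted(tokens)


def build_topic_model(names):
    """Return a BERTopic model whose topic words exclude the given names."""
    # Imported lazily so the helper above can be tried without these packages.
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer_model = CountVectorizer(stop_words=names_to_stopwords(names))
    return BERTopic(vectorizer_model=vectorizer_model)


# Hypothetical list of names found in the movie descriptions.
names = ["Tom Hanks", "Meryl Streep"]
print(names_to_stopwords(names))  # ['hanks', 'meryl', 'streep', 'tom']

# Usage, with `docs` being the unmodified movie descriptions:
# topic_model = build_topic_model(names)
# topics, probs = topic_model.fit_transform(docs)
```

One caveat with this sketch: a name token that is also an ordinary word (for example "Rose") would be dropped from the topic words as well, so the list may need some manual pruning.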

Another step you can take is to use KeyBERTInspired or PartOfSpeech as your representation model, as they typically remove name-like information.
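A rough sketch of how a representation model might be wired in, under a few assumptions: the wrapper function and its `representation` argument are hypothetical, and the spaCy model name is only a placeholder (PartOfSpeech needs a spaCy pipeline that is installed and matches the language of the descriptions).

```python
SUPPORTED = ("keybert", "pos")


def build_topic_model(representation="keybert", spacy_model="en_core_web_sm"):
    """Return a BERTopic model that refines topic words with a representation model."""
    if representation not in SUPPORTED:
        raise ValueError(f"representation must be one of {SUPPORTED}")

    # Imported lazily so the argument check works without the packages installed.
    from bertopic import BERTopic
    from bertopic.representation import KeyBERTInspired, PartOfSpeech

    if representation == "keybert":
        # Keeps candidate words that are semantically close to the topic's documents.
        representation_model = KeyBERTInspired()
    else:
        # Keeps words matching part-of-speech patterns, using the given spaCy model.
        representation_model = PartOfSpeech(spacy_model)

    return BERTopic(representation_model=representation_model)


# Usage, with `docs` being the unmodified movie descriptions:
# topic_model = build_topic_model("keybert")
# topics, probs = topic_model.fit_transform(docs)
```

Either option only changes how the topic words are selected after clustering, so it can be combined with the stop-word list if some names still appear.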

Source: I am the author of BERTopic.

huangapple
  • Posted on April 13, 2023 at 20:00:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76005162.html