BERTopic model: Should I remove names?
Question
I am trying to build a movie recommender system and want to apply topic modelling to the movie descriptions. Currently, I am exploring the BERTopic model.
However, because I am working with movie descriptions, the output topics contain a lot of names. I want to remove the names to increase interpretability, but I was wondering whether this will affect the performance of the BERT model.
Sorry, the output is in Dutch, but as you can see there are a lot of names in the topic outputs.
Answer 1
Score: 2
Removing names generally should not hurt the quality of the resulting topic representations unless the persons themselves are considered important to the topics. In your case, I would argue that removing names would help create a more intuitive representation of your topics.
It does matter, however, when you remove the names. If you remove them before the embedding step, that is, before passing the documents to BERTopic, it might hurt the contextual representation of the embeddings. Instead, I would advise using the CountVectorizer to remove these names; you can pass them as stop words, for example. In practice, this means that the names are removed after the clusters are created but before the topic representations are extracted.
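A minimal sketch of the stop-word approach, assuming you already have a list of names to drop (the helper names `name_stop_words` and `build_topic_model` and the example names are hypothetical, not part of BERTopic). The names are split with the same token pattern that CountVectorizer uses by default, so that multi-word or hyphenated names match the tokens the vectorizer actually sees:

```python
import re


def name_stop_words(names):
    """Split full names into lowercase tokens usable as stop words.

    Mirrors CountVectorizer's default tokenization (lowercasing, tokens of
    two or more word characters). Hypothetical helper for illustration.
    """
    tokens = set()
    for name in names:
        tokens.update(re.findall(r"\b\w\w+\b", name.lower()))
    return sorted(tokens)


def build_topic_model(names, language="multilingual"):
    """Create a BERTopic model whose topic words exclude the given names."""
    # Imported lazily so name_stop_words can be used without these packages.
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    # The vectorizer is only applied when topic words are extracted,
    # so the embeddings still see the full, unmodified documents.
    vectorizer_model = CountVectorizer(stop_words=name_stop_words(names))
    return BERTopic(language=language, vectorizer_model=vectorizer_model)


# Usage (docs is your own list of movie descriptions):
# topic_model = build_topic_model(["Tom Hanks", "Jean-Claude Van Damme"])
# topics, probs = topic_model.fit_transform(docs)
```

Note that common name parts can collide with ordinary words (for example "van" in Dutch), so it may be worth reviewing the generated list before using it.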
Another step you can take is using KeyBERTInspired or PartOfSpeech as your representation model, as they typically remove name-like information.
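As a rough sketch of the representation-model option, the snippet below selects one of the two models and passes it to BERTopic. The function name `build_model_with_representation` and the `kind` argument are hypothetical, and the spaCy model `nl_core_news_sm` is only an assumption for Dutch text; it has to be installed separately if you use PartOfSpeech:

```python
def build_model_with_representation(kind="keybert", spacy_model="nl_core_news_sm"):
    """Create a BERTopic model with a KeyBERTInspired or PartOfSpeech representation.

    Hypothetical helper for illustration; `kind` is "keybert" or "pos".
    """
    if kind not in ("keybert", "pos"):
        raise ValueError(f"unknown representation kind: {kind!r}")

    # Imported lazily so the argument check works without these packages.
    from bertopic import BERTopic
    from bertopic.representation import KeyBERTInspired, PartOfSpeech

    if kind == "keybert":
        representation_model = KeyBERTInspired()
    else:
        # PartOfSpeech relies on a spaCy pipeline for the document language.
        representation_model = PartOfSpeech(spacy_model)

    # The representation model only refines the topic words; the documents
    # are still embedded and clustered with their names intact.
    return BERTopic(language="multilingual", representation_model=representation_model)


# Usage (docs is your own list of movie descriptions):
# topic_model = build_model_with_representation("keybert")
# topics, probs = topic_model.fit_transform(docs)
```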
Source: I am the author of BERTopic.