Apply a Python package (spaCy) word list covering only a specific language's vocabulary
Question
I need to filter out non-core German words from a text using spaCy. However, I couldn't find a suitable approach or word list that covers only the essential vocabulary of the German language.
I have tried different approaches using the spaCy tools `nlp(word).has_vector` and `nlp(word).vector_norm == 0`, as well as word lists such as `list(nlp.vocab.strings)` from 'de_core_news_sm' or 'de_core_news_lg', but they either recognize irrelevant words as part of the German language or fail to recognize basic German words.
I'm looking for recommendations on how to obtain or create a word list that accurately covers only the core vocabulary of the German language and can be used with (preferably) spaCy or another NLP package. I would prefer a universal, not German-specific, language package, so that I can extend to other languages just as easily.
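Concretely, the kind of check I tried looks like this (a minimal sketch; `seems_german` is just an illustrative helper name, and it assumes the `de_core_news_lg` model, which ships with word vectors):
```python
import spacy

# requires: python -m spacy download de_core_news_lg
nlp = spacy.load("de_core_news_lg")

def seems_german(word: str) -> bool:
    # A word "exists" for the model if it has a non-zero word vector.
    # This over-accepts: the vectors table also covers names, typos, etc.
    token = nlp(word)[0]
    return token.has_vector and token.vector_norm > 0

print(seems_german("Haus"))   # common German word -> True
print(seems_german("qwxyz"))  # gibberish -> usually False
```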
Answer 1
Score: 1
You can try a frequency-based approach. For this, you should use a frequency list that ranks words by how frequently they are used in written or spoken German. Here is an example [repo][1]. Alternatively, you can build one yourself from a large corpus, as sketched below.
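If you build the list yourself, the idea is just to count stemmed tokens over a large corpus and save the counts. A minimal sketch, assuming a hypothetical plain-text corpus file `corpus_de.txt`:
```python
from collections import Counter

import nltk
import pandas as pd

stemmer = nltk.stem.Cistem()
counts = Counter()

# corpus_de.txt is a hypothetical plain-text German corpus
with open('corpus_de.txt', encoding='utf-8') as f:
    for line in f:
        for word in line.split():
            token = word.strip('.,!?;:"()').lower()
            if token:
                counts[stemmer.stem(token)] += 1

# Save in the shape the code below expects: a 'word' column and a 'freq' column
df = pd.DataFrame(sorted(counts.items(), key=lambda kv: -kv[1]),
                  columns=['word', 'freq'])
df.to_csv('decow_wordfreq_cistem.csv', index=False)
```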
I can show a very basic version using **spaCy**:
- Define a function to filter out non-core German words. The function should check if a token is in the frequency list.
- Process your text and apply the function to each token in the processed text.
```python
import spacy
import pandas as pd
import nltk

nlp = spacy.load("de_core_news_sm")
stemmer = nltk.stem.Cistem()

# Load a frequency list of German words (stems and their corpus frequencies)
df = pd.read_csv('~/decow_wordfreq_cistem.csv', index_col=['word'])

# Define a function to filter out non-core German words: a token counts as
# "core" if its CISTEM stem appears in the frequency list with freq > 0
def is_core_german_word(token):
    stem = stemmer.stem(token.text.lower())
    return stem in df.index and df.at[stem, 'freq'] > 0

# Process your text
text = "Lass uns ein bisschen Spaß haben!"
doc = nlp(text)

# Keep only the core German words
core_german_words = [token.text for token in doc if is_core_german_word(token)]
print(core_german_words)
```
Note that the quality of the results will depend on the quality and coverage of the frequency list you use. You may need to combine multiple approaches, such as using CEFR levels or word embeddings, to obtain a word list that accurately covers only the core vocabulary of the German language; one simple combination is sketched below.
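For instance, one simple combination, continuing from the snippet above (the cutoff value is an arbitrary assumption, and the vector check needs a pipeline with word vectors such as de_core_news_lg):
```python
# Treat only the top-N most frequent stems as core vocabulary, and
# additionally require the token to have a word vector in the model.
TOP_N = 20000  # arbitrary cutoff; tune it to your notion of "core vocabulary"
core_stems = set(df.sort_values('freq', ascending=False).head(TOP_N).index)

def is_core_german_word_strict(token):
    stem = stemmer.stem(token.text.lower())
    return stem in core_stems and token.has_vector

core_german_words = [t.text for t in doc if is_core_german_word_strict(t)]
print(core_german_words)
```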
I am aware that this is very language-specific, but I thought it might be helpful if no other answer came up.