Apply a Python package (spaCy) word list covering only a specific language's vocabulary

Question

I need to filter out non-core German words from a text using spaCy. However, I couldn't find a suitable approach or word list that covers only the essential vocabulary of the German language.

I have tried several approaches using the spaCy checks `nlp(word).has_vector` and `nlp(word).vector_norm == 0`, as well as word lists such as `list(nlp.vocab.strings)` from `de_core_news_sm` or `de_core_news_lg`, but they either recognize irrelevant words as German or fail to recognize basic German words.

I'm looking for recommendations on how to obtain or create a word list that accurately covers only the core vocabulary of the German language and that can be used with (preferably) spaCy or another NLP package. I would prefer a universal rather than German-specific solution, so that I can extend the approach to other languages just as easily.

Answer 1

Score: 1

You could try a frequency-based approach. For this, use a frequency list that ranks words by how often they occur in written or spoken German; here is an example [repo][1]. Alternatively, you can build one yourself from a large corpus.
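
As a rough sketch of the do-it-yourself route (the corpus path and the 50,000-word cutoff below are assumptions, not part of the original answer):

```python
import collections
import re

# Hypothetical inputs: any large plain-text German corpus, plus a cutoff
# for how many of the most frequent words count as "core" vocabulary.
corpus_path = "german_corpus.txt"
core_size = 50_000

counts = collections.Counter()
with open(corpus_path, encoding="utf-8") as f:
    for line in f:
        # Lowercase and keep alphabetic runs, including umlauts and ß.
        counts.update(re.findall(r"[a-zäöüß]+", line.lower()))

core_vocab = {word for word, _ in counts.most_common(core_size)}
```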

I can show a very basic version using **spaCy**:

- Define a function to filter out non-core German words. The function should check if a token is in the frequency list.
- Process your text and apply the function to each token in the processed text.

```python
import spacy
import pandas as pd
import nltk

nlp = spacy.load("de_core_news_sm")
stemmer = nltk.stem.Cistem()

# Load a frequency list of German words, indexed by stemmed word
df = pd.read_csv('~/decow_wordfreq_cistem.csv', index_col=['word'])

# A token counts as a core German word if its stem appears in the
# frequency list with a non-zero frequency (the membership check
# guards against a KeyError for stems missing from the list)
def is_core_german_word(token):
    stem = stemmer.stem(token.text.lower())
    return stem in df.index and df.at[stem, 'freq'] > 0

# Process your text
text = "Lass uns ein bisschen Spaß haben!"
doc = nlp(text)

# Keep only the core German words
core_german_words = [token.text for token in doc if is_core_german_word(token)]

print(core_german_words)
```

Note that the quality of the results will depend on the quality and coverage of the frequency list you use. You may need to combine multiple approaches, such as using CEFR levels or word embeddings, to obtain a word list that accurately covers only the core vocabulary of the German language.
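
For instance, a hedged sketch of one such combination, reusing `df` and `stemmer` from the example above (the AND-combination is an assumption, and static word vectors require `de_core_news_md` or `de_core_news_lg` rather than the small model):

```python
# Hypothetical: a token counts as "core" only if it both appears in the
# frequency list and has a static word vector in the loaded model.
def is_core_german_word_combined(token):
    stem = stemmer.stem(token.text.lower())
    in_freq_list = stem in df.index and df.at[stem, 'freq'] > 0
    return in_freq_list and token.has_vector
```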

I am aware that this is very language-specific, but I thought it might be helpful if no other answer came up.
