Apply a Python package (spaCy) word list covering only a specific language's vocabulary
Question
I need to filter out non-core German words from a text using spaCy. However, I couldn't find a suitable approach or word list that covers only the essential vocabulary of the German language.
I have tried different approaches using the spaCy tools `nlp(word).has_vector` and `nlp(word).vector_norm == 0`, as well as word lists such as `list(nlp.vocab.strings)` from 'de_core_news_sm' or 'de_core_news_lg', but they either recognize irrelevant words as part of the German language or fail to recognize basic German words.
I'm looking for recommendations on how to obtain or create a word list that accurately covers only the core vocabulary of the German language and can be used with (preferably) spaCy or another NLP package. I would prefer a universal, not German-specific, language package, so that I can extend to other languages just as easily.
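Concretely, the kind of check I tried looks like this (a minimal sketch; `seems_german` is just an illustrative helper name, and it assumes the `de_core_news_lg` model, which ships with word vectors):
```python
import spacy

# requires: python -m spacy download de_core_news_lg
nlp = spacy.load("de_core_news_lg")

def seems_german(word: str) -> bool:
    # A word "exists" for the model if it has a non-zero word vector.
    # This over-accepts: the vectors table also covers names, typos, etc.
    token = nlp(word)[0]
    return token.has_vector and token.vector_norm > 0

print(seems_german("Haus"))   # common German word -> True
print(seems_german("qwxyz"))  # gibberish -> usually False
```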
Answer 1
Score: 1
You can try a frequency-based approach. For this, you should use a frequency list that ranks words by how frequently they are used in written or spoken German. Here is an example [repo][1]. Alternatively, you can build one yourself from a large corpus, as sketched below.
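If you build the list yourself, the idea is just to count stemmed tokens over a large corpus and save the counts. A minimal sketch, assuming a hypothetical plain-text corpus file `corpus_de.txt`:
```python
from collections import Counter

import nltk
import pandas as pd

stemmer = nltk.stem.Cistem()
counts = Counter()

# corpus_de.txt is a hypothetical plain-text German corpus
with open('corpus_de.txt', encoding='utf-8') as f:
    for line in f:
        for word in line.split():
            token = word.strip('.,!?;:"()').lower()
            if token:
                counts[stemmer.stem(token)] += 1

# Save in the shape the code below expects: a 'word' column and a 'freq' column
df = pd.DataFrame(sorted(counts.items(), key=lambda kv: -kv[1]),
                  columns=['word', 'freq'])
df.to_csv('decow_wordfreq_cistem.csv', index=False)
```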
I can show a very basic version using **spaCy**:
- Define a function to filter out non-core German words. The function should check if a token is in the frequency list.
- Process your text and apply the function to each token in the processed text.
```python
import spacy
import pandas as pd
import nltk

nlp = spacy.load("de_core_news_sm")
stemmer = nltk.stem.Cistem()

# Load a frequency list of German words (stems and their corpus frequencies)
df = pd.read_csv('~/decow_wordfreq_cistem.csv', index_col=['word'])

# Define a function to filter out non-core German words: a token counts as
# "core" if its CISTEM stem appears in the frequency list with freq > 0
def is_core_german_word(token):
    stem = stemmer.stem(token.text.lower())
    return stem in df.index and df.at[stem, 'freq'] > 0

# Process your text
text = "Lass uns ein bisschen Spaß haben!"
doc = nlp(text)

# Keep only the core German words
core_german_words = [token.text for token in doc if is_core_german_word(token)]
print(core_german_words)
```
Note that the quality of the results will depend on the quality and coverage of the frequency list you use. You may need to combine multiple approaches, such as using CEFR levels or word embeddings, to obtain a word list that accurately covers only the core vocabulary of the German language; one simple combination is sketched below.
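For instance, one simple combination, continuing from the snippet above (the cutoff value is an arbitrary assumption, and the vector check needs a pipeline with word vectors such as de_core_news_lg):
```python
# Treat only the top-N most frequent stems as core vocabulary, and
# additionally require the token to have a word vector in the model.
TOP_N = 20000  # arbitrary cutoff; tune it to your notion of "core vocabulary"
core_stems = set(df.sort_values('freq', ascending=False).head(TOP_N).index)

def is_core_german_word_strict(token):
    stem = stemmer.stem(token.text.lower())
    return stem in core_stems and token.has_vector

core_german_words = [t.text for t in doc if is_core_german_word_strict(t)]
print(core_german_words)
```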
I am aware that this is very language-specific, but I thought it might be helpful if no other answer came up.