How do I set the number of extracted keywords with the Rake algorithm in Python?


Question

When extracting keywords with Rake, the algorithm creates candidate phrases, ranks them by their score, and returns the phrases that reach at least a certain score.

How can I set this minimum score, or set the minimum number of extracted keywords, or at least get all the candidate phrases?

from rake_nltk import Rake

r = Rake()

# Extraction given the text.
r.extract_keywords_from_text(text)

keywords = r.get_ranked_phrases()
print(keywords)

This is the standard procedure, but I would like to know whether there is a different function, or a parameter I can set, to get not just the most significant keywords but all of them, or at least a larger number of them.

Answer 1

Score: 1

Let's try to walk through the code at https://csurfer.github.io/rake-nltk/_build/html/_modules/rake_nltk/rake.html:

We start with the .extract_keywords_from_text() function, which is a wrapper over .extract_keywords_from_sentences():

    def extract_keywords_from_text(self, text: str):
        sentences: List[Sentence] = self._tokenize_text_to_sentences(text)
        self.extract_keywords_from_sentences(sentences)


    def extract_keywords_from_sentences(self, sentences: List[Sentence]):
        phrase_list: List[Phrase] = self._generate_phrases(sentences)
        self._build_frequency_dist(phrase_list)
        self._build_word_co_occurance_graph(phrase_list)
        self._build_ranklist(phrase_list)

The first step of the Rake algorithm in .extract_keywords_from_sentences() is to generate the possible phrases using the ._generate_phrases() function:


    def _generate_phrases(self, sentences: List[Sentence]) -> List[Phrase]:
        phrase_list: List[Phrase] = []
        # Create contender phrases from sentences.
        for sentence in sentences:
            word_list: List[Word] = [word.lower() for word in self._tokenize_sentence_to_words(sentence)]
            phrase_list.extend(self._get_phrase_list_from_words(word_list))

        # Based on user's choice to include or not include repeated phrases
        # we compute the phrase list and return it. If not including repeated
        # phrases, we only include the first occurance of the phrase and drop
        # the rest.
        if not self.include_repeated_phrases:
            unique_phrase_tracker: Set[Phrase] = set()
            non_repeated_phrase_list: List[Phrase] = []
            for phrase in phrase_list:
                if phrase not in unique_phrase_tracker:
                    unique_phrase_tracker.add(phrase)
                    non_repeated_phrase_list.append(phrase)
            return non_repeated_phrase_list

        return phrase_list

This calls the ._get_phrase_list_from_words() function to extract the phrases; note the docstring:

> Method to create contender phrases from the list of words that form a
> sentence by dropping stopwords and punctuations and grouping the left
> words into phrases. Only phrases in the given length range (both
> limits inclusive) would be considered to build co-occurrence matrix.

> Ex:
>
> Sentence: Red apples, are good in flavour.
>
> List of words: ['red', 'apples', ",", 'are', 'good', 'in', 'flavour']
>
> List after dropping punctuations and stopwords:
>
> List of words: ['red', 'apples', *, *, 'good', *, 'flavour']
> List of phrases: [('red', 'apples'), ('good',), ('flavour',)]
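
To make the grouping concrete, here is a minimal sketch of the contender-phrase idea from that docstring (my own re-implementation for illustration, not the library's exact code): split the tokenized sentence on stopwords and punctuation, and each remaining run of words becomes one candidate phrase.

import string
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

stop = set(stopwords.words('english'))
words = ['red', 'apples', ',', 'are', 'good', 'in', 'flavour']

phrases, current = [], []
for word in words:
    # A stopword or punctuation token ends the current phrase.
    if word in stop or all(ch in string.punctuation for ch in word):
        if current:
            phrases.append(tuple(current))
            current = []
    else:
        current.append(word)
if current:
    phrases.append(tuple(current))

print(phrases)  # [('red', 'apples'), ('good',), ('flavour',)]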

Short conclusion

There is an upper bound on the number of phrases: the first step already creates the "contender phrases", so the maximum number of phrases you can extract using rake-nltk is determined by your dataset and the rules in ._get_phrase_list_from_words().
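
Those rules are partly configurable: both the phrase length range mentioned in the docstring above and the include_repeated_phrases flag seen in ._generate_phrases() can be set on the constructor. A small sketch (parameter names as in rake-nltk 1.0.x; check your installed version):

from rake_nltk import Rake

r = Rake(
    min_length=1,                   # shortest phrase, in words (inclusive)
    max_length=4,                   # longest phrase, in words (inclusive)
    include_repeated_phrases=True,  # keep every occurrence, not only the first
)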

TL;DR

While you cannot raise that maximum number of phrases, you can access the full ranked list with:

from rake_nltk import Rake
from nltk.corpus import reuters

# Build a sample text from the first 1000 Reuters sentences
# (requires the NLTK 'reuters' corpus to be downloaded).
text = '\n'.join([' '.join(s) for s in reuters.sents()[:1000]])

r = Rake()
r.extract_keywords_from_text(text)
print(r.rank_list)  # every (score, phrase) tuple, not just the top ones
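
If you prefer the public accessors over the rank_list attribute, rake-nltk also exposes get_ranked_phrases() (used in the question) and, in current versions, get_ranked_phrases_with_scores(); both return the full ranked list rather than a truncated one:

print(r.get_ranked_phrases_with_scores()[:10])  # top 10 (score, phrase) tuples
print(len(r.get_ranked_phrases()))              # total number of candidate phrases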

Note:

  • rank_list returns a list of tuples where the 1st item is the rake score (which skews towards giving longer "phrases" higher scores) and the 2nd item is the "phrase" itself.

  • The score might therefore be re-normalized by the length of the phrase, e.g.

from rake_nltk import Rake
from collections import Counter
from nltk.corpus import reuters

text = '\n'.join([' '.join(s) for s in reuters.sents()[:1000]])

r = Rake()
r.extract_keywords_from_text(text)

normalized_rank_phrases = Counter()

for score, phrase in r.rank_list:
    if len(phrase.split()) > 1:
        # Penalize longer phrases by dividing by the squared word count.
        normalized_rank_phrases[phrase] = score / len(phrase.split())**2

normalized_rank_phrases.most_common()

[out]:

[('qtrly div', 7.318181818181818),
 ('73 cts', 7.050699300699301),
 ('33 cts', 5.225699300699301),
 ('independent chairman', 4.845833333333333),
 ('60 cts', 4.741875771287536),
 ('42 cts', 4.550699300699301),
 ('30 cts', 4.498615967365968),
 ('nine mths', 4.453571428571428),
 ('73 pct', 4.4229452054794525),
 ('net profit', 4.156236497191416),
 ('fleet financial', 4.125),
 ('net loss', 4.056598347798808),
 ('hikes dividend', 4.015151515151516),
 ('65 cts', 3.988199300699301),
 ('declares one', 3.9166666666666665),
 ('50 cts', 3.868881118881119),
... ]
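
Coming back to the original question: since rank_list exposes every candidate together with its score, applying a minimum score or a fixed number of keywords is just a filter over it (a sketch; the 2.0 threshold and the cap of 50 are arbitrary example values):

min_score = 2.0  # arbitrary example threshold
top_n = 50       # arbitrary example cap

# rank_list is built sorted by descending score, so slicing gives the top-n.
by_score = [phrase for score, phrase in r.rank_list if score >= min_score]
by_count = [phrase for _, phrase in r.rank_list[:top_n]]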
