Python - Find a phrase inside a long text

Question

What's the most efficient way to find a phrase inside a longer text with Python?
I would like to find the complete phrase, but if it is not found, split it into smaller parts and try to find those, down to single words.

For example, I have a text:

> Paragraphs are the building blocks of papers. Many students define paragraphs in terms of length: a paragraph is a group of at least five sentences, a paragraph is half a page long, etc...There are many techniques for brainstorming; whichever one you choose, this stage of paragraph development cannot be skipped.

I want to find the phrase: there is a group of students

The entire phrase as it is will not be found, but its smaller parts will. So it should find:

  • there
  • is a group of
  • students

Is this even possible? If so, what's the most efficient algorithm to achieve it?

I tried some recursive functions, but they cannot find these sub-parts of the phrase: they either find the entire phrase or only the single words.

Answer 1

Score: 3

If you want a robust approach that works on the word level but can also handle cases such as "...That" vs. "that", I'd recommend some basic NLP with NLTK.
This is a good fit if you're working with a small to medium-sized data set.

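from nltk import ngrams, word_tokenize

# Note: word_tokenize may require downloading NLTK's tokenizer data first

text = "..."  # your text
query = "there is a group of students"

def preprocess(raw):
    # Tokenize and lowercase so that "There" and "there" compare equal
    return [token.lower() for token in word_tokenize(raw)]

def extract_ngrams(tokens, min_n, max_n):
    # Set of all n-grams (as tuples) for n = min_n..max_n
    return set(ngram for n in range(min_n, max_n + 1) for ngram in ngrams(tokens, n))

query_tokens = preprocess(query)

min_n = 1
# The longest useful n-gram is the whole query, counted in tokens (not characters)
max_n = len(query_tokens)

text_ngrams = extract_ngrams(preprocess(text), min_n, max_n)
query_ngrams = extract_ngrams(query_tokens, min_n, max_n)

# n-grams shared by the text and the query
print(text_ngrams & query_ngrams)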

Output:

{('a',),
 ('a', 'group'),
 ('a', 'group', 'of'),
 ('group',),
 ('group', 'of'),
 ('is',),
 ('is', 'a'),
 ('is', 'a', 'group'),
 ('is', 'a', 'group', 'of'),
 ('of',),
 ('students',),
 ('there',)}
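
The intersection above lists every shared n-gram, including the shorter ones nested inside "is a group of". To get only the pieces the question asks for, one option is a greedy left-to-right pass that always takes the longest matching slice. The following is a minimal sketch, not part of the original answer: it uses a simple regex tokenizer instead of NLTK so it runs without extra downloads, and the helper names `tokenize`, `text_ngram_set` and `longest_pieces` are made up for illustration.

```python
import re

def tokenize(raw):
    # Lowercase and keep only runs of letters, digits and apostrophes
    return re.findall(r"[a-z0-9']+", raw.lower())

def text_ngram_set(tokens, max_n):
    # Every contiguous n-gram of the text, for n = 1..max_n, as tuples
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }

def longest_pieces(text, query):
    query_tokens = tokenize(query)
    known = text_ngram_set(tokenize(text), len(query_tokens))
    pieces, start = [], 0
    while start < len(query_tokens):
        # Try the longest remaining slice first, then shrink it
        for end in range(len(query_tokens), start, -1):
            if tuple(query_tokens[start:end]) in known:
                pieces.append(" ".join(query_tokens[start:end]))
                start = end
                break
        else:
            # This word does not occur in the text at all; skip it
            start += 1
    return pieces

text = (
    "Paragraphs are the building blocks of papers. Many students define "
    "paragraphs in terms of length: a paragraph is a group of at least five "
    "sentences, a paragraph is half a page long, etc...There are many "
    "techniques for brainstorming; whichever one you choose, this stage of "
    "paragraph development cannot be skipped."
)

print(longest_pieces(text, "there is a group of students"))
# ['there', 'is a group of', 'students']
```

Building the n-gram set costs memory proportional to the text length times the query length, so for very large texts a different index would be worth considering.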

Answer 2

Score: 1

The simplest approach is to generate every contiguous slice of the phrase you want to find and check whether the text contains it, using if phrase_slice in paragraph.

To get the slices you can use a double loop: the outer loop determines how many words of the phrase to include and the inner loop shifts the starting word. An example would look something like this:

text = "Paragraphs are the building blocks of papers. Many students define paragraphs in terms of length: a paragraph is a group of at least five sentences, a paragraph is half a page long, etc...There are many techniques for brainstorming; whichever one you choose, this stage of paragraph development cannot be skipped."
phrase = ["there", "is", "a", "group", "of", "students"]

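lowered = text.lower()  # make the search case-insensitive ("There" vs. "there")

for i in range(len(phrase)):
    n_words = len(phrase) - i  # slice length, longest first
    for j in range(i + 1):  # every start offset at which a slice of this length fits
        phrase_slice = phrase[j:n_words + j]
        # Note: `in` is a substring test, so "is" also matches inside "this"
        if " ".join(phrase_slice) in lowered:
            print(" ".join(phrase_slice))  # do stuff with the match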
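
One caveat of checking with `in` is that it tests for substrings, not whole words. If word-level matches are wanted, a regular expression with word boundaries is one way to do it. This is a small sketch rather than part of the original answer, and `contains_phrase` is a hypothetical helper name.

```python
import re

def contains_phrase(text, words):
    # \b anchors the match at word boundaries, so "is" no longer matches inside "this"
    pattern = r"\b" + r"\s+".join(re.escape(word) for word in words) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

print(contains_phrase("this stage cannot be skipped", ["is"]))    # False
print(contains_phrase("a paragraph is a group of", ["is", "a"]))  # True
print("is" in "this stage cannot be skipped")                     # True (plain substring test)
```

`re.IGNORECASE` also removes the need to lowercase the text beforehand.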

Answer 3

Score: 0

Would something like this work?

from itertools import combinations

#Load up paragraph and phrase to be searched
paragraph = "Paragraphs are the building blocks of papers. Many students define paragraphs in terms of length: a paragraph is a group of at least five sentences, a paragraph is half a page long, etc...There are many techniques for brainstorming; whichever one you choose, this stage of paragraph development cannot be skipped."

phrase = "there is a group of students"

phrase_words = phrase.split()

# Generate every combination of words from the phrase (in phrase order)
phrase_sections = []
for i in range(1, len(phrase_words)):
    for combination in combinations(phrase_words, i):
        phrase_sections.append(' '.join(combination))

# Search for each section in the paragraph; also requiring it to occur in the
# phrase is a quick and dirty way to keep only runs of adjacent words
for section in phrase_sections:
    if section in phrase and section in paragraph:
        print(section)

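This outputs the following ("there" is missing because the search is case-sensitive and the text only contains "There"):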

is
a
group
of
students
is a
a group
group of
is a group
a group of
is a group of
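
Since only runs of adjacent words survive the filter, it may be cheaper to generate those runs directly instead of filtering all combinations. The sketch below is an illustration, not the answer's own code, and `contiguous_slices` is a hypothetical helper name. It compares how many candidates each approach produces for the six-word phrase.

```python
from itertools import combinations

def contiguous_slices(words):
    # All runs of adjacent words, longest first
    n = len(words)
    return [
        " ".join(words[start:start + size])
        for size in range(n, 0, -1)
        for start in range(n - size + 1)
    ]

words = "there is a group of students".split()
slices = contiguous_slices(words)
all_combinations = [c for size in range(1, len(words)) for c in combinations(words, size)]

print(len(slices))            # 21
print(len(all_combinations))  # 62
```

The number of contiguous slices grows quadratically with the phrase length, while the number of combinations grows exponentially, so the difference matters for longer phrases.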

Answer 4

Score: 0

TEXT = "Paragraphs are the building blocks of papers. Many students define paragraphs in terms of length: a paragraph is a group of at least five sentences, a paragraph is half a page long, etc... There are many techniques for brainstorming; whichever one you choose, this stage of paragraph development cannot be skipped."

TO_FIND = "there is a group of students"

def preprocess_text(text):
    # str methods return new strings, so the cleaned text has to be returned
    text = text.lower()
    for mark in ("...", ".", ",", ";", ":", "!", "?"):
        text = text.replace(mark, "")
    return text


def find_words(words, targets):
    # (position, word) for every text word that also occurs in the phrase
    return [(idx, word) for idx, word in enumerate(words) if word in targets]


def find_groups(occurrences):
    # Collect maximal runs of matches that sit at consecutive text positions
    groups, group = [], []
    for occurrence in occurrences:
        if group and occurrence[0] != group[-1][0] + 1:
            if len(group) > 1:
                groups.append(group)
            group = []
        group.append(occurrence)
    if len(group) > 1:
        groups.append(group)
    return groups


if __name__ == "__main__":
    words = preprocess_text(TEXT).split()
    occurrences = find_words(words, set(TO_FIND.split()))
    solutions = [" ".join(word for _, word in group) for group in find_groups(occurrences)]
    # Add each matched single word that is not already listed
    for _, word in occurrences:
        if word not in solutions:
            solutions.append(word)
    for solution in solutions:
        print(solution)

The code is a bit complex, but it works nicely: 54 ms execution time on my machine.

The code first splits the input text into words and keeps those that also occur in the phrase to find. It then recombines words at adjacent positions into groups, adds the individual matched words that are not already listed, and prints everything.

I hope that helps!

huangapple
  • Posted on July 31, 2023, 20:24:05
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76803595.html