Count words in a sentence controlling for negations

Question

I am trying to count the number of times certain words occur in a sentence while controlling for negations. In the example below, I wrote some very basic code that counts how many times each word in "w" appears in "txt". However, it fails to control for negations like "don't" and/or "not".

w = ["hello", "apple"]
txt = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."

for word in w:
    # str.count matches substrings, so "apple" is also found inside "apples"
    print(txt.count(word))

The code should count only the non-negated occurrences of "apple", not all 4. So I would like to add a rule: if there is a negation within n words before or after a word in "w", don't count that occurrence; otherwise, count it.

N.B. Negations here are words like "don't" and "not".

Can anyone help me with this?

Thanks a lot for your help!

Answer 1

Score: 1

Firstly, before you consider the negations/negatives, str.count might not be doing what you're expecting.

text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."

text.count('apple') # Outputs: 4

But if you do:

text = "The thief grappled the pineapples and ran away with a basket of apples"

text.count('apple') # Outputs: 3

If you want to count the words, you would need to do some tokenization first to change the string into a list of strings, e.g.

from collections import Counter

import nltk
from nltk import word_tokenize

nltk.download('punkt')

text = "The thief grappled the pineapples and ran away with a basket of apples"

Counter(word_tokenize(text))['apple'] # Output: 0
Counter(word_tokenize(text))['apples'] # Output: 1

Then you would need to ask yourself whether plurals matter when you count the number of times apple/apples occur. If they do, you would have to do some stemming or lemmatization; see https://stackoverflow.com/questions/17317418/stemmers-vs-lemmatizers

This tutorial might be helpful: https://www.kaggle.com/code/alvations/basic-nlp-with-nltk
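
If you'd rather not install anything to try this out, here is a minimal standard-library sketch of the same idea. The regex tokenizer and the `normalize` helper are my own simplifications (assumptions for illustration, not NLTK functions), and the plural rule is deliberately naive; a real stemmer or lemmatizer handles far more cases.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and keep runs of letters/apostrophes as tokens.
    return re.findall(r"[a-z']+", text.lower())

def normalize(token):
    # Naive plural folding (hypothetical helper): "apples" -> "apple".
    # It will mangle words like "this", so treat it as a toy stand-in
    # for proper stemming or lemmatization.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

text = "The thief grappled the pineapples and ran away with a basket of apples"

counts = Counter(normalize(t) for t in tokenize(text))
print(counts["apple"])      # whole-word matches after plural folding
print(text.count("apple"))  # substring matches, includes "grappled" and "pineapples"
```

Here the token-based count only picks up the final "apples", while `str.count` also matches inside "grappled" and "pineapples".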


Assuming that you adopt lemmas and tokenizers and have settled on what counts as a "word" and how to count them, you still have to define what negation is and what you ultimately want to do with the counts.

Let's go with

> I want to break the text down into "chunks" or clauses that have positive and negative sentiment towards some object/nouns.

Then you would have to define what negative/positive means. In the simplest terms you might say

> any negation word that falls within the window around the focus noun counts as "negative", and any other case counts as "positive".

And if we try to code up this simplest way of quantifying negation, you would first have to

  • identify the focus word, let's take the word apple, and
  • then the window, let's say 5 words before and 5 words after (the code below simplifies this to sliding windows of 5 tokens).

In code:

from nltk import word_tokenize, ngrams

text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."

# word_tokenize splits "don't" into "do" + "n't", so match on the tokenized forms
NEGATIVE_WORDS = {"not", "n't"}

def count_negation(tokens):
    return sum(1 for word in tokens if word in NEGATIVE_WORDS)

# Slide a 5-token window over the text and report the windows containing the focus word
for window in ngrams(word_tokenize(text), 5):
    if "apple" in window or "apples" in window:
        print(count_negation(window), window)

[out]:

0 ('I', 'love', 'apples', ',', 'apple')
0 ('love', 'apples', ',', 'apple', 'are')
0 ('apples', ',', 'apple', 'are', 'my')
0 (',', 'apple', 'are', 'my', 'favorite')
0 ('apple', 'are', 'my', 'favorite', 'fruit')
1 ('do', "n't", 'really', 'like', 'apples')
1 ("n't", 'really', 'like', 'apples', 'if')
0 ('really', 'like', 'apples', 'if', 'they')
0 ('like', 'apples', 'if', 'they', 'are')
0 ('apples', 'if', 'they', 'are', 'too')
1 ('I', 'do', 'not', 'like', 'apples')
1 ('do', 'not', 'like', 'apples', 'if')
1 ('not', 'like', 'apples', 'if', 'they')
0 ('like', 'apples', 'if', 'they', 'are')
0 ('apples', 'if', 'they', 'are', 'immature')

Q: But isn't that over-counting, when "I do not like apples" gets counted 3 times even though the sentence/clause appears only once in the text?

Yes, it is over-counting, which brings us back to the question: what is the ultimate goal of counting the negations?
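
If the goal is simply what the question asks for (count each occurrence once, and skip it when a negation is nearby), one option is to loop over the occurrences of the focus word instead of over the windows. The following is a rough standard-library sketch under my own assumptions: the `NEGATIONS` set, the regex tokenizer that imitates the "do" + "n't" split, and the `count_non_negated` helper are all hypothetical names for illustration, and the window ignores sentence boundaries.

```python
import re

# Assumption: a small, hand-picked negation list; extend it for real use.
NEGATIONS = {"not", "no", "never", "n't"}

def tokenize(text):
    # Lowercase, then split contractions like "don't" into "do" + "n't".
    text = re.sub(r"n't\b", " n't", text.lower())
    return re.findall(r"n't|[a-z]+", text)

def count_non_negated(text, focus, n=3):
    """Return (affirmed, negated) counts for the focus words.

    An occurrence is 'negated' if any negation token lies within
    n tokens before or after it. Each occurrence is counted once.
    """
    tokens = tokenize(text)
    affirmed = negated = 0
    for i, token in enumerate(tokens):
        if token not in focus:
            continue
        window = tokens[max(0, i - n): i + n + 1]
        if NEGATIONS.intersection(window):
            negated += 1
        else:
            affirmed += 1
    return affirmed, negated

text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."

print(count_non_negated(text, {"apple", "apples"}, n=3))
```

With `n=3` the two mentions in the first sentence are counted as affirmed and the two after "don't" and "do not" as negated. Since the window does not stop at sentence boundaries, a large `n` will let a negation from the next sentence leak into the previous one, so you may want to split into sentences first.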

If the ultimate goal is to have a sentiment classifier then I think lexical approaches might not be as good as state-of-the-art language models, like:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-large"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."


prompt = f"""Do I like apples or not?
QUERY:{text}
OPTIONS:
 - Yes, I like apples
 - No, I hate apples
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
# Decode the generated token ids back into a string
tokenizer.decode(model.generate(input_ids)[0], skip_special_tokens=True)

[out]:

Yes, I like apples

Q: But what if I want to explain why the model assumes positive/negative sentiment towards apples? How can I do that without counting negations?

A: Good point. Explaining model outputs is an active research area, so there is no clear answer yet, but take a look at https://aclanthology.org/2022.coling-1.406

huangapple
  • Published on March 21, 2023, 00:08:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75792678.html