Matching two dataframes with texts

Question

I have a dataframe called df_profanity_en which contains a set of words. I also have another dataframe called df_ed with three columns (id, text, predicted_emotion).

I would like to create a new dataframe called df_filtered that contains only the texts of df_ed in which words from df_profanity_en are detected.

I have implemented the following code, but as you can see in the output, the text containing the word "assure" shouldn't be there: it is matched only because "ass" is a substring of "assure". Any idea how to solve this issue?

import pandas as pd

# Create a sample df_profanity_en DataFrame
df_profanity_en = pd.DataFrame({
    'word': ['bad', 'offensive', 'curse', 'vulgar', 'ass']
})

# Create a sample df_ed DataFrame
df_ed = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'text': ['This is a clean text. Lets assure', 'There is a bad word here', 'No profanity in this one', 'Watch your language!', 'Another clean text'],
    'predicted_emotion': ['happy', 'sad', 'neutral', 'angry', 'happy']
})

# Create a list of profane words from df_profanity_en
profane_words = df_profanity_en['word'].tolist()

# Filter the texts in df_ed that contain profane words
df_filtered = df_ed[df_ed['text'].str.contains('|'.join(profane_words), case=False, na=False)]

# Add a new column containing the profanity word
df_filtered['profane_word'] = df_filtered['text'].apply(lambda text: next((word for word in profane_words if word.lower() in text.lower()), None))

# Optional: Reset the index if needed
df_filtered.reset_index(drop=True, inplace=True)

# Print the filtered DataFrame
df_filtered

Output:

   id                               text predicted_emotion profane_word
0   1  This is a clean text. Lets assure             happy          ass
1   2           There is a bad word here               sad          bad

Answer 1

Score: 2

IIUC, you need to add word boundaries (\b) to the pattern so that only whole words match:

pat = "|".join(rf"\b{word}\b" for word in df_profanity_en["word"])
# '\\bbad\\b|\\boffensive\\b|\\bcurse\\b|\\bvulgar\\b|\\bass\\b'

m = df_ed["text"].str.contains(pat)

df_filtered = df_ed.loc[m]

Output:

print(df_filtered)

   id                      text predicted_emotion
1   2  There is a bad word here               sad
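
To see why the boundary matters, here is a minimal sketch that uses only Python's re module. The helper names substring_hits and whole_word_hits are hypothetical and exist only for this illustration; the word list and the sample text are taken from the question.

```python
import re

profane_words = ["bad", "offensive", "curse", "vulgar", "ass"]
text = "This is a clean text. Lets assure"

def substring_hits(text, words):
    """Plain substring test, as in the question's code."""
    lowered = text.lower()
    return [w for w in words if w.lower() in lowered]

def whole_word_hits(text, words):
    """Whole-word test: the pattern is anchored at word boundaries on both sides."""
    return [
        w for w in words
        if re.search(rf"\b{re.escape(w)}\b", text, flags=re.IGNORECASE)
    ]

print(substring_hits(text, profane_words))   # ['ass']  ("ass" is inside "assure")
print(whole_word_hits(text, profane_words))  # []
```

In "assure" the letters "ass" are followed by another word character, so there is no boundary after them and the second helper reports nothing.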

Use str.extract to create the profane_word column:

df_filtered = (
    df_ed.assign(profane_word= df_ed["text"].str.extract(f"({pat})"))
            .dropna(subset="profane_word")
)

Another variant (with a clearer pattern), suggested by @mozway:

import re

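# Single quotes inside the f-string: reusing double quotes there is a SyntaxError before Python 3.12
pat = fr"\b({'|'.join(map(re.escape, df_profanity_en['word']))})\b"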
#'\\b(bad|offensive|curse|vulgar|ass)\\b'

df_filtered = (
    df_ed.assign(profane_word= df_ed["text"].str.extract(pat))
            .dropna(subset="profane_word")
)

Output:

print(df_filtered)

   id                      text predicted_emotion profane_word
1   2  There is a bad word here               sad          bad
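
Note that str.extract keeps only the first match in each row, and that the patterns above are case-sensitive, unlike the case=False call in the question. If ignoring case and collecting every matching word is what you want (an assumption, the question does not say so), something along these lines might work. The profane_words column name and the third sample row are made up for illustration.

```python
import re
import pandas as pd

df_profanity_en = pd.DataFrame({"word": ["bad", "offensive", "curse", "vulgar", "ass"]})
df_ed = pd.DataFrame({
    "id": [1, 2, 3],
    "text": [
        "This is a clean text. Lets assure",
        "There is a BAD word here",
        "A vulgar and offensive line",  # extra row, not in the question
    ],
})

# Non-capturing group, so findall returns the whole matched word for each hit
pat = rf"\b(?:{'|'.join(map(re.escape, df_profanity_en['word']))})\b"

# One list of matched words per row (an empty list when nothing matches)
matches = df_ed["text"].str.findall(pat, flags=re.IGNORECASE)

# Keep only the rows with at least one match
df_filtered = df_ed.assign(profane_words=matches)[matches.str.len() > 0]

print(df_filtered)
```

The matched words are returned as they appear in the text, so "BAD" keeps its capitalisation; add a lower-casing step if you need to compare them with the word list.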

huangapple
  • Posted on 2023-05-21 20:45:16
  • Link: https://go.coder-hub.com/76299970.html