Matching two dataframes with texts
Question
I have a dataframe called df_profanity_en that contains a set of words. I also have another dataframe called df_ed with three columns (id, text, predicted_emotion).
I would like to create a new dataframe called df_filtered that contains only the rows of df_ed whose text contains a word from df_profanity_en.
I have implemented the following code, but as you can see in the output, the first row is matched only because "assure" contains "ass", so it shouldn't be there. Any idea how to solve this issue?
import pandas as pd
# Create a sample df_profanity_en DataFrame
df_profanity_en = pd.DataFrame({
    'word': ['bad', 'offensive', 'curse', 'vulgar', 'ass']
})
# Create a sample df_ed DataFrame
df_ed = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'text': ['This is a clean text. Lets assure', 'There is a bad word here', 'No profanity in this one', 'Watch your language!', 'Another clean text'],
    'predicted_emotion': ['happy', 'sad', 'neutral', 'angry', 'happy']
})
# Create a list of profane words from df_profanity_en
profane_words = df_profanity_en['word'].tolist()
# Filter the texts in df_ed that contain profane words
df_filtered = df_ed[df_ed['text'].str.contains('|'.join(profane_words), case=False, na=False)]
# Add a new column containing the profanity word
df_filtered['profane_word'] = df_filtered['text'].apply(lambda text: next((word for word in profane_words if word.lower() in text.lower()), None))
# Optional: Reset the index if needed
df_filtered.reset_index(drop=True, inplace=True)
# Print the filtered DataFrame
df_filtered
Output:
id                               text  predicted_emotion  profane_word
 1  This is a clean text. Lets assure              happy           ass
 2           There is a bad word here                sad           bad
Answer 1
Score: 2
IIUC, you have to add word boundaries (\b) to the pattern so that only whole words match:
pat = r"{}".format("|".join(r"\b{}\b".format(word) for word in df_profanity_en["word"]))
#'\\bbad\\b|\\boffensive\\b|\\bcurse\\b|\\bvulgar\\b|\\bass\\b'
m = df_ed["text"].str.contains(pat)
df_filtered = df_ed.loc[m]
Output:
print(df_filtered)
   id                      text  predicted_emotion
1   2  There is a bad word here                sad
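To see why the boundaries matter, here is a minimal sketch using only Python's re module on the question's sample sentence. The names loose and strict are hypothetical, and re.escape and re.IGNORECASE are additions of mine (the latter mirrors the case=False from the question), not part of the answer above.

```python
import re

words = ["bad", "offensive", "curse", "vulgar", "ass"]
text = "This is a clean text. Lets assure"

# Plain alternation: any listed word may match anywhere, even inside a longer word.
loose = re.compile("|".join(words), flags=re.IGNORECASE)

# Same alternation wrapped in \b...\b: a match must start and end at a word boundary.
# re.escape (an assumption here) guards against words containing regex metacharacters.
strict = re.compile(
    r"\b(?:" + "|".join(map(re.escape, words)) + r")\b",
    flags=re.IGNORECASE,
)

loose_hit = loose.search(text)
strict_hit = strict.search(text)

print(loose_hit.group(0) if loose_hit else None)  # the "ass" inside "assure"
print(strict_hit)                                 # None: no whole-word match
```

The loose pattern finds "ass" inside "assure", while the bounded one finds nothing in this sentence, which is the behaviour the question asks for.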
Use extract to create the profane_word column:
df_filtered = (
    df_ed.assign(profane_word=df_ed["text"].str.extract(f"({pat})"))
    .dropna(subset="profane_word")
)
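Note that str.extract keeps only the first match in each text. If a text may contain several listed words and you want all of them, a possible variation is str.findall. This is a sketch under my own assumptions: the three sample sentences and the name all_hits are hypothetical, and the case-insensitive flag is not part of the answer above.

```python
import re
import pandas as pd

# Hypothetical sample texts, chosen so that one row contains two listed words.
texts = pd.Series(["A bad and vulgar line", "Lets assure", "All clean"])
words = ["bad", "offensive", "curse", "vulgar", "ass"]

# Non-capturing group so that findall returns the whole matched words.
pat = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"

# One list of matched words per row; rows without a match get an empty list.
all_hits = texts.str.findall(pat, flags=re.IGNORECASE)
print(all_hits.tolist())
```

Rows with an empty list can then be dropped with a boolean mask such as all_hits.str.len() > 0.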
Another variant (with a clearer pattern), used by @mozway:
import re
# Build the alternation first: reusing double quotes inside a double-quoted
# f-string is a syntax error before Python 3.12
words = "|".join(map(re.escape, df_profanity_en["word"]))
pat = fr"\b({words})\b"
# '\\b(bad|offensive|curse|vulgar|ass)\\b'
df_filtered = (
    df_ed.assign(profane_word=df_ed["text"].str.extract(pat))
    .dropna(subset="profane_word")
)
Output:
print(df_filtered)
   id                      text  predicted_emotion  profane_word
1   2  There is a bad word here                sad           bad
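The question's original code matched case-insensitively (case=False), which the patterns above do not. A minimal end-to-end sketch that keeps that behaviour could look like the following. It assumes the question's sample data, except that I capitalised "Bad" in the second text to show the effect; the matches variable, the lower-casing of the extracted word, and the list form of subset are my own choices rather than part of the answer.

```python
import re
import pandas as pd

df_profanity_en = pd.DataFrame({"word": ["bad", "offensive", "curse", "vulgar", "ass"]})
df_ed = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "text": [
        "This is a clean text. Lets assure",
        "There is a Bad word here",  # capitalised here (assumption) to exercise the flag
        "No profanity in this one",
        "Watch your language!",
        "Another clean text",
    ],
    "predicted_emotion": ["happy", "sad", "neutral", "angry", "happy"],
})

# Escape each word, then wrap the alternation in word boundaries.
pat = r"\b(" + "|".join(df_profanity_en["word"].map(re.escape)) + r")\b"

# expand=False returns a Series (there is a single capture group);
# re.IGNORECASE plays the role of case=False in str.contains.
matches = df_ed["text"].str.extract(pat, flags=re.IGNORECASE, expand=False)

df_filtered = (
    df_ed.assign(profane_word=matches.str.lower())
    .dropna(subset=["profane_word"])
    .reset_index(drop=True)
)
print(df_filtered)
```

Only the row with id 2 survives, and "assure" is no longer flagged.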