May 21, 2023, 20:45:16

Matching two dataframes with texts

Question

I have a dataframe called df_profanity_en that contains a set of words. I also have another dataframe called df_ed with three columns (id, text, predicted_emotion).

I would like to create a new dataframe called df_filtered that contains only the rows of df_ed whose text contains a word from df_profanity_en.

I have implemented the following code, but as you can see, the row containing the word "assure" shouldn't be there: it is matched only because "ass" occurs inside it. Any idea how to solve this issue?

import pandas as pd

# Create a sample df_profanity_en DataFrame
df_profanity_en = pd.DataFrame({
    'word': ['bad', 'offensive', 'curse', 'vulgar', 'ass']
})

# Create a sample df_ed DataFrame
df_ed = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'text': ['This is a clean text. Lets assure', 'There is a bad word here', 'No profanity in this one', 'Watch your language!', 'Another clean text'],
    'predicted_emotion': ['happy', 'sad', 'neutral', 'angry', 'happy']
})

# Create a list of profane words from df_profanity_en
profane_words = df_profanity_en['word'].tolist()

# Filter the texts in df_ed that contain profane words
df_filtered = df_ed[df_ed['text'].str.contains('|'.join(profane_words), case=False, na=False)]

# Add a new column containing the profanity word
df_filtered['profane_word'] = df_filtered['text'].apply(lambda text: next((word for word in profane_words if word.lower() in text.lower()), None))

# Optional: Reset the index if needed
df_filtered.reset_index(drop=True, inplace=True)

# Print the filtered DataFrame
df_filtered

Output:

id  text                               predicted_emotion  profane_word
1   This is a clean text. Lets assure  happy              ass
2   There is a bad word here           sad                bad

Answer 1

Score: 2

IIUC, you need to add word boundaries (\b) to the regex so that only whole words match:

pat = r"{}".format("|".join(r"\b{}\b".format(word) for word in df_profanity_en["word"]))
# '\\bbad\\b|\\boffensive\\b|\\bcurse\\b|\\bvulgar\\b|\\bass\\b'

m = df_ed["text"].str.contains(pat)

df_filtered = df_ed.loc[m]

Output:

print(df_filtered)

   id                      text predicted_emotion
1   2  There is a bad word here               sad
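
To see why the boundary matters, here is a minimal sketch using only the standard re module on the first sample text from the question. The names substring_pat and whole_word_pat are hypothetical, and re.IGNORECASE is used on the assumption that the case-insensitive behaviour of the question's case=False should be kept:

```python
import re

# Word list and sample text taken from the question.
words = ["bad", "offensive", "curse", "vulgar", "ass"]
text = "This is a clean text. Lets assure"

# Plain alternation: matches anywhere, including inside longer words.
substring_pat = "|".join(map(re.escape, words))

# Same alternation wrapped in word boundaries: whole words only.
whole_word_pat = r"\b(?:" + substring_pat + r")\b"

substring_hit = re.search(substring_pat, text, flags=re.IGNORECASE)
whole_word_hit = re.search(whole_word_pat, text, flags=re.IGNORECASE)

print(substring_hit.group())  # the "ass" inside "assure"
print(whole_word_hit)         # None: no whole-word match in this text
```

The unbounded pattern finds "ass" inside "assure", which is the false positive from the question, while the bounded pattern finds nothing in that text.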

Use extract to create the profane_word column:

df_filtered = (
    df_ed.assign(profane_word=df_ed["text"].str.extract(f"({pat})"))
        .dropna(subset="profane_word")
)
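
As a side note on what str.extract returns: with a single capture group it yields a one-column DataFrame by default, or a Series when expand=False is passed. The sketch below illustrates this on a small hypothetical Series (texts) with a shortened two-word pattern; it is only meant to show the return shapes, not to replace the code above:

```python
import pandas as pd

# Hypothetical two-row sample, not the full df_ed from the question.
texts = pd.Series(["There is a bad word here", "Another clean text"])

# Default expand=True: a DataFrame with one column per capture group.
as_frame = texts.str.extract(r"\b(bad|ass)\b")

# expand=False with a single group: a Series aligned with the input.
as_series = texts.str.extract(r"\b(bad|ass)\b", expand=False)

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__)
print(as_series.tolist())
```

Rows without a match come back as missing values, which is why the dropna step afterwards keeps only the texts that contain a listed word.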

Another variant (with a clearer pattern), used by @mozway:

import re

pat = fr"\b({'|'.join(map(re.escape, df_profanity_en['word']))})\b"
# '\\b(bad|offensive|curse|vulgar|ass)\\b'

df_filtered = (
    df_ed.assign(profane_word=df_ed["text"].str.extract(pat))
        .dropna(subset="profane_word")
)

Output:

print(df_filtered)

   id                      text predicted_emotion profane_word
1   2  There is a bad word here               sad          bad
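
The question's original filter used case=False, which the patterns above do not carry over. A possible way to keep that behaviour, sketched here under the assumption that case-insensitive matching is still wanted, is to pass flags=re.IGNORECASE to str.extract. The sample rows are a shortened, hypothetical variant of df_ed with "BAD" in upper case, and lowering the extracted word is an illustrative choice rather than part of the answer:

```python
import re
import pandas as pd

df_profanity_en = pd.DataFrame({"word": ["bad", "offensive", "curse", "vulgar", "ass"]})

# Hypothetical variant of df_ed: fewer rows, and "BAD" written in upper case.
df_ed = pd.DataFrame({
    "id": [1, 2, 3],
    "text": [
        "This is a clean text. Lets assure",
        "There is a BAD word here",
        "Another clean text",
    ],
    "predicted_emotion": ["happy", "sad", "happy"],
})

# Whole-word pattern built with plain concatenation.
pat = r"\b(" + "|".join(map(re.escape, df_profanity_en["word"])) + r")\b"

# expand=False returns a Series; IGNORECASE mirrors the question's case=False.
matched = df_ed["text"].str.extract(pat, flags=re.IGNORECASE, expand=False)

df_filtered = (
    df_ed.assign(profane_word=matched.str.lower())  # normalise to the list's casing
    .dropna(subset=["profane_word"])
    .reset_index(drop=True)
)

print(df_filtered)
```

Only the row with id 2 survives, and "assure" is still ignored because the boundaries are unaffected by the flag.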
