Is there an efficient way to use pandas row values to perform `str.count` on another dataframe?
Question
I have the dataframe final that I constructed in the following way -
import pandas as pd
import re

# Create the pandas DataFrame
data = ['mechanical@engineer plays with machines', 'field engineer works with oil pumps',
        'lab_scientist trains a rat that plays the banjo', 'doctor kills patients',
        'computer-engineer creates killing AI', 'scientist/engineer publishes nothing']
df = pd.DataFrame(data, columns=['Job'])

# One row per distinct word, with the number of times it occurs in df
final = df['Job'].str.split().explode().value_counts().reset_index()
final.columns = ['words', 'number']
print(final.head())
I also have another dataframe, df_new -
# Create the pandas DataFrame
new_data = ['mechanical@engineer plays with machines while learning mechanics',
            'field engineer works with oil pumps and gas cylinders', 'lab_scientist trains a rat',
            'doctor kills all', 'computer-engineer creates a conscious AI',
            'scientist/engineer publishes something']
df_new = pd.DataFrame(new_data, columns=['Job'])
I'd like to count the number of times each word in the words column of the dataframe final appears in df_new.
Here's how I did it with a for loop -
final.reset_index(drop=True, inplace=True)
df_list = []
for index, row in final.iterrows():
    # Match the word only at word boundaries
    keyword_pattern = rf"\b{re.escape(row['words'])}\b"
    # Total occurrences across all rows of df_new
    foo = df_new.Job.str.count(keyword_pattern).sum()
    df_list.append(foo)
final['new_col'] = df_list
print(final.head())
Is there a more efficient way to do it, perhaps without a for loop? I was expecting a similar post on SO regarding this, but couldn't find any.
Here is the entire code for convenience -
import pandas as pd
import re

# Create the pandas DataFrame
data = ['mechanical@engineer plays with machines', 'field engineer works with oil pumps',
        'lab_scientist trains a rat that plays the banjo', 'doctor kills patients',
        'computer-engineer creates killing AI', 'scientist/engineer publishes nothing']
df = pd.DataFrame(data, columns=['Job'])

# One row per distinct word, with the number of times it occurs in df
final = df['Job'].str.split().explode().value_counts().reset_index()
final.columns = ['words', 'number']
print(final.head())

# Create the second pandas DataFrame
new_data = ['mechanical@engineer plays with machines while learning mechanics',
            'field engineer works with oil pumps and gas cylinders', 'lab_scientist trains a rat',
            'doctor kills all', 'computer-engineer creates a conscious AI',
            'scientist/engineer publishes something']
df_new = pd.DataFrame(new_data, columns=['Job'])

final.reset_index(drop=True, inplace=True)
df_list = []
for index, row in final.iterrows():
    # Match the word only at word boundaries
    keyword_pattern = rf"\b{re.escape(row['words'])}\b"
    # Total occurrences across all rows of df_new
    foo = df_new.Job.str.count(keyword_pattern).sum()
    df_list.append(foo)
final['new_col'] = df_list
print(final.head())
Answer 1
Score: 1
You can join everything into one string and then use the count method of str:
entire_string = ','.join(df_new['Job'])
final['new_col'] = final['words'].apply(lambda x: entire_string.count(x))
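One thing worth checking with this approach: str.count counts substring occurrences, not whole words, so a short word such as a is also counted inside longer words. The sketch below is only an illustration under the assumption that whole-word counts are wanted, as in the question's loop. count_whole_word is a hypothetical helper, not part of the answer above.

```python
import re
import pandas as pd

new_data = ['mechanical@engineer plays with machines while learning mechanics',
            'field engineer works with oil pumps and gas cylinders',
            'lab_scientist trains a rat', 'doctor kills all',
            'computer-engineer creates a conscious AI',
            'scientist/engineer publishes something']
df_new = pd.DataFrame(new_data, columns=['Job'])

# Join every row into one string, as in the answer above
entire_string = ','.join(df_new['Job'])


def count_whole_word(word, text):
    """Hypothetical helper: count occurrences of `word` bounded by \\b on both sides."""
    pattern = rf"\b{re.escape(word)}\b"
    return len(re.findall(pattern, text))


# Plain substring count: also hits the 'a' inside 'plays', 'machines', ...
substring_a = entire_string.count('a')
# Word-boundary count: only the standalone word 'a'
whole_word_a = count_whole_word('a', entire_string)
print(substring_a, whole_word_a)
```

With the helper, the answer's one-liner would become final['words'].apply(lambda x: count_whole_word(x, entire_string)), which still runs one regex search per word but avoids iterrows.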
Answer 2
Score: 1
You can use the regex module, which supports overlapping matches. This allows you to build a single pattern. You can .merge the result back into final.
import re
import regex

# Escape each word, wrap it in word boundaries, and join into one alternation
all_words = (re.escape(word) for word in final['words'])
all_words = (rf'\b{word}\b' for word in all_words)
all_words = regex.compile(f'''({'|'.join(all_words)})''')

# Find every (possibly overlapping) match per row, then tally the matches
counts = (
    df_new['Job']
    .map(lambda row: all_words.findall(row, overlapped=True))
    .explode().value_counts()
)

print(
    pd.merge(final, counts, how='left', left_on='words', right_on='Job')
)
                  words  number  new_col  count
0                  with       2        2    2.0
1                 plays       2        1    1.0
2   mechanical@engineer       1        1    1.0
3                   the       1        0    NaN
4             publishes       1        1    1.0
5    scientist/engineer       1        1    1.0
6                    AI       1        1    1.0
7               killing       1        0    NaN
8               creates       1        1    1.0
9     computer-engineer       1        1    1.0
10             patients       1        0    NaN
11                kills       1        1    1.0
12               doctor       1        1    1.0
13                banjo       1        0    NaN
14                 that       1        0    NaN
15                  rat       1        1    1.0
16                    a       1        2    2.0
17               trains       1        1    1.0
18        lab_scientist       1        1    1.0
19                pumps       1        1    1.0
20                  oil       1        1    1.0
21                works       1        1    1.0
22             engineer       1        4    4.0
23                field       1        1    1.0
24             machines       1        1    1.0
25              nothing       1        0    NaN
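If installing the third-party regex module is not an option, a similar single-pattern pass can be sketched with the standard re module by wrapping the alternation in a lookahead, so that a match starting inside another match (engineer inside mechanical@engineer) is still found. This is an assumption-laden sketch rather than a drop-in replacement: at most one alternative is reported per start position, so two listed words that begin at the same position (for example computer and computer-engineer) would not both be counted. The names alternatives, pattern and counter are introduced here for illustration.

```python
import re
from collections import Counter

import pandas as pd

data = ['mechanical@engineer plays with machines', 'field engineer works with oil pumps',
        'lab_scientist trains a rat that plays the banjo', 'doctor kills patients',
        'computer-engineer creates killing AI', 'scientist/engineer publishes nothing']
new_data = ['mechanical@engineer plays with machines while learning mechanics',
            'field engineer works with oil pumps and gas cylinders',
            'lab_scientist trains a rat', 'doctor kills all',
            'computer-engineer creates a conscious AI',
            'scientist/engineer publishes something']
df = pd.DataFrame(data, columns=['Job'])
df_new = pd.DataFrame(new_data, columns=['Job'])

final = df['Job'].str.split().explode().value_counts().reset_index()
final.columns = ['words', 'number']

# One alternation of all escaped words, wrapped in a zero-width lookahead.
# The lookahead consumes no text, so the scan resumes at the next position
# and can report a word that starts inside a previously reported word.
alternatives = '|'.join(re.escape(word) for word in final['words'])
pattern = re.compile(rf'(?=(\b(?:{alternatives})\b))')

# Tally the captured word from every match in every row of df_new
counter = Counter()
for text in df_new['Job']:
    counter.update(pattern.findall(text))

# Words never seen in df_new get 0 instead of NaN
final['new_col'] = final['words'].map(lambda word: counter.get(word, 0))
print(final)
```

Mapping through the Counter with a default of 0 also avoids the NaN entries that the left merge produces for words missing from df_new.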