

Is there an efficient way to use pandas row values to perform `str.count` on another dataframe?




I have a dataframe, `final`, that I constructed as follows:

import pandas as pd
import re

data = ['mechanical@engineer plays with machines', 'field engineer works with oil pumps',
        'lab_scientist trains a rat that plays the banjo', 'doctor kills patients',
        'computer-engineer creates killing AI', 'scientist/engineer publishes nothing']

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Job'])
# Split each description on whitespace and count how often each token occurs
final = df['Job'].str.split().explode().value_counts().reset_index()
final.columns = ['words', 'number']

I also have another dataframe, `df_new`:

new_data = ['mechanical@engineer plays with machines while learning mechanics',
            'field engineer works with oil pumps and gas cylinders', 'lab_scientist trains a rat',
            'doctor kills all', 'computer-engineer creates a conscious AI',
            'scientist/engineer publishes something']

# Create the pandas DataFrame
df_new = pd.DataFrame(new_data, columns=['Job'])

I'd like to count the number of times each word in the column `words` of the dataframe `final` appears in `df_new`.

Here's how I did it with a for loop:

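final.reset_index(drop=True, inplace=True)
df_list = []
for index, row in final.iterrows():
    # Match the word only as a whole word
    keyword_pattern = rf"\b{re.escape(row['words'])}\b"
    # Total occurrences of the word across all rows of df_new
    df_list.append(df_new['Job'].str.count(keyword_pattern).sum())

final['new_col'] = df_list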


Is there a more efficient way to do it, perhaps without a for loop? I was expecting a similar post on SO regarding this, but couldn't find any.

Here is the entire code for convenience -

import pandas as pd
import re

data = ['mechanical@engineer plays with machines', 'field engineer works with oil pumps',
        'lab_scientist trains a rat that plays the banjo', 'doctor kills patients',
        'computer-engineer creates killing AI', 'scientist/engineer publishes nothing']

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Job'])
final = df['Job'].str.split().explode().value_counts().reset_index()
final.columns = ['words', 'number']

new_data = ['mechanical@engineer plays with machines while learning mechanics',
            'field engineer works with oil pumps and gas cylinders', 'lab_scientist trains a rat',
            'doctor kills all', 'computer-engineer creates a conscious AI',
            'scientist/engineer publishes something']

# Create the pandas DataFrame
df_new = pd.DataFrame(new_data, columns=['Job'])

final.reset_index(drop=True, inplace=True)
df_list = []
for index, row in final.iterrows():
    # Match the word only as a whole word
    keyword_pattern = rf"\b{re.escape(row['words'])}\b"
    # Total occurrences of the word across all rows of df_new
    df_list.append(df_new['Job'].str.count(keyword_pattern).sum())

final['new_col'] = df_list


Score: 1

You can join everything into one string and then use the `count` method of `str`:

entire_string = ','.join(df_new['Job'])
final['new_col'] = final['words'].apply(lambda x: entire_string.count(x))
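
One thing to keep in mind: `str.count` counts substrings, so a short word such as `a` is also counted inside `trains`, `rat`, and so on, whereas the loop in the question matches whole words with `\b`. A minimal sketch of the difference, using only the standard library and a shortened sample of the data (`text`, `substring_hits`, and `whole_word_hits` are illustrative names, not part of the original code):

```python
import re

# Shortened sample, joined with commas as in the answer above.
text = ','.join([
    'lab_scientist trains a rat',
    'computer-engineer creates a conscious AI',
])

# Plain substring counting: every lowercase 'a' character is counted.
substring_hits = text.count('a')

# Whole-word counting, using the same pattern as the question's loop.
whole_word_hits = len(re.findall(rf"\b{re.escape('a')}\b", text))

print(substring_hits, whole_word_hits)
```

If whole-word counts are what you need, the substring approach will likely overcount short words, so it is worth checking a few rows against the loop's output first.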


Score: 1


You can use the third-party `regex` module, which supports overlapping matches.

This allows you to build a single pattern covering every word, instead of running one pattern per word.

You can then `.merge` the resulting counts back into `final`.

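import regex

# Escape each word, wrap it in word boundaries, and join everything into one alternation
all_words = (re.escape(word) for word in final['words'])
all_words = (rf'\b{word}\b' for word in all_words)
all_words = regex.compile(f'''({'|'.join(all_words)})''')

# Find every (possibly overlapping) match per row, then tally the matches across rows
counts = (
   df_new['Job']
      .map(lambda row: all_words.findall(row, overlapped=True))
      .explode()
      .value_counts()
      .rename_axis('Job')
      .rename('count')
)

pd.merge(final, counts, how='left', left_on='words', right_on='Job')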
                  words  number  new_col  count
0                  with       2        2    2.0
1                 plays       2        1    1.0
2   mechanical@engineer       1        1    1.0
3                   the       1        0    NaN
4             publishes       1        1    1.0
5    scientist/engineer       1        1    1.0
6                    AI       1        1    1.0
7               killing       1        0    NaN
8               creates       1        1    1.0
9     computer-engineer       1        1    1.0
10             patients       1        0    NaN
11                kills       1        1    1.0
12               doctor       1        1    1.0
13                banjo       1        0    NaN
14                 that       1        0    NaN
15                  rat       1        1    1.0
16                    a       1        2    2.0
17               trains       1        1    1.0
18        lab_scientist       1        1    1.0
19                pumps       1        1    1.0
20                  oil       1        1    1.0
21                works       1        1    1.0
22             engineer       1        4    4.0
23                field       1        1    1.0
24             machines       1        1    1.0
25              nothing       1        0    NaN
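
If installing `regex` is not an option, a similar overlapping count can be sketched with the standard-library `re` module by wrapping the alternation in a zero-width lookahead. This is an alternative technique, not the method used in the answer above; `words`, `sentences`, and `alternatives` are illustrative names and the data is a shortened sample of the question's.

```python
import re
from collections import Counter

# Shortened, illustrative sample of the words and sentences from the question.
words = ['engineer', 'mechanical@engineer', 'a', 'the', 'with']
sentences = [
    'mechanical@engineer plays with machines while learning mechanics',
    'field engineer works with oil pumps and gas cylinders',
    'lab_scientist trains a rat',
]

# Longest alternatives first, so a longer word is preferred over a shorter
# one that starts at the same position.
alternatives = '|'.join(
    re.escape(w) for w in sorted(words, key=len, reverse=True)
)

# The lookahead consumes no characters, so the scan resumes one character
# later and can still find 'engineer' inside 'mechanical@engineer'.
pattern = re.compile(rf'(?=\b({alternatives})\b)')

counts = Counter()
for sentence in sentences:
    counts.update(pattern.findall(sentence))

print(counts)
```

Only one alternative is reported per start position, so two listed words that begin at the same character would not both be counted. A `Counter` returns 0 for missing keys, so something like `final['words'].map(counts)` should give 0 rather than `NaN` for absent words, though that is worth confirming on your pandas version.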

  • Posted on April 11, 2023, 06:51:03
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75981286.html



