Is there an efficient way to use pandas row values to perform `str.count` on another dataframe?

Question

I have a dataframe, final, that I constructed as follows:

import pandas as pd
import re

data = ['mechanical@engineer plays with machines','field engineer works with oil pumps','lab_scientist trains a rat that plays the banjo','doctor kills patients',
        'computer-engineer creates killing AI','scientist/engineer publishes nothing']
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Job'])
# Split each job description on whitespace and count how often each word occurs
final = df['Job'].str.split().explode().value_counts().reset_index()
final.columns = ['words', 'number']
print(final.head())

I also have another dataframe, df_new:

new_data = ['mechanical@engineer plays with machines while learning mechanics','field engineer works with oil pumps and gas cylinders','lab_scientist trains a rat',
             'doctor kills all','computer-engineer creates a conscious AI','scientist/engineer publishes something']
# Create the pandas DataFrame
df_new = pd.DataFrame(new_data, columns=['Job'])

I'd like to count the number of times each word in the words column of final appears in df_new.

Here's how I did it with a for loop:

final.reset_index(drop=True, inplace=True)
df_list = []
for _, row in final.iterrows():
    # Match the word only as a whole word; escape characters such as @, / and -
    keyword_pattern = rf"\b{re.escape(row['words'])}\b"
    # Count the matches in df_new (not df) and sum them over all rows
    foo = df_new.Job.str.count(keyword_pattern).sum()
    df_list.append(foo)

final['new_col'] = df_list

print(final.head())

Is there a more efficient way to do it, perhaps without a for loop? I was expecting a similar post on SO regarding this, but couldn't find any.

Here is the entire code for convenience -

import pandas as pd
import re

data = ['mechanical@engineer plays with machines','field engineer works with oil pumps','lab_scientist trains a rat that plays the banjo','doctor kills patients',
        'computer-engineer creates killing AI','scientist/engineer publishes nothing']
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Job'])
final = df['Job'].str.split().explode().value_counts().reset_index()
final.columns = ['words', 'number']
print(final.head())

new_data = ['mechanical@engineer plays with machines while learning mechanics','field engineer works with oil pumps and gas cylinders','lab_scientist trains a rat',
             'doctor kills all','computer-engineer creates a conscious AI','scientist/engineer publishes something']
# Create the pandas DataFrame
df_new = pd.DataFrame(new_data, columns=['Job'])

final.reset_index(drop=True, inplace=True)
df_list = []
for _, row in final.iterrows():
    keyword_pattern = rf"\b{re.escape(row['words'])}\b"
    # Count the matches in df_new (not df) and sum them over all rows
    foo = df_new.Job.str.count(keyword_pattern).sum()
    df_list.append(foo)

final['new_col'] = df_list
print(final.head())

Answer 1

Score: 1

You can join all the rows into a single string and then use Python's built-in `str.count` method:

entire_string = ','.join(df_new['Job'])
final['new_col'] = final['words'].apply(lambda x: entire_string.count(x))
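
One thing to keep in mind is that `str.count` on a plain Python string counts substrings rather than whole words, so a short word such as `a` is also found inside longer words. The sketch below compares the two behaviours and shows one possible way to keep the `\b...\b` semantics of the loop in the question. The small dataframes and the column names `substring_count` and `word_count` are made up for illustration and are not part of the original post.

```python
import re
import pandas as pd

# Abbreviated stand-ins for the question's data (illustration only).
df_new = pd.DataFrame(
    {"Job": ["field engineer works with oil pumps", "lab_scientist trains a rat"]}
)
final = pd.DataFrame({"words": ["with", "a", "engineer", "banjo"]})

# The comma separator keeps words from adjacent rows from running together.
entire_string = ",".join(df_new["Job"])

# Plain substring counting: "a" is also found inside "lab_scientist", "trains", "rat".
final["substring_count"] = final["words"].apply(lambda w: entire_string.count(w))

# Whole-word counting with the same \b...\b pattern as the loop in the question.
final["word_count"] = final["words"].apply(
    lambda w: len(re.findall(rf"\b{re.escape(w)}\b", entire_string))
)
print(final)
```

This still runs one regex search per word, but over a single joined string instead of one pandas `str.count` pass per word.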

Answer 2

Score: 1

You can use the third-party regex module, which supports overlapping matches.

This allows you to build a single pattern.

You can .merge the result back into final.

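import re      # re.escape is used below (already imported in the question)
import regex   # third-party module, installed separately from the standard library

# Escape each word, wrap it in word boundaries, then OR everything into one pattern
all_words = (re.escape(word) for word in final['words'])
all_words = (rf'\b{word}\b' for word in all_words)
all_words = regex.compile(f'''({'|'.join(all_words)})''')

# overlapped=True lets "engineer" also match inside "mechanical@engineer"
counts = (
   df_new['Job']
      .map(lambda row: all_words.findall(row, overlapped=True))
      .explode().value_counts()
)

# The left merge keeps every word in final; words with no match get NaN
print(
   pd.merge(final, counts, how='left', left_on='words', right_on='Job')
)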
                  words  number  new_col  count
0                  with       2        2    2.0
1                 plays       2        1    1.0
2   mechanical@engineer       1        1    1.0
3                   the       1        0    NaN
4             publishes       1        1    1.0
5    scientist/engineer       1        1    1.0
6                    AI       1        1    1.0
7               killing       1        0    NaN
8               creates       1        1    1.0
9     computer-engineer       1        1    1.0
10             patients       1        0    NaN
11                kills       1        1    1.0
12               doctor       1        1    1.0
13                banjo       1        0    NaN
14                 that       1        0    NaN
15                  rat       1        1    1.0
16                    a       1        2    2.0
17               trains       1        1    1.0
18        lab_scientist       1        1    1.0
19                pumps       1        1    1.0
20                  oil       1        1    1.0
21                works       1        1    1.0
22             engineer       1        4    4.0
23                field       1        1    1.0
24             machines       1        1    1.0
25              nothing       1        0    NaN
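
If installing the regex module is not an option, a similar effect can probably be had with the standard re module by wrapping the alternation in a lookahead, which makes every match zero-width and so allows overlapping hits. The following is only a sketch under that assumption, not the answer's own method. It uses abbreviated stand-in data, and it fills unmatched words with 0 instead of NaN via `Series.map` rather than `pd.merge`.

```python
import re
import pandas as pd

# Abbreviated stand-ins for the question's data (illustration only).
df_new = pd.DataFrame(
    {"Job": ["mechanical@engineer plays with machines",
             "field engineer works with oil pumps"]}
)
final = pd.DataFrame({"words": ["with", "mechanical@engineer", "engineer", "banjo"]})

# Wrap the alternation in a lookahead so each match is zero-width; the scan
# then advances one character at a time and can report "engineer" inside
# "mechanical@engineer" as well.
alternatives = "|".join(rf"\b{re.escape(w)}\b" for w in final["words"])
pattern = re.compile(rf"(?=({alternatives}))")

counts = (
    df_new["Job"]
    .map(pattern.findall)  # list of matched words per row
    .explode()
    .value_counts()
)

# Map the counts back onto final; words with no match become 0 instead of NaN.
final["count"] = final["words"].map(counts).fillna(0).astype(int)
print(final)
```

As with the overlapped search, at most one alternative is reported per starting position, so two listed words that begin at the same character (for example a word and a longer word that starts with it) would be counted only once there.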

huangapple
  • Published on April 11, 2023, 06:51:03
  • When reposting, please retain the link to this article: https://go.coder-hub.com/75981286.html