April 11, 2023 06:51:03

Is there an efficient way to use pandas row values to perform `str.count` on another dataframe?

Question

I have the dataframe final that I constructed in the following way -

import pandas as pd
import re

data = ['mechanical@engineer plays with machines', 'field engineer works with oil pumps',
        'lab_scientist trains a rat that plays the banjo', 'doctor kills patients',
        'computer-engineer creates killing AI', 'scientist/engineer publishes nothing']
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Job'])
# Split on whitespace and count how often each token occurs
final = df['Job'].str.split().explode().value_counts().reset_index()
final.columns = ['words', 'number']
print(final.head())

I also have another dataframe, df_new -

new_data = ['mechanical@engineer plays with machines while learning mechanics',
            'field engineer works with oil pumps and gas cylinders', 'lab_scientist trains a rat',
            'doctor kills all', 'computer-engineer creates a conscious AI',
            'scientist/engineer publishes something']
# Create the pandas DataFrame
df_new = pd.DataFrame(new_data, columns=['Job'])

I'd like to count the number of times each word in the words column of the dataframe final appears in df_new.

Here's how I did it with a for loop -

final.reset_index(drop=True, inplace=True)
df_list = []
for index, row in final.iterrows():
    # Match the word only as a whole word; escape characters such as '@' or '/'
    keyword_pattern = rf"\b{re.escape(row['words'])}\b"
    # Count in df_new (not df), since that is the dataframe being searched
    foo = df_new.Job.str.count(keyword_pattern).sum()
    df_list.append(foo)

final['new_col'] = df_list

print(final.head())

Is there a more efficient way to do it, perhaps without a for loop? I was expecting a similar post on SO regarding this, but couldn't find any.

Here is the entire code for convenience -

import pandas as pd
import re

data = ['mechanical@engineer plays with machines', 'field engineer works with oil pumps',
        'lab_scientist trains a rat that plays the banjo', 'doctor kills patients',
        'computer-engineer creates killing AI', 'scientist/engineer publishes nothing']
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Job'])
final = df['Job'].str.split().explode().value_counts().reset_index()
final.columns = ['words', 'number']
print(final.head())

new_data = ['mechanical@engineer plays with machines while learning mechanics',
            'field engineer works with oil pumps and gas cylinders', 'lab_scientist trains a rat',
            'doctor kills all', 'computer-engineer creates a conscious AI',
            'scientist/engineer publishes something']
# Create the pandas DataFrame
df_new = pd.DataFrame(new_data, columns=['Job'])

final.reset_index(drop=True, inplace=True)
df_list = []
for index, row in final.iterrows():
    keyword_pattern = rf"\b{re.escape(row['words'])}\b"
    # Count in df_new (not df), since that is the dataframe being searched
    foo = df_new.Job.str.count(keyword_pattern).sum()
    df_list.append(foo)

final['new_col'] = df_list
print(final.head())

Answer 1

Score: 1

You can join everything into one string and then use the count method of str:

entire_string = ','.join(df_new['Job'])
final['new_col'] = final['words'].apply(lambda x: entire_string.count(x))
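
One caveat worth checking: the count method of str counts substring occurrences, so a short word such as a is also counted inside longer words, whereas the loop in the question matches whole words with \b. Below is a minimal sketch of the difference. It assumes the same df_new as in the question; the sample words and the names words, substring_counts and whole_word_counts are illustrative only and not part of the original answer.

```python
import re
import pandas as pd

new_data = ['mechanical@engineer plays with machines while learning mechanics',
            'field engineer works with oil pumps and gas cylinders',
            'lab_scientist trains a rat',
            'doctor kills all',
            'computer-engineer creates a conscious AI',
            'scientist/engineer publishes something']
df_new = pd.DataFrame(new_data, columns=['Job'])

# Hypothetical sample of words to look up (stands in for final['words'])
words = pd.Series(['a', 'engineer', 'with', 'banjo'], name='words')

# The comma is a non-word character, so word boundaries between rows survive the join
entire_string = ','.join(df_new['Job'])

# Plain substring counting: 'a' is also found inside 'plays', 'machines', ...
substring_counts = words.apply(entire_string.count)

# Whole-word counting on the joined string, mirroring the \b pattern from the question
whole_word_counts = words.apply(
    lambda w: len(re.findall(rf"\b{re.escape(w)}\b", entire_string))
)

print(pd.DataFrame({'words': words,
                    'substring': substring_counts,
                    'whole_word': whole_word_counts}))
```

This still runs one search per word, but each search scans a single joined string instead of going through a pandas string operation per row.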

Answer 2

Score: 1


You can use the regex module which supports overlapping matches.

This allows you to build a single pattern.

You can .merge the result back into final.

import re      # used for re.escape
import regex   # third-party module; supports overlapped matching

# Escape each word, wrap it in word boundaries, and join into one alternation
all_words = (re.escape(word) for word in final['words'])
all_words = (rf'\b{word}\b' for word in all_words)
all_words = regex.compile(f'''({'|'.join(all_words)})''')

# Find every match per row, then count how often each word was found
counts = (
   df_new['Job']
      .map(lambda row: all_words.findall(row, overlapped=True))
      .explode().value_counts()
)

print(
   pd.merge(final, counts, how='left', left_on='words', right_on='Job')
)

                  words  number  new_col  count
0                  with       2        2    2.0
1                 plays       2        1    1.0
2   mechanical@engineer       1        1    1.0
3                   the       1        0    NaN
4             publishes       1        1    1.0
5    scientist/engineer       1        1    1.0
6                    AI       1        1    1.0
7               killing       1        0    NaN
8               creates       1        1    1.0
9     computer-engineer       1        1    1.0
10             patients       1        0    NaN
11                kills       1        1    1.0
12               doctor       1        1    1.0
13                banjo       1        0    NaN
14                 that       1        0    NaN
15                  rat       1        1    1.0
16                    a       1        2    2.0
17               trains       1        1    1.0
18        lab_scientist       1        1    1.0
19                pumps       1        1    1.0
20                  oil       1        1    1.0
21                works       1        1    1.0
22             engineer       1        4    4.0
23                field       1        1    1.0
24             machines       1        1    1.0
25              nothing       1        0    NaN
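
The NaN values in the count column mark words that never occur in df_new, because the left merge finds no matching row for them. If integer zeros are preferred, one option is to map the counts onto the words and fill the gaps. This is a small sketch with hypothetical toy data standing in for final and counts, not the answer's own code:

```python
import pandas as pd

# Toy stand-ins for the real objects (assumed shapes: a 'words' column and
# a Series of counts indexed by word)
final = pd.DataFrame({'words': ['with', 'the', 'engineer'], 'number': [2, 1, 1]})
counts = pd.Series({'with': 2, 'engineer': 4}, name='count')

# Look up each word in counts; words that are absent become NaN, then 0
final['count'] = final['words'].map(counts).fillna(0).astype(int)

print(final)
```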
