May 11, 2023, 20:12:24

Can the .loc command be used with groupby's apply function?

Question

Hi. I have a DataFrame with several columns, and I have successfully assigned values to a new column (sent_inclination) based on a condition. I now want the assigned value to apply to the entire group identified by the column "id". The condition is: if "freq_cent" > 0.5, the new column should take the value of 'Sentiment' for that id. In other words, whether freq_cent is above or below 0.5 for a given row, the new column should contain the Sentiment value that exceeds 0.5 within that row's group.
I can assign values across the whole DataFrame, but I cannot make the values the same within each group.

Here is my sample DataFrame:

import pandas as pd

data = {'id': ['205', '205', '204', '204', '204'],
        'Sentiment': ['Positive', 'Positive', 'Neutral', 'Positive', 'Positive']}
df = pd.DataFrame(data)
# number of distinct ids that share each Sentiment
df['freq'] = df.groupby('Sentiment')['id'].transform(pd.Series.nunique)
# number of rows per id
df['freq_sum'] = df.groupby('id')['freq'].transform(pd.Series.count)
df['freq_cent'] = df['freq'] / df['freq_sum']

If I then apply the code:

df['sent_inclination'] = df.loc[df['freq_cent'] > 0.5, ['Sentiment']]

I get the output:

    id Sentiment  freq  freq_sum  freq_cent sent_inclination
0  205  Positive     2         2   1.000000         Positive
1  205  Positive     2         2   1.000000         Positive
2  204   Neutral     1         3   0.333333              NaN
3  204  Positive     2         3   0.666667         Positive
4  204  Positive     2         3   0.666667         Positive

The desired output should have 'sent_inclination' as Positive for all observations where id is 204, that is:

    id Sentiment  freq  freq_sum  freq_cent sent_inclination
0  205  Positive     2         2   1.000000         Positive
1  205  Positive     2         2   1.000000         Positive
2  204   Neutral     1         3   0.333333         Positive
3  204  Positive     2         3   0.666667         Positive
4  204  Positive     2         3   0.666667         Positive

How can I achieve this? Any suggestions will be highly appreciated. Unfortunately the groupby.filter method doesn't work for me.

So far I have tried several approaches, none of which worked, including:

df['sent_inclination'] = df.loc[df.groupby('id').apply(lambda x: df.loc[df['freq_cent'] > 0.5, df['Sentiment']])]
df['sent_inclination'] = df.groupby('id').apply(lambda x: (df.query('freq_cent > 0.5')['Sentiment']))
df.groupby('id').apply(lambda x: x['sent_inclination'] == x['Sentiment'] if (x['freq_cent'] > 0.5) else '')
df.groupby('id').apply(lambda x: x['sent_inclination'] == (df.query('freq_cent > 0.5')['Sentiment']))

Answer 1

Score: 1

I recommend using groupby from pandas and where from numpy:

import pandas as pd
import numpy as np
# attach the maximum freq_cent per id to every row (use "mean" instead of "max" to get the group average);
# the merge suffixes the original column as freq_cent_x and the group maximum as freq_cent_y
df = pd.merge(df, df.groupby(['id'])['freq_cent'].max().reset_index(), on='id', how='left')
# if the group maximum exceeds 0.5, label the row 'Positive' (hard-coded for this sample); otherwise keep the row's own Sentiment
df['sent_inclination'] = np.where(df['freq_cent_y'] > 0.5, 'Positive', df['Sentiment'])
# keep only the columns of interest
df = df[['id', 'freq', 'freq_sum', 'freq_cent_x', 'sent_inclination']]

Output:

print(df)
   id  freq  freq_sum  freq_cent_x sent_inclination
0  205     2         2     1.000000         Positive
1  205     2         2     1.000000         Positive
2  204     1         3     0.333333         Positive
3  204     2         3     0.666667         Positive
4  204     2         3     0.666667         Positive
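
The merge-based snippet hard-codes 'Positive', which happens to match the sample data. As a hedged alternative, the sketch below uses .loc to look up the Sentiment of the row with the highest freq_cent in each id and then broadcasts it to the whole group. The names top_label, lookup, and group_max are illustrative assumptions and not part of the original answer; ties within a group resolve to the first row that holds the maximum.

```python
import pandas as pd

data = {'id': ['205', '205', '204', '204', '204'],
        'Sentiment': ['Positive', 'Positive', 'Neutral', 'Positive', 'Positive']}
df = pd.DataFrame(data)
df['freq'] = df.groupby('Sentiment')['id'].transform('nunique')
df['freq_sum'] = df.groupby('id')['freq'].transform('count')
df['freq_cent'] = df['freq'] / df['freq_sum']

# Row label of the highest freq_cent within each id (a Series indexed by id).
top_label = df.groupby('id')['freq_cent'].idxmax()

# Use .loc to fetch the Sentiment at those labels, keyed by id.
lookup = pd.Series(df.loc[top_label, 'Sentiment'].to_numpy(), index=top_label.index)

# Broadcast the group's Sentiment to every row, but only where the
# group's maximum freq_cent actually exceeds 0.5 (otherwise leave NaN).
group_max = df.groupby('id')['freq_cent'].transform('max')
df['sent_inclination'] = df['id'].map(lookup).where(group_max > 0.5)

print(df)
```

This keeps the original column names, because no merge is involved and so no _x/_y suffixes are introduced.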

You can also adjust the condition by changing the np.where line (run it before the final column selection, while freq_cent_y still exists):

df['sent_inclination'] = np.where(df['freq_cent_y'] > 0.5, 'Positive', np.where(df['freq_cent_y'] < 0.33, 'Negative', 'Neutral'))

This labels a group maximum above 0.5 as "Positive", below 0.33 as "Negative", and anything from 0.33 to 0.5 as "Neutral".
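
When there are more than two thresholds, nested np.where calls can get hard to read. A minimal sketch of the same three-way labelling with np.select follows; the group_max values are made-up illustration data, not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical per-row group maxima, chosen to hit each branch and both boundaries.
group_max = pd.Series([1.0, 0.4, 0.2, 0.5, 0.33])

# Conditions are checked in order; the first match wins.
conditions = [group_max > 0.5, group_max < 0.33]
choices = ['Positive', 'Negative']

# Rows matching no condition fall back to the default label.
labels = np.select(conditions, choices, default='Neutral')

print(labels.tolist())
```

Note that the boundary values 0.5 and 0.33 themselves fall into the default branch, because both comparisons are strict.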
