How to compute the occurrences of a specific value and its percentage for each column of a pandas dataframe, based on a condition?

Question

I have the following dataframe df, in which I highlighted in green the cells with values of interest:
[image: the dataframe df with the cells of interest highlighted in green]
For each column (that is, considering the whole dataframe) I would like to obtain the following statistics: the number of values less than or equal to 0.5 (the green cells in the dataframe), not counting NaN values, and their percentage among the non-NaN values of that column, so that I can use, say, 50% as a benchmark.

I tried value_counts, as in (df['A'].value_counts()/df['A'].count())*100, but this gives only a partial result (the percentage of each distinct value, not of the values <= 0.5) and only for one column at a time. I also thought about using a filter or a lambda function such as df.loc[lambda x: x <= 0.5], but clearly that is not the result I want.

The goal/output is a dataframe like the one shown below, which displays only the columns that "beat" the benchmark (recall: at least half, i.e. 50%, of their non-NaN values are <= 0.5).

[image: the expected output dataframe]
For example, in column A the count would be 2 and the percentage 2/3 * 100 ≈ 66.7%, while in column B the count would be 4 and the percentage 4/8 * 100 = 50%. (The same goes for columns X, Y and Z.) Column C, on the other hand, has 2/8 * 100 = 25%, which does not beat the benchmark, so it is not included in the output.

Is there a suitable way to achieve this, in your opinion? Apologies in advance if this is a duplicate, but I found no other question that could help me out. Thanks to anyone who can help.

Answer 1

Score: 0

I believe the code below does what you are asking for.
It would help if you added the expected output to your question so that it is easier to follow.

The first part of the code is just set-up, so you can ignore it, as you already have your data.
Basically, I have written a quick function that returns the fraction of the non-NaN values in a column that are less than or equal to a threshold you can define.
The function is called in a loop over all the columns of your dataframe, and a column is kept for the output if this fraction is at least the output threshold (which you can also define).

import datetime
import random

import numpy as np
import pandas as pd

### SET UP ###
# Builds random sample data; skip this part if you already have your dataframe.

base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(10)]

def rand_num_list(length):
    # `length` random values in [0, 1] rounded to one decimal,
    # padded with zeros up to 10 entries and then shuffled
    peak = [round(random.uniform(0, 1), 1) for i in range(length)] + [0] * (10 - length)
    random.shuffle(peak)
    return peak


df = pd.DataFrame(
    {
        'A': rand_num_list(3),
        'B': rand_num_list(5),
        'C': rand_num_list(7),
        'D': rand_num_list(2),
        'E': rand_num_list(6),
        'F': rand_num_list(4)
    },
    index=date_list
)

# Turn the zero padding into NaN (a random value that rounds to 0.0 is replaced too)
df = df.replace({0: np.nan})

##############

print(df)

def less_than_threshold(thresh_df, thresh_col, threshold):
    # Fraction of the non-NaN values in thresh_col that are <= threshold
    non_nan = thresh_df[thresh_col].dropna()
    if len(non_nan) == 0:
        return 0

    # NaN <= threshold is False, so NaN rows never enter the numerator
    return len(thresh_df.loc[thresh_df[thresh_col] <= threshold]) / len(non_nan)

output_dict = {'cols': []}
col_threshold = 0.5     # a value counts when it is <= this
output_threshold = 0.5  # minimum fraction a column needs to be kept
for col in df.columns:
    if less_than_threshold(df, col, col_threshold) >= output_threshold:
        output_dict['cols'].append(col)

# Keep only the columns that reach the output threshold
df_output = df.loc[:, output_dict.get('cols')]

print(df_output)
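
The loop above returns the qualifying columns themselves, whereas the question also asks for the count and the percentage per column. A minimal vectorized sketch of that summary follows. The sample data is hypothetical (the original screenshot is not available) and was chosen only to reproduce the numbers quoted in the question for columns A, B and C. The names stats, result, threshold and benchmark are likewise assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data matching the figures quoted in the question:
# A has 3 non-NaN values, B and C have 8 each.
df = pd.DataFrame(
    {
        "A": [0.2, np.nan, 0.4, np.nan, 0.9, np.nan, np.nan, np.nan],
        "B": [0.1, 0.5, 0.7, 0.3, 0.8, 0.9, 0.2, 0.6],
        "C": [0.9, 0.8, 0.1, 0.7, 0.6, 0.3, 1.0, 0.9],
    }
)

threshold = 0.5   # a value counts when it is <= threshold
benchmark = 50.0  # minimum percentage a column needs to be kept

# NaN <= threshold evaluates to False, so NaN cells are never counted.
count = df.le(threshold).sum()
# DataFrame.count() gives the number of non-NaN values per column.
non_nan = df.count()
percentage = count / non_nan * 100

# One row per column of df, then keep only the rows that reach the benchmark.
stats = pd.DataFrame({"count": count, "percentage": percentage})
result = stats[stats["percentage"] >= benchmark]
print(result)
```

With this sample, A (count 2, about 66.7%) and B (count 4, 50%) are kept, and C (25%) is dropped. If you prefer the original columns across the top, result.T gives that layout. A column that is entirely NaN gets a NaN percentage and is filtered out.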

Hope this achieves your goal!

huangapple
  • Posted on January 9, 2023 at 18:37:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75056044.html