Get past success rate at each timepoint using pandas


Question


I have a dataframe like the one below:

import pandas as pd

data = {
    'project_id': [11, 22, 31, 45, 52, 61],
    'user_id': [10001, 10002, 10001, 10003, 10002, 10004],
    'customer_id': [20001, 20002, 20001, 20001, 20003, 20002],
    'project_start_date': ['2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01', '2023-05-01', '2023-06-01'],
    'project_end_date': ['2023-01-31', '2023-02-28', '2023-03-31', '2023-04-30', '2023-05-31', '2023-05-10'],
    'product_name': ['Product A', 'Product B', 'Product C', 'Product A', 'Product B', 'Product C'],
    'success_flag': [0, 1, 1, 1, 0, 1]
}

df = pd.DataFrame(data)

I would like to do the following:

a) For each user, compute two metrics: past_project_success_count and past_project_success_ratio.

The logic for these two metrics is:

past_project_success_count: the number of past projects where success_flag = 1
past_project_success_ratio: all past successful projects (for each user) divided by all past projects (for each user)

b) Compute the same metrics for each customer.

c) Compute the same metrics for each user and customer combination.

d) Only past projects should count towards these metrics. The current row (the current project) must be excluded.
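The definitions in (a) and (d) can be pinned down with a slow but explicit reference sketch. It assumes that "past" means a strictly earlier project_end_date within the same group and that a row with no past projects gets a ratio of 0. The helper name past_metrics is hypothetical and only serves this illustration; the data is a subset of the columns from the sample above.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [10001, 10002, 10001, 10003, 10002, 10004],
    "project_end_date": ["2023-01-31", "2023-02-28", "2023-03-31",
                         "2023-04-30", "2023-05-31", "2023-05-10"],
    "success_flag": [0, 1, 1, 1, 0, 1],
})

def past_metrics(frame, row_label, key):
    """Hypothetical reference helper: inspect only earlier rows of the same group."""
    row = frame.loc[row_label]
    # ISO-formatted date strings compare correctly as plain strings
    past = frame[(frame[key] == row[key])
                 & (frame["project_end_date"] < row["project_end_date"])]
    count = int(past["success_flag"].sum())
    # assumption: no past projects means a ratio of 0
    ratio = count / len(past) if len(past) else 0.0
    return count, ratio

first_project = past_metrics(df, 0, "user_id")   # user 10001, no past projects
second_project = past_metrics(df, 4, "user_id")  # user 10002, one past success
```

Calling the helper with "customer_id" instead of "user_id" (on a frame that has that column) would give the per-customer variant from (b). A vectorised solution should reproduce these values.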

I tried the following, but it isn't getting me anywhere:

t1 = t1.sort_values('project_end_date')
t1['user_past_project_success_count'] = t1.groupby('user_id')['success_flag'].cumsum() / t1.groupby('user_id').cumcount()
t1['user_past_project_success_count'] = t1.groupby('customer_id')['success_flag'].cumsum() / t1.groupby('customer_id').cumcount()
t1['user_past_project_success_count'] = t1.groupby(['user_id','customer_id'])['success_flag'].cumsum() / t1.groupby(['user_id','customer_id']).cumcount()

One issue I observed is that cumcount sometimes returns the same value, and I don't know why. The past project count should simply increase by 1 with each project.
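One possible explanation for the repeated values (an assumption, since the question does not show the actual output): cumcount() numbers the rows within each group starting from 0, so the same number appears once per group when the whole column is viewed. A minimal sketch using the user_id values from the sample data:

```python
import pandas as pd

df = pd.DataFrame({"user_id": [10001, 10002, 10001, 10003, 10002, 10004]})

# cumcount() restarts at 0 for every user, so 0 and 1 repeat across users
df["past_project_count"] = df.groupby("user_id").cumcount()
print(df)
```

Within a single user the count still increases by 1 per project. It only looks repeated when rows of different users are interleaved.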

I expect output like the one below. I have shown it only for the user column, but the same is needed for the customer column and for the user and customer combination. Remember that the first record of each group will always be zero, as there is no past to look at.

[Image: expected output]

Answer 1

Score: 1


This loop iterates over the groupings you want and creates new columns with the statistics:

# sort once by end date; results align back to df by index
by_date = df.sort_values("project_end_date")

for name, cols in zip(
        ["user", "customer", "user_and_customer"],
        ["user_id", "customer_id", ["user_id", "customer_id"]]):
    groups = by_date.groupby(cols)
    # cumulative sum minus the current row, so only previous projects count
    df[f"past_successes_count_{name}"] = groups["success_flag"].transform(
        lambda x: x.cumsum() - x)
    # ratio: divide the column above by the number of previous projects
    df[f"past_successes_ratio_{name}"] = df[f"past_successes_count_{name}"].div(
        groups.cumcount())
# the first project in each group has no past (0 / 0 gives NaN), so fill with 0
# note: this fills NaN in every column of df
df.fillna(0, inplace=True)

.cumsum() includes the current row in its calculation, hence the need for x.cumsum() - x. .cumcount() numbers the rows of each group starting from 0, so its value already equals the number of earlier rows and only reflects 'past' data.
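The difference between the two methods is easiest to see on a single group. The following is a minimal sketch on made-up data (the frame flags and its values are assumptions for illustration, not taken from the question):

```python
import pandas as pd

flags = pd.DataFrame({
    "user_id": [1, 1, 1, 1],
    "success_flag": [1, 0, 1, 1],
})
grouped = flags.groupby("user_id")["success_flag"]

# cumsum() counts the current row too: 1, 1, 2, 3
flags["running_successes"] = grouped.cumsum()
# subtracting the current flag leaves only earlier rows: 0, 1, 1, 2
flags["past_successes"] = grouped.cumsum() - flags["success_flag"]
# cumcount() starts at 0, so it is already the number of earlier rows: 0, 1, 2, 3
flags["past_projects"] = grouped.cumcount()
# 0 / 0 on the first row gives NaN, which is replaced by 0
flags["past_ratio"] = flags["past_successes"].div(flags["past_projects"]).fillna(0)
print(flags)
```

Both cumsum() and cumcount() follow the current row order, so the frame has to be sorted by date before they are applied.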

huangapple
  • Posted on May 17, 2023, 23:08:39
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76273567.html