Calculating the correlation of each row with one in Group in Python

Question

I need help calculating four metrics for my table.
What is the correct way to keep referring to the same row within each department?

I have been trying to solve this problem for a very long time, but I cannot arrive at a logical solution.
I tried Partition from SQLAlchemy, but my skills weren't enough. I also tried writing a function with def, but functions are my weak point.

Therefore, I am asking for help with a solution or advice. Thank you!

My real task is not about salaries, so don't worry about that!

Example:

import pandas as pd

df2 = pd.DataFrame({
    'head_department': ['Anna', 'Anna', 'Anna', 'Anna', 'Anna',
                        'John', 'John', 'John',
                        'Denis', 'Denis',
                        'Sarah',
                        'Greg', 'Greg', 'Greg', 'Greg', 'Greg', 'Greg'],
    'serial_number': [1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 1, 1, 2, 3, 4, 5, 6],
    'Worker_in_dpt': ['Anna', 'Anna-s_worker_2', 'Anna-s_worker_3', 'Anna-s_worker_4', 'Anna-s_worker_5',
                      'John', 'John-s worker_2', 'John-s_worker_3',
                      'Denis', 'Denis-s_worker_2',
                      'Sarah',
                      'Greg', 'Greg-s_worker_2', 'Greg-s_worker_3', 'Greg-s_worker_4', 'Greg-s_worker_5',
                      'Greg-s_worker_6'],
    'jan_salary': [1000, 600, 600, 500, 900, None, 600, 500, 1200, 800, 1400, None, 700, 600, 600, 450, 700],
    'feb_salary': [1100, 700, 700, 700, 800, None, 500, 400, 1800, 900, 1000, 1400, 900, 800, 500, 450, 700],
    'mar_salary': [1200, 800, 800, 900, 700, 1300, 600, 500, 1800, 600, 1100, 1600, 400, 700, 600, 250, 700],
    'apr_salary': [1600, 900, 900, 700, 700, 2300, 500, 400, 1800, 900, 1100, 1900, 200, 900, 500, 150, 700],
    'may_salary': [1100, 700, 700, 700, 800, 2300, 500, 400, 1800, 900, 1000, 1400, 900, 800, 500, 450, 700],
    'jun_salary': [1200, 800, 800, 900, 700, 1300, 500, 400, 1800, 900, 1000, 1400, 900, 800, 500, 450, 800],
    'jul_salary': [1000, 600, 600, 500, 900, 1300, 600, 500, 1200, 800, 1400, 1200, 700, 600, 600, 450, 700],
    'aug_salary': [1100, 700, 700, 700, 800, 2300, 600, 500, 1800, 600, 1100, 1600, 400, 700, 600, 250, None]
})

# cast to str; otherwise numeric_only=True would include serial_number in the row mean
df2['serial_number'] = df2['serial_number'].astype('str')
df2['mean_salary_per_period'] = df2.mean(numeric_only=True, axis=1)
df2
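
As a side note on the row mean above, the cast to str is only there to keep serial_number out of the average. A minimal sketch of one alternative, assuming every salary column name contains "_salary" (true for the sample data, but an assumption in general), is to select those columns explicitly. The toy frame and its values are made up for illustration:

```python
import pandas as pd

# Toy frame with the same column layout as df2 (values are made up).
toy = pd.DataFrame({
    "head_department": ["Anna", "Anna"],
    "serial_number": [1, 2],
    "jan_salary": [1000, 600],
    "feb_salary": [1100, None],
})

# Pick only the salary columns, so serial_number can stay an integer.
salary_cols = toy.filter(like="_salary").columns

# Row-wise mean; NaN values are skipped by default.
toy["mean_salary_per_period"] = toy[salary_cols].mean(axis=1)
print(toy)
```

With this selection the type of serial_number no longer matters for the mean. Note that the answer's merge still needs serial_number to have the same type on both sides.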

I'm trying to calculate these columns:

  • df2['corr_with_head_in_dpt'] =
  • df2['mean _when_1_in_NAN'] =
  • df2['mean _when_1_not_in_NAN'] =
  • df2['sum_all_dpt_in_jan'] =

where:

  • df2['corr_with_head_in_dpt'] - the correlation, over the entire period, between each worker's salary and the salary of the head of the same department. The head of the department always has serial_number = 1.

  • df2['mean _when_1_in_NAN'] - each worker's average salary over the months in which the head of the department is NaN. If the head of the department has no NaN values, the average over the whole period.

  • df2['mean _when_1_not_in_NAN'] - each worker's average salary over the period starting from the first month in which the head of the department has a salary. If the head of the department has no NaN values, the average over the whole period.

  • df2['sum_all_dpt_in_jan'] - the sum of the January salaries of all workers in the department, including the head of the department's, even if it is NaN.
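
To make the last three definitions concrete, here is a minimal sketch on one hypothetical department. The frame `dept`, its labels, and its values are invented for illustration. It also assumes that "starting from the first month in which the head has a salary" means the same as "the months in which the head is not NaN", which holds when the missing values all sit at the start of the period, as in the sample data.

```python
import pandas as pd

# Hypothetical department: rows are workers, columns are months.
dept = pd.DataFrame(
    {"jan": [None, 600.0, 500.0],
     "feb": [None, 500.0, 400.0],
     "mar": [1300.0, 600.0, 500.0],
     "apr": [2300.0, 500.0, 400.0]},
    index=["head", "worker_2", "worker_3"],
)

# Boolean mask over months: True where the head has no salary.
head_missing = dept.loc["head"].isna()

# Mean over the months in which the head is NaN (whole period if there are none).
if head_missing.any():
    mean_when_head_nan = dept.loc[:, head_missing].mean(axis=1)
else:
    mean_when_head_nan = dept.mean(axis=1)

# Mean over the months in which the head has a salary.
mean_when_head_not_nan = dept.loc[:, ~head_missing].mean(axis=1)

# January total for the department; a NaN for the head is skipped by sum().
sum_jan = dept["jan"].sum()
```

Here worker_2 averages 550.0 in both periods, the head's own average over the NaN months is NaN, and the January total is 1100.0.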

Thank you!!!

Answer 1

Score: 1

The following will give you the desired output, using groupby.apply() with a function:

# keep only the keys and the monthly salary columns, then reshape to long form
df = df2.loc[:, ~df2.columns.isin(
    ["Worker_in_dpt", "mean_salary_per_period"])] \
    .set_index(['head_department', 'serial_number']).stack(dropna=False)

def func(frame):
    # unstack so that serial numbers become columns and months become rows
    frame = frame.unstack(level="serial_number")
    # correlation matrix; the first row holds the correlations with serial number 1
    corr = frame.corr().iloc[0]
    # months in which serial number 1 (first column) is NaN
    nan = frame.iloc[:, 0].isna()
    # mean over the months in which the head is not NaN
    mean_not_nan = frame.loc[~nan, :].mean(axis=0)
    # if the head has NaN months, mean over those; otherwise same as above
    if nan.any():
        mean_nan = frame.loc[nan, :].mean(axis=0)
    else:
        mean_nan = mean_not_nan.copy()
    # department total for January, repeated for every worker
    is_jan = frame.index.get_level_values(-1).str.contains("jan")
    sum_jan = pd.Series(data=frame.loc[is_jan].sum(axis=1).iloc[0],
                        index=frame.columns)

    # concat all series and name the columns as in the question
    return pd.concat([
        corr, mean_nan, mean_not_nan, sum_jan], axis=1) \
        .set_axis(
            ['corr_with_head_in_dpt',
             'mean _when_1_in_NAN',
             'mean _when_1_not_in_NAN',
             'sum_all_dpt_in_jan'], axis=1)

# group by head_department and apply func
additional_cols = df.groupby(level="head_department").apply(func)

# merge df2 with the grouped outputs on (head_department, serial_number)
out = pd.merge(df2, additional_cols,
               left_on=['head_department', 'serial_number'],
               right_index=True)
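
The correlation step can be checked in isolation. The sketch below is not part of the answer: it uses a hypothetical single-department frame `months` (serial numbers as columns, months as rows, made-up values) and pandas' `DataFrame.corrwith`, which correlates every column with one given Series. It should give the same numbers as taking the first row of the full correlation matrix.

```python
import pandas as pd

# Hypothetical department: columns are serial numbers, rows are months.
months = pd.DataFrame({
    "1": [1000.0, 1100.0, 1200.0, 1600.0],  # head of department
    "2": [600.0, 700.0, 800.0, 1200.0],     # head's salary minus 400
    "3": [2000.0, 1900.0, 1800.0, 1400.0],  # 3000 minus head's salary
})

# Pearson correlation of every column with the head's column.
corr_with_head = months.corrwith(months["1"])

# Same values via the full correlation matrix, as in func above.
first_row = months.corr().iloc[0]
print(corr_with_head)
```

Worker 2 moves exactly with the head (correlation 1) and worker 3 moves exactly against the head (correlation -1), up to floating-point rounding. `corrwith` avoids building the full matrix when only the head's row is needed.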

Posted by huangapple on 2023-07-11 00:21:40
When reposting, please keep the link to this article: https://go.coder-hub.com/76655614.html