使用groupby()创建移动平均,不使用组内的第一个值。

huangapple go评论55阅读模式
英文:

Creating a moving average with groupby() not using the first value in the group

问题

我正在尝试在使用groupby函数后创建移动平均值。我有患者(假设的)数据,已分组,然后我尝试创建他们的乳酸的移动平均值,回顾以前的数据,即第一个值应为NaN,然后第二个值应为第一个观察到的乳酸,依此类推。

df['LAC_missing'] = pd.to_numeric(df['LAC_missing'], errors='coerce')

grouped = df.groupby(['Test ID', 'Appointment Date:'])

# 计算累积和
df['Cumulative Sum'] = grouped['LAC_missing'].cumsum()

df['Count'] = grouped.cumcount()

# 计算移动平均值(当'Count' > 0时,除以'Count',否则设为NaN)
df['LAC Moving Average'] = df['Cumulative Sum'] / df['Count'].where(df['Count'] > 0)

查看图片描述

英文:

I'm trying to create a moving average after using the groupby function. I have patient's (hypothetical) data that is made into groups, then I try to create a moving average of their Lactate looking back, as in the first value should be NaN, then the second value should be the first observed Lactate, and so on.

df['LAC_missing'] = pd.to_numeric(df['LAC_missing'], errors='coerce')

grouped = df.groupby(['Test ID', 'Appointment Date:'])

# Calculate the cumulative sum 
df['Cumulative Sum'] = grouped['LAC_missing'].cumsum()

df['Count'] = grouped.cumcount()

# Calculate the moving average (divide by 'Count' where 'Count' > 0, otherwise set to NaN)
df['LAC Moving Average'] = df['Cumulative Sum'] / df['Count'].where(df['Count'] > 0)

enter image description here

答案1

得分: 0

看起来(根据你的代码),你想要类似以下的效果:

import pandas as pd
import numpy as np

# 创建一个示例数据集
data = {
    'Test ID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'Appointment Date': ['2023-07-16', '2023-07-17', '2023-07-18'] * 3,
    'LAC_missing': [1.4, np.nan, 1.0, np.nan, 1.7, 1.8, 1.2, 1.3, 1.9]
}

df = pd.DataFrame(data)

df['Appointment Date'] = pd.to_datetime(df['Appointment Date'])

# 根据你的代码,似乎你想要这样做:
df['LAC Moving Average'] = df.groupby('Test ID')['LAC_missing'].cumsum() / df.groupby('Test ID')['LAC_missing'].cumcount().where(lambda x: x != 0, pd.NA)

# 但是“就像第一个值应该是NaN,然后第二个值应该是第一个观察到的Lactate,依此类推”让我认为可能是这样:
df['Previous LAC'] = df.groupby('Test ID')['LAC_missing'].shift(1)

   Test ID Appointment Date  LAC_missing  LAC Moving Average  Previous LAC
0        1       2023-07-16          1.4                 NaN           NaN
1        1       2023-07-17          NaN                 NaN           1.4
2        1       2023-07-18          1.0                1.20           NaN
3        2       2023-07-16          NaN                 NaN           NaN
4        2       2023-07-17          1.7                1.70           NaN
5        2       2023-07-18          1.8                1.75           1.7
6        3       2023-07-16          1.2                 NaN           NaN
7        3       2023-07-17          1.3                2.50           1.2
8        3       2023-07-18          1.9                2.20           1.3
英文:

It looks like (from your code) you're wanting something like this:

import pandas as pd
import numpy as np

# Create a toy dataset
data = {
    'Test ID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'Appointment Date': ['2023-07-16', '2023-07-17', '2023-07-18'] * 3,
    'LAC_missing': [1.4, np.nan, 1.0, np.nan, 1.7, 1.8, 1.2, 1.3, 1.9]
}

df = pd.DataFrame(data)

df['Appointment Date'] = pd.to_datetime(df['Appointment Date'])

# it looks like from your code you are doing this:
df['LAC Moving Average'] = df.groupby('Test ID')['LAC_missing'].cumsum() / df.groupby('Test ID')['LAC_missing'].cumcount().where(lambda x: x != 0, pd.NA)

# but "as in the first value should be NaN, then the second value should be the first observed Lactate, and so on" makes me think it might be this:
df['Previous LAC'] = df.groupby('Test ID')['LAC_missing'].shift(1)

   Test ID Appointment Date  LAC_missing  LAC Moving Average  Previous LAC
0        1       2023-07-16          1.4                 NaN           NaN
1        1       2023-07-17          NaN                 NaN           1.4
2        1       2023-07-18          1.0                1.20           NaN
3        2       2023-07-16          NaN                 NaN           NaN
4        2       2023-07-17          1.7                1.70           NaN
5        2       2023-07-18          1.8                1.75           1.7
6        3       2023-07-16          1.2                 NaN           NaN
7        3       2023-07-17          1.3                2.50           1.2
8        3       2023-07-18          1.9                2.20           1.3

答案2

得分: 0

# 将 'LAC_missing' 列转换为数值,将非数值值替换为 NaN
df['LAC_missing'] = pd.to_numeric(df['LAC_missing'], errors='coerce')

grouped = df.groupby(['Test ID', 'Appointment Date:'])

# 但是 "就像第一个值应该是 NaN,然后第二个值应该是第一个观察到的乳酸,依此类推" 使我认为可能是这样的:
df['Previous LAC'] = grouped['LAC_missing'].shift(1)

# 计算累积和
df['Cumulative Sum'] = grouped['Previous LAC'].cumsum()

# 用每个组中前一个非空值来填充 'Cumulative Sum' 列中的 NaN 值
df['Cumulative Sum'] = df['Cumulative Sum'].fillna(method='ffill')

# 计算每个组中当前时间点的非 NaN 值的计数
count_df = grouped['Previous LAC'].apply(lambda x: x.notna().cumsum()).reset_index(drop=True)

df['Count'] = count_df.reset_index(drop=True)

# 计算移动平均值(除以 'Count',其中 'Count' > 0,否则设置为 NaN)
df['LAC Moving Average'] = df['Cumulative Sum'] / df['Count'].where(df['Count'] > 0)

这是你提供的代码的中文翻译部分。

英文:
# Convert 'LAC_missing' column to numeric, replacing non-numeric values with NaN
df['LAC_missing'] = pd.to_numeric(df['LAC_missing'], errors='coerce')

grouped = df.groupby(['Test ID', 'Appointment Date:'])

# but "as in the first value should be NaN, then the second value should be the first observed Lactate, and so on" makes me think it might be this:
df['Previous LAC'] = grouped['LAC_missing'].shift(1)

# Calculate the cumulative sum
df['Cumulative Sum'] = grouped['Previous LAC'].cumsum()

# Fill the NaN values in 'Cumulative Sum' column with the previous non-null value in each group
df['Cumulative Sum'] = df['Cumulative Sum'].fillna(method='ffill')

# Calculate the count of non-NaN values for each group up to the current time
count_df = grouped['Previous LAC'].apply(lambda x: x.notna().cumsum()).reset_index(drop=True)

df['Count'] = count_df.reset_index(drop=True)

# Calculate the moving average (divide by 'Count' where 'Count' > 0, otherwise set to NaN)
df['LAC Moving Average'] = df['Cumulative Sum'] / df['Count'].where(df['Count'] > 0)

huangapple
  • 本文由 发表于 2023年7月18日 08:10:30
  • 转载请务必保留本文链接:https://go.coder-hub.com/76708797.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定