英文:
Creating a moving average with groupby() not using the first value in the group
问题
我正在尝试在使用groupby函数后创建移动平均值。我有患者(假设的)数据,已分组,然后我尝试创建他们的乳酸的移动平均值,回顾以前的数据,即第一个值应为NaN,然后第二个值应为第一个观察到的乳酸,依此类推。
df['LAC_missing'] = pd.to_numeric(df['LAC_missing'], errors='coerce')
grouped = df.groupby(['Test ID', 'Appointment Date:'])
# 计算累积和
df['Cumulative Sum'] = grouped['LAC_missing'].cumsum()
df['Count'] = grouped.cumcount()
# 计算移动平均值(当'Count' > 0时,除以'Count',否则设为NaN)
df['LAC Moving Average'] = df['Cumulative Sum'] / df['Count'].where(df['Count'] > 0)
英文:
I'm trying to create a moving average after using the groupby function. I have patient's (hypothetical) data that is made into groups, then I try to create a moving average of their Lactate looking back, as in the first value should be NaN, then the second value should be the first observed Lactate, and so on.
df['LAC_missing'] = pd.to_numeric(df['LAC_missing'], errors='coerce')
grouped = df.groupby(['Test ID', 'Appointment Date:'])
# Calculate the cumulative sum
df['Cumulative Sum'] = grouped['LAC_missing'].cumsum()
df['Count'] = grouped.cumcount()
# Calculate the moving average (divide by 'Count' where 'Count' > 0, otherwise set to NaN)
df['LAC Moving Average'] = df['Cumulative Sum'] / df['Count'].where(df['Count'] > 0)
答案1
得分: 0
看起来(根据你的代码),你想要类似以下的效果:
import pandas as pd
import numpy as np
# 创建一个示例数据集
data = {
'Test ID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
'Appointment Date': ['2023-07-16', '2023-07-17', '2023-07-18'] * 3,
'LAC_missing': [1.4, np.nan, 1.0, np.nan, 1.7, 1.8, 1.2, 1.3, 1.9]
}
df = pd.DataFrame(data)
df['Appointment Date'] = pd.to_datetime(df['Appointment Date'])
# 根据你的代码,似乎你想要这样做:
df['LAC Moving Average'] = df.groupby('Test ID')['LAC_missing'].cumsum() / df.groupby('Test ID')['LAC_missing'].cumcount().where(lambda x: x != 0, pd.NA)
# 但是“就像第一个值应该是NaN,然后第二个值应该是第一个观察到的Lactate,依此类推”让我认为可能是这样:
df['Previous LAC'] = df.groupby('Test ID')['LAC_missing'].shift(1)
Test ID Appointment Date LAC_missing LAC Moving Average Previous LAC
0 1 2023-07-16 1.4 NaN NaN
1 1 2023-07-17 NaN NaN 1.4
2 1 2023-07-18 1.0 1.20 NaN
3 2 2023-07-16 NaN NaN NaN
4 2 2023-07-17 1.7 1.70 NaN
5 2 2023-07-18 1.8 1.75 1.7
6 3 2023-07-16 1.2 NaN NaN
7 3 2023-07-17 1.3 2.50 1.2
8 3 2023-07-18 1.9 2.20 1.3
英文:
It looks like (from your code) you're wanting something like this:
import pandas as pd
import numpy as np
# Create a toy dataset
data = {
'Test ID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
'Appointment Date': ['2023-07-16', '2023-07-17', '2023-07-18'] * 3,
'LAC_missing': [1.4, np.nan, 1.0, np.nan, 1.7, 1.8, 1.2, 1.3, 1.9]
}
df = pd.DataFrame(data)
df['Appointment Date'] = pd.to_datetime(df['Appointment Date'])
# it looks like from your code you are doing this:
df['LAC Moving Average'] = df.groupby('Test ID')['LAC_missing'].cumsum() / df.groupby('Test ID')['LAC_missing'].cumcount().where(lambda x: x != 0, pd.NA)
# but "as in the first value should be NaN, then the second value should be the first observed Lactate, and so on" makes me think it might be this:
df['Previous LAC'] = df.groupby('Test ID')['LAC_missing'].shift(1)
Test ID Appointment Date LAC_missing LAC Moving Average Previous LAC
0 1 2023-07-16 1.4 NaN NaN
1 1 2023-07-17 NaN NaN 1.4
2 1 2023-07-18 1.0 1.20 NaN
3 2 2023-07-16 NaN NaN NaN
4 2 2023-07-17 1.7 1.70 NaN
5 2 2023-07-18 1.8 1.75 1.7
6 3 2023-07-16 1.2 NaN NaN
7 3 2023-07-17 1.3 2.50 1.2
8 3 2023-07-18 1.9 2.20 1.3
答案2
得分: 0
# 将 'LAC_missing' 列转换为数值,将非数值值替换为 NaN
df['LAC_missing'] = pd.to_numeric(df['LAC_missing'], errors='coerce')
grouped = df.groupby(['Test ID', 'Appointment Date:'])
# 但是 "就像第一个值应该是 NaN,然后第二个值应该是第一个观察到的乳酸,依此类推" 使我认为可能是这样的:
df['Previous LAC'] = grouped['LAC_missing'].shift(1)
# 计算累积和
df['Cumulative Sum'] = grouped['Previous LAC'].cumsum()
# 用每个组中前一个非空值来填充 'Cumulative Sum' 列中的 NaN 值
df['Cumulative Sum'] = df['Cumulative Sum'].fillna(method='ffill')
# 计算每个组中当前时间点的非 NaN 值的计数
count_df = grouped['Previous LAC'].apply(lambda x: x.notna().cumsum()).reset_index(drop=True)
df['Count'] = count_df.reset_index(drop=True)
# 计算移动平均值(除以 'Count',其中 'Count' > 0,否则设置为 NaN)
df['LAC Moving Average'] = df['Cumulative Sum'] / df['Count'].where(df['Count'] > 0)
这是你提供的代码的中文翻译部分。
英文:
# Convert 'LAC_missing' column to numeric, replacing non-numeric values with NaN
df['LAC_missing'] = pd.to_numeric(df['LAC_missing'], errors='coerce')
grouped = df.groupby(['Test ID', 'Appointment Date:'])
# but "as in the first value should be NaN, then the second value should be the first observed Lactate, and so on" makes me think it might be this:
df['Previous LAC'] = grouped['LAC_missing'].shift(1)
# Calculate the cumulative sum
df['Cumulative Sum'] = grouped['Previous LAC'].cumsum()
# Fill the NaN values in 'Cumulative Sum' column with the previous non-null value in each group
df['Cumulative Sum'] = df['Cumulative Sum'].fillna(method='ffill')
# Calculate the count of non-NaN values for each group up to the current time
count_df = grouped['Previous LAC'].apply(lambda x: x.notna().cumsum()).reset_index(drop=True)
df['Count'] = count_df.reset_index(drop=True)
# Calculate the moving average (divide by 'Count' where 'Count' > 0, otherwise set to NaN)
df['LAC Moving Average'] = df['Cumulative Sum'] / df['Count'].where(df['Count'] > 0)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论