使用groupby()创建移动平均,不使用组内的第一个值。

huangapple go评论90阅读模式
英文:

Creating a moving average with groupby() not using the first value in the group

问题

我正在尝试在使用groupby函数后创建移动平均值。我有患者(假设的)数据,已分组,然后我尝试创建他们的乳酸的移动平均值,回顾以前的数据,即第一个值应为NaN,然后第二个值应为第一个观察到的乳酸,依此类推。

  1. df['LAC_missing'] = pd.to_numeric(df['LAC_missing'], errors='coerce')
  2. grouped = df.groupby(['Test ID', 'Appointment Date:'])
  3. # 计算累积和
  4. df['Cumulative Sum'] = grouped['LAC_missing'].cumsum()
  5. df['Count'] = grouped.cumcount()
  6. # 计算移动平均值(当'Count' > 0时,除以'Count',否则设为NaN)
  7. df['LAC Moving Average'] = df['Cumulative Sum'] / df['Count'].where(df['Count'] > 0)

查看图片描述

英文:

I'm trying to create a moving average after using the groupby function. I have patient's (hypothetical) data that is made into groups, then I try to create a moving average of their Lactate looking back, as in the first value should be NaN, then the second value should be the first observed Lactate, and so on.

  1. df['LAC_missing'] = pd.to_numeric(df['LAC_missing'], errors='coerce')
  2. grouped = df.groupby(['Test ID', 'Appointment Date:'])
  3. # Calculate the cumulative sum
  4. df['Cumulative Sum'] = grouped['LAC_missing'].cumsum()
  5. df['Count'] = grouped.cumcount()
  6. # Calculate the moving average (divide by 'Count' where 'Count' > 0, otherwise set to NaN)
  7. df['LAC Moving Average'] = df['Cumulative Sum'] / df['Count'].where(df['Count'] > 0)

enter image description here

答案1

得分: 0

看起来(根据你的代码),你想要类似以下的效果:

  1. import pandas as pd
  2. import numpy as np
  3. # 创建一个示例数据集
  4. data = {
  5. 'Test ID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
  6. 'Appointment Date': ['2023-07-16', '2023-07-17', '2023-07-18'] * 3,
  7. 'LAC_missing': [1.4, np.nan, 1.0, np.nan, 1.7, 1.8, 1.2, 1.3, 1.9]
  8. }
  9. df = pd.DataFrame(data)
  10. df['Appointment Date'] = pd.to_datetime(df['Appointment Date'])
  11. # 根据你的代码,似乎你想要这样做:
  12. df['LAC Moving Average'] = df.groupby('Test ID')['LAC_missing'].cumsum() / df.groupby('Test ID')['LAC_missing'].cumcount().where(lambda x: x != 0, pd.NA)
  13. # 但是“就像第一个值应该是NaN,然后第二个值应该是第一个观察到的Lactate,依此类推”让我认为可能是这样:
  14. df['Previous LAC'] = df.groupby('Test ID')['LAC_missing'].shift(1)
  15. Test ID Appointment Date LAC_missing LAC Moving Average Previous LAC
  16. 0 1 2023-07-16 1.4 NaN NaN
  17. 1 1 2023-07-17 NaN NaN 1.4
  18. 2 1 2023-07-18 1.0 1.20 NaN
  19. 3 2 2023-07-16 NaN NaN NaN
  20. 4 2 2023-07-17 1.7 1.70 NaN
  21. 5 2 2023-07-18 1.8 1.75 1.7
  22. 6 3 2023-07-16 1.2 NaN NaN
  23. 7 3 2023-07-17 1.3 2.50 1.2
  24. 8 3 2023-07-18 1.9 2.20 1.3
英文:

It looks like (from your code) you're wanting something like this:

  1. import pandas as pd
  2. import numpy as np
  3. # Create a toy dataset
  4. data = {
  5. 'Test ID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
  6. 'Appointment Date': ['2023-07-16', '2023-07-17', '2023-07-18'] * 3,
  7. 'LAC_missing': [1.4, np.nan, 1.0, np.nan, 1.7, 1.8, 1.2, 1.3, 1.9]
  8. }
  9. df = pd.DataFrame(data)
  10. df['Appointment Date'] = pd.to_datetime(df['Appointment Date'])
  11. # it looks like from your code you are doing this:
  12. df['LAC Moving Average'] = df.groupby('Test ID')['LAC_missing'].cumsum() / df.groupby('Test ID')['LAC_missing'].cumcount().where(lambda x: x != 0, pd.NA)
  13. # but "as in the first value should be NaN, then the second value should be the first observed Lactate, and so on" makes me think it might be this:
  14. df['Previous LAC'] = df.groupby('Test ID')['LAC_missing'].shift(1)
  15. Test ID Appointment Date LAC_missing LAC Moving Average Previous LAC
  16. 0 1 2023-07-16 1.4 NaN NaN
  17. 1 1 2023-07-17 NaN NaN 1.4
  18. 2 1 2023-07-18 1.0 1.20 NaN
  19. 3 2 2023-07-16 NaN NaN NaN
  20. 4 2 2023-07-17 1.7 1.70 NaN
  21. 5 2 2023-07-18 1.8 1.75 1.7
  22. 6 3 2023-07-16 1.2 NaN NaN
  23. 7 3 2023-07-17 1.3 2.50 1.2
  24. 8 3 2023-07-18 1.9 2.20 1.3

答案2

得分: 0

  1. # 将 'LAC_missing' 列转换为数值,将非数值值替换为 NaN
  2. df['LAC_missing'] = pd.to_numeric(df['LAC_missing'], errors='coerce')
  3. grouped = df.groupby(['Test ID', 'Appointment Date:'])
  4. # 但是 "就像第一个值应该是 NaN,然后第二个值应该是第一个观察到的乳酸,依此类推" 使我认为可能是这样的:
  5. df['Previous LAC'] = grouped['LAC_missing'].shift(1)
  6. # 计算累积和
  7. df['Cumulative Sum'] = grouped['Previous LAC'].cumsum()
  8. # 用每个组中前一个非空值来填充 'Cumulative Sum' 列中的 NaN 值
  9. df['Cumulative Sum'] = df['Cumulative Sum'].fillna(method='ffill')
  10. # 计算每个组中当前时间点的非 NaN 值的计数
  11. count_df = grouped['Previous LAC'].apply(lambda x: x.notna().cumsum()).reset_index(drop=True)
  12. df['Count'] = count_df.reset_index(drop=True)
  13. # 计算移动平均值(除以 'Count',其中 'Count' > 0,否则设置为 NaN)
  14. df['LAC Moving Average'] = df['Cumulative Sum'] / df['Count'].where(df['Count'] > 0)

这是你提供的代码的中文翻译部分。

英文:
  1. # Convert 'LAC_missing' column to numeric, replacing non-numeric values with NaN
  2. df['LAC_missing'] = pd.to_numeric(df['LAC_missing'], errors='coerce')
  3. grouped = df.groupby(['Test ID', 'Appointment Date:'])
  4. # but "as in the first value should be NaN, then the second value should be the first observed Lactate, and so on" makes me think it might be this:
  5. df['Previous LAC'] = grouped['LAC_missing'].shift(1)
  6. # Calculate the cumulative sum
  7. df['Cumulative Sum'] = grouped['Previous LAC'].cumsum()
  8. # Fill the NaN values in 'Cumulative Sum' column with the previous non-null value in each group
  9. df['Cumulative Sum'] = df['Cumulative Sum'].fillna(method='ffill')
  10. # Calculate the count of non-NaN values for each group up to the current time
  11. count_df = grouped['Previous LAC'].apply(lambda x: x.notna().cumsum()).reset_index(drop=True)
  12. df['Count'] = count_df.reset_index(drop=True)
  13. # Calculate the moving average (divide by 'Count' where 'Count' > 0, otherwise set to NaN)
  14. df['LAC Moving Average'] = df['Cumulative Sum'] / df['Count'].where(df['Count'] > 0)

huangapple
  • 本文由 发表于 2023年7月18日 08:10:30
  • 转载请务必保留本文链接:https://go.coder-hub.com/76708797.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定