How to find maximum outer time span and append?

Question

For each ID, I would like to find the earliest measurement time before 00:00:00 and the latest measurement time after 00:00:00, so that I can choose the maximum overlapping start and end times. Here is the sample data:

    import pandas as pd

    # create sample dataframe
    df = pd.DataFrame({
        'ID': [1, 1, 1,
               2, 2, 2,
               3, 3, 3],
        'timestamp': ['2022-01-01 12:00:00', '2022-01-01 12:30:00', '2022-01-01 13:00:00',
                      '2022-01-01 12:02:00', '2022-01-01 12:22:00', '2022-01-01 12:42:00',
                      '2022-01-01 12:45:00', '2022-01-01 12:55:00', '2022-01-01 13:05:00'],
        'VALUE1': [50, 80, 65,
                   61, 83, 63,
                   64, 85, 66],
        'gender': ['m', 'm', 'm',
                   'f', 'f', 'f',
                   'm', 'm', 'm'],
        'age': [7, 7, 7,
                8, 8, 8,
                6, 6, 6],
    })

I expect to find T_start and T_end as in the picture (for a same-day 24h measurement), and then to append (like the dashed lines) the value of the last or first measurement. Please refer to the drawing, since my wording of the problem might be confusing.

What I have:

    ID  value1  timestamp            gender  age
    1   50      2022-01-01 12:00:00  m       7
    1   80      2022-01-01 12:30:00  m       7
    1   65      2022-01-01 13:00:00  m       7
    2   61      2022-01-01 12:02:00  f       8
    2   83      2022-01-01 12:22:00  f       8
    2   63      2022-01-01 12:42:00  f       8
    3   64      2022-01-01 12:45:00  m       6
    3   85      2022-01-01 12:55:00  m       6
    3   66      2022-01-01 13:05:00  m       6

What I expect:

    ID  value1                   timestamp            gender  age
    1   50                       2022-01-01 12:00:00  m       7
    1   80                       2022-01-01 12:30:00  m       7
    1   65                       2022-01-01 13:00:00  m       7
    1   ValueofPreviousEntry=65  2022-01-01 13:05:00  m       7
    2   ValueofNextEntry=61      2022-01-01 12:00:00  f       8
    2   61                       2022-01-01 12:02:00  f       8
    2   83                       2022-01-01 12:22:00  f       8
    2   63                       2022-01-01 12:42:00  f       8
    2   ValueofPreviousEntry     2022-01-01 12:45:00  f       8
    2   ValueofPreviousEntry     2022-01-01 12:55:00  f       8
    2   ValueofPreviousEntry     2022-01-01 13:00:00  f       8
    2   ValueofPreviousEntry     2022-01-01 13:05:00  f       8
    3   ValueofNextEntry         2022-01-01 12:00:00  m       6
    3   ValueofNextEntry         2022-01-01 12:02:00  m       6
    3   ValueofNextEntry         2022-01-01 12:22:00  m       6
    3   ValueofNextEntry=64      2022-01-01 12:42:00  m       6
    3   64                       2022-01-01 12:45:00  m       6
    3   85                       2022-01-01 12:55:00  m       6
    3   66                       2022-01-01 13:05:00  m       6
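
One possible reading of T_start and T_end is the earliest first measurement and the latest last measurement across all IDs. The sketch below computes them under that assumption; the names `span`, `t_start` and `t_end` are chosen here for illustration and do not come from the question.

```python
import pandas as pd

# Only the two columns needed for the span are reproduced here
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "timestamp": pd.to_datetime([
        "2022-01-01 12:00:00", "2022-01-01 12:30:00", "2022-01-01 13:00:00",
        "2022-01-01 12:02:00", "2022-01-01 12:22:00", "2022-01-01 12:42:00",
        "2022-01-01 12:45:00", "2022-01-01 12:55:00", "2022-01-01 13:05:00",
    ]),
})

# First and last measurement time of every ID
span = df.groupby("ID")["timestamp"].agg(["min", "max"])

# Outer time span over all IDs (assumed meaning of T_start / T_end)
t_start = span["min"].min()
t_end = span["max"].max()

print(span)
print(t_start, t_end)
```

Any ID whose own `min` is later than `t_start`, or whose own `max` is earlier than `t_end`, is one that needs rows appended.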

Answer 1

Score: 1

Here is one way to do it with Pandas Timestamp:

    # Format values
    df["timestamp"] = pd.to_datetime(df["timestamp"])

    # Iterate on timestamps to find missing ones for each sub-dataframe per age
    # (this works here only because each ID has its own age; split on "ID" in general)
    dfs = []
    for age in df["age"].unique():
        tmp = df.loc[df["age"] == age, :].reset_index(drop=True)
        min_timestamp = tmp["timestamp"].min()
        max_timestamp = tmp["timestamp"].max()
        for timestamp in df["timestamp"].unique():
            if (
                pd.Timestamp(timestamp) < min_timestamp
                or pd.Timestamp(timestamp) > max_timestamp
            ):
                # Append a placeholder row that only carries the timestamp
                tmp.loc[tmp.shape[0], :] = [pd.NA, timestamp, pd.NA, pd.NA, pd.NA]
        # bfill()/ffill() replace the deprecated fillna(method=...)
        dfs.append(tmp.sort_values("timestamp").bfill().ffill())

    # Concatenate sub dataframes into one
    new_df = pd.concat(dfs, ignore_index=True).astype({"VALUE1": int, "age": int})

Then:

    print(new_df)
    # Output
         ID           timestamp  VALUE1 gender  age
    0   1.0 2022-01-01 12:00:00      50      m    7
    1   1.0 2022-01-01 12:30:00      80      m    7
    2   1.0 2022-01-01 13:00:00      65      m    7
    3   1.0 2022-01-01 13:05:00      65      m    7
    4   2.0 2022-01-01 12:00:00      61      f    8
    5   2.0 2022-01-01 12:02:00      61      f    8
    6   2.0 2022-01-01 12:22:00      83      f    8
    7   2.0 2022-01-01 12:42:00      63      f    8
    8   2.0 2022-01-01 12:45:00      63      f    8
    9   2.0 2022-01-01 12:55:00      63      f    8
    10  2.0 2022-01-01 13:00:00      63      f    8
    11  2.0 2022-01-01 13:05:00      63      f    8
    12  3.0 2022-01-01 12:00:00      64      m    6
    13  3.0 2022-01-01 12:02:00      64      m    6
    14  3.0 2022-01-01 12:22:00      64      m    6
    15  3.0 2022-01-01 12:30:00      64      m    6
    16  3.0 2022-01-01 12:42:00      64      m    6
    17  3.0 2022-01-01 12:45:00      64      m    6
    18  3.0 2022-01-01 12:55:00      85      m    6
    19  3.0 2022-01-01 13:05:00      66      m    6
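
The loop above splits the data by age, which happens to coincide with ID in the sample. A variant that groups by ID directly might look like the sketch below. It is not the answerer's code: it swaps the row-by-row append for `reindex`, assumes timestamps are unique within each ID, and the names `all_times`, `parts` and `result` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "timestamp": ["2022-01-01 12:00:00", "2022-01-01 12:30:00", "2022-01-01 13:00:00",
                  "2022-01-01 12:02:00", "2022-01-01 12:22:00", "2022-01-01 12:42:00",
                  "2022-01-01 12:45:00", "2022-01-01 12:55:00", "2022-01-01 13:05:00"],
    "VALUE1": [50, 80, 65, 61, 83, 63, 64, 85, 66],
    "gender": ["m", "m", "m", "f", "f", "f", "m", "m", "m"],
    "age": [7, 7, 7, 8, 8, 8, 6, 6, 6],
})
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Every timestamp that occurs for any ID
all_times = pd.DatetimeIndex(df["timestamp"].unique(), name="timestamp").sort_values()

parts = []
for _, group in df.groupby("ID"):
    g = group.set_index("timestamp").sort_index()
    # Timestamps of other IDs that lie outside this ID's own span
    outside = all_times[(all_times < g.index[0]) | (all_times > g.index[-1])]
    # Add them as empty rows, then copy the first/last measurement outwards
    padded = g.reindex(g.index.union(outside)).bfill().ffill()
    parts.append(padded.rename_axis("timestamp").reset_index())

result = pd.concat(parts, ignore_index=True).astype(
    {"ID": int, "VALUE1": int, "age": int}
)
print(result)
```

Because only timestamps outside each ID's own span are added, inner gaps (for example 12:30 for ID 2) are left out, as in the output above.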

Answer 2

Score: 1

Solution

The following code uses only pandas functions without loops to solve the problem.

    # Prepare and reshape DataFrame
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df["gender"] = df.gender.replace({'f': 0, 'm': 1})
    df = df.pivot_table(index='timestamp', columns='ID')

    # Fill outer missing values
    df = df.groupby(level=1, axis=1, group_keys=False).apply(
        lambda x: x.loc[x.first_valid_index():x.last_valid_index()]
                   .reindex(x.index, method="nearest")
    )

    # Bring back to original shape
    df = df.stack(dropna=True).swaplevel(0, 1, axis=0).sort_index()
    df = df.replace({"gender": {0: 'f', 1: 'm'}}).convert_dtypes().reset_index()

This returns the following DataFrame:

    >>> df
        ID           timestamp  VALUE1  age gender
    0    1 2022-01-01 12:00:00      50    7      m
    1    1 2022-01-01 12:30:00      80    7      m
    2    1 2022-01-01 13:00:00      65    7      m
    3    1 2022-01-01 13:05:00      65    7      m
    4    2 2022-01-01 12:00:00      61    8      f
    5    2 2022-01-01 12:02:00      61    8      f
    6    2 2022-01-01 12:22:00      83    8      f
    7    2 2022-01-01 12:42:00      63    8      f
    8    2 2022-01-01 12:45:00      63    8      f
    9    2 2022-01-01 12:55:00      63    8      f
    10   2 2022-01-01 13:00:00      63    8      f
    11   2 2022-01-01 13:05:00      63    8      f
    12   3 2022-01-01 12:00:00      64    6      m
    13   3 2022-01-01 12:02:00      64    6      m
    14   3 2022-01-01 12:22:00      64    6      m
    15   3 2022-01-01 12:30:00      64    6      m
    16   3 2022-01-01 12:42:00      64    6      m
    17   3 2022-01-01 12:45:00      64    6      m
    18   3 2022-01-01 12:55:00      85    6      m
    19   3 2022-01-01 13:05:00      66    6      m

The df.groupby() construct can be replaced by df = df.ffill().bfill() if all missing values, and not only the outer ones, should be filled with neighboring values.
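
The difference between filling every gap and filling only the outer gaps can be seen on a small Series. This is a sketch with made-up values; the names `s`, `filled_all` and `outer_only` are illustrative and not part of the answer.

```python
import numpy as np
import pandas as pd

# One leading gap, one inner gap, one trailing gap
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan], index=list("abcde"))

# Fill every gap, inner ones included
filled_all = s.ffill().bfill()

# Fill only the leading and trailing gaps; the inner gap stays NaN
first, last = s.first_valid_index(), s.last_valid_index()
outer_only = s.copy()
outer_only.loc[:first] = s.loc[first]  # label slices include both ends
outer_only.loc[last:] = s.loc[last]

print(filled_all.tolist())
print(outer_only.tolist())
```

In `filled_all` the inner gap at "c" takes the value of "b", while in `outer_only` it remains missing and could later be dropped.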


An older solution, which is also used in the long answer below, loops over the df.groupby() object instead of calling df.groupby().apply():

    # Fill outer missing values & drop inner missing values
    for gid, gdf in df.groupby(level=1, axis=1):
        tmin, tmax = gdf.dropna().index[[0, -1]]  # get first & last index
        mask = gdf.index.map(lambda x: tmin <= x <= tmax and x not in gdf.dropna().index)
        df.loc[:, pd.IndexSlice[:, gid]] = gdf[~mask].ffill().bfill()

Long Answer

    import pandas as pd
    import seaborn as sns

    # Convert timestamp to datetime
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    # Replace f/m with 0/1 (because pivot_table cannot aggregate type object)
    df["gender"] = df.gender.replace({'f': 0, 'm': 1})
    # Pivot table with timestamp as index
    df = df.pivot_table(index='timestamp', columns='ID')

The DataFrame is pivoted and the columns are a MultiIndex (with "ID" as 2nd level):

    >>> print("Pivoted DataFrame:", df, sep="\n")
    Pivoted DataFrame:
                       VALUE1               age            gender
    ID                      1     2     3     1     2     3     1     2     3
    timestamp
    2022-01-01 12:00:00  50.0   NaN   NaN   7.0   NaN   NaN   1.0   NaN   NaN
    2022-01-01 12:02:00   NaN  61.0   NaN   NaN   8.0   NaN   NaN   0.0   NaN
    2022-01-01 12:22:00   NaN  83.0   NaN   NaN   8.0   NaN   NaN   0.0   NaN
    2022-01-01 12:30:00  80.0   NaN   NaN   7.0   NaN   NaN   1.0   NaN   NaN
    2022-01-01 12:42:00   NaN  63.0   NaN   NaN   8.0   NaN   NaN   0.0   NaN
    2022-01-01 12:45:00   NaN   NaN  64.0   NaN   NaN   6.0   NaN   NaN   1.0
    2022-01-01 12:55:00   NaN   NaN  85.0   NaN   NaN   6.0   NaN   NaN   1.0
    2022-01-01 13:00:00  65.0   NaN   NaN   7.0   NaN   NaN   1.0   NaN   NaN
    2022-01-01 13:05:00   NaN   NaN  66.0   NaN   NaN   6.0   NaN   NaN   1.0
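
To get a feel for the two column levels, the columns of a single ID can be pulled out of a pivoted frame. The sketch below uses a reduced, made-up subset of the sample data; `small`, `wide`, `id1` and `id1_slice` are names invented for the example.

```python
import pandas as pd

small = pd.DataFrame({
    "ID": [1, 1, 2],
    "timestamp": pd.to_datetime(
        ["2022-01-01 12:00", "2022-01-01 12:30", "2022-01-01 12:02"]
    ),
    "VALUE1": [50, 80, 61],
    "age": [7, 7, 8],
})

# Level 0 of the columns holds the value column name, level 1 the ID
wide = small.pivot_table(index="timestamp", columns="ID")
print(wide.columns.tolist())

# Two equivalent ways to select everything that belongs to ID 1
id1 = wide.xs(1, level="ID", axis=1)         # drops the ID level
id1_slice = wide.loc[:, pd.IndexSlice[:, 1]]  # keeps both levels
print(id1)
```

The `pd.IndexSlice[:, gid]` form is the one the loop-based solution relies on when it writes the filled values back per ID.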

The pivoted layout makes it easy to interpolate or fill the missing values if we wanted to fill all missing values and not only the outer ones:

    df = df.ffill().bfill()

To fill only the missing values in the outer time span, we need a few more lines:

    # Iterate over IDs (which are the 2nd level)
    for gid, gdf in df.groupby(level=1, axis=1):
        # Get first and last non-NaN timestamp. Sort in case the index is not sorted.
        tmin, tmax = gdf.dropna().index.sort_values()[[0, -1]]
        # Create a boolean mask which can be negated with "~"
        mask = gdf.index.map(lambda x: tmin <= x <= tmax and x not in gdf.VALUE1.dropna().index)
        # Fill missing values with the neighboring values & drop masked values
        df.loc[:, pd.IndexSlice[:, gid]] = gdf[~mask].ffill().bfill()
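
The masking step can be tried in isolation on one column. The sketch below is a vectorized rewrite of the lambda-based mask for a single made-up Series, not the answer's exact code; `inner_missing` and `result` are illustrative names.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime([
    "2022-01-01 12:00", "2022-01-01 12:02", "2022-01-01 12:30",
    "2022-01-01 13:00", "2022-01-01 13:05",
])
# Measurements of one ID on the shared time axis; NaN means "no measurement"
s = pd.Series([50.0, np.nan, 80.0, 65.0, np.nan], index=idx)

# First and last timestamp that carry a measurement
valid = s.dropna().index
tmin, tmax = valid[0], valid[-1]

# True for timestamps inside [tmin, tmax] that have no measurement
inner_missing = (s.index >= tmin) & (s.index <= tmax) & s.isna().to_numpy()

# Drop the inner gaps, then copy the edge values outwards
result = s[~inner_missing].ffill().bfill()
print(result)
```

Here 12:02 lies inside the span and is dropped, while 13:05 lies outside and receives the last measured value.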

Now the only thing we need to do is to bring the DataFrame back to its original shape:

    df = df.stack().swaplevel(0, 1, axis=0).sort_index(axis=0).astype(int)
    df = df.replace({"gender": {0: 'f', 1: 'm'}}).astype({"gender": "category"})

This gives us the final DataFrame with a MultiIndex, where the timestamp is the 2nd level:

    >>> print("MultiIndex DataFrame:", df, sep="\n")
    MultiIndex DataFrame:
                            VALUE1  age gender
    ID timestamp
    1  2022-01-01 12:00:00      50    7      m
       2022-01-01 12:30:00      80    7      m
       2022-01-01 13:00:00      65    7      m
       2022-01-01 13:05:00      65    7      m
    2  2022-01-01 12:00:00      61    8      f
       2022-01-01 12:02:00      61    8      f
       2022-01-01 12:22:00      83    8      f
       2022-01-01 12:42:00      63    8      f
       2022-01-01 12:45:00      63    8      f
       2022-01-01 12:55:00      63    8      f
       2022-01-01 13:00:00      63    8      f
       2022-01-01 13:05:00      63    8      f
    3  2022-01-01 12:00:00      64    6      m
       2022-01-01 12:02:00      64    6      m
       2022-01-01 12:22:00      64    6      m
       2022-01-01 12:30:00      64    6      m
       2022-01-01 12:42:00      64    6      m
       2022-01-01 12:45:00      64    6      m
       2022-01-01 12:55:00      85    6      m
       2022-01-01 13:05:00      66    6      m

To get rid of the MultiIndex, we can use reset_index():

    df = df.reset_index()

To plot the data, we can use either pandas' built-in plot function or seaborn, which usually produces slightly better-looking plots:

    # df.VALUE1.unstack().T.plot(marker="o", linestyle="-")
    sns.lineplot(data=df, x="timestamp", y="VALUE1",
                 hue="ID", style="gender", palette="tab10")

Posted by huangapple on March 7, 2023, 04:50:50
Please keep the link to this article when reposting: https://go.coder-hub.com/75655718.html