How to fill in missing dates for every id using Python

Question

I have a pandas DataFrame A with the columns id, date_yyyymmdd, amount and hours, shown below. Not every calendar date has a row.

id  date_yyyymmdd   amount  hours
 1       20230101  1428.95     11
 1       20230103  1791.29     13
 2       20230101  2516.84     15
 2       20230105  3046.08      5
 3       20230102  7137.92     11
 3       20230103  1104.35      1
 3       20230104       25      1

I would like to fill in the calendar dates that are missing between two variables, start_date and end_date, and produce another DataFrame B, shown below, with amount and hours set to 0 for the added dates. In this example start_date is 20230101 and end_date is 20230105. I found code that uses the date as the index and fills in missing values, but I don't think it works in my case because I need the dates filled in separately for each id. How can I accomplish this? Thanks.

id  date_yyyymmdd   amount  hours
 1       20230101  1428.95     11
 1       20230102        0      0
 1       20230103  1791.29     13
 1       20230104        0      0
 1       20230105        0      0
 2       20230101  2516.84     15
 2       20230102        0      0
 2       20230103        0      0
 2       20230104        0      0
 2       20230105  3046.08      5
 3       20230101        0      0
 3       20230102  7137.92     11
 3       20230103  1104.35      1
 3       20230104       25      1
 3       20230105        0      0
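
For anyone who wants to run the answers locally, the sample data can be rebuilt as follows. This is only a sketch: the values are copied from the tables above, while holding start_date and end_date as yyyymmdd strings and the helper names all_days and expected_rows are assumptions made for illustration.

```python
import pandas as pd

# Sample data copied from the question; the name A follows the question's wording.
A = pd.DataFrame(
    {
        "id": [1, 1, 2, 2, 3, 3, 3],
        "date_yyyymmdd": [20230101, 20230103, 20230101, 20230105,
                          20230102, 20230103, 20230104],
        "amount": [1428.95, 1791.29, 2516.84, 3046.08, 7137.92, 1104.35, 25.0],
        "hours": [11, 13, 15, 5, 11, 1, 1],
    }
)

# Bounds of the calendar range (assumed here to be yyyymmdd strings).
start_date = "20230101"
end_date = "20230105"

# Every calendar day in the range, both bounds inclusive.
all_days = pd.date_range(
    pd.to_datetime(start_date, format="%Y%m%d"),
    pd.to_datetime(end_date, format="%Y%m%d"),
    freq="D",
)

# B should contain one row per (id, day) pair.
expected_rows = A["id"].nunique() * len(all_days)
```

With three ids and five days, B should therefore have 15 rows, which matches the desired output above.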

Answer 1

Score: 1

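One approach is to build a MultiIndex containing every (id, date) combination and reindex the DataFrame against it. Combinations that were not in the data come back as NaN rows, which are then filled with 0.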

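    import pandas as pd

    cols = ['id', 'date_yyyymmdd']
    start_date = '2023-01-01'
    end_date = '2023-01-05'
    # Parse the integer yyyymmdd column into real datetimes
    df['date_yyyymmdd'] = pd.to_datetime(df['date_yyyymmdd'], format='%Y%m%d')
    # Every (id, day) pair that should exist in the result
    full_index = pd.MultiIndex.from_product(
        [df['id'].unique(), pd.date_range(start_date, end_date, freq='D')], names=cols)
    df = (df.set_index(cols)
            .reindex(full_index)  # missing pairs become NaN rows
            .fillna(0)            # zero-fill amount and hours
            .sort_index()
            .reset_index())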

Output:

        id  date_yyyymmdd   amount  hours
    0    1     2023-01-01  1428.95   11.0
    1    1     2023-01-02     0.00    0.0
    2    1     2023-01-03  1791.29   13.0
    3    1     2023-01-04     0.00    0.0
    4    1     2023-01-05     0.00    0.0
    5    2     2023-01-01  2516.84   15.0
    6    2     2023-01-02     0.00    0.0
    7    2     2023-01-03     0.00    0.0
    8    2     2023-01-04     0.00    0.0
    9    2     2023-01-05  3046.08    5.0
    10   3     2023-01-01     0.00    0.0
    11   3     2023-01-02  7137.92   11.0
    12   3     2023-01-03  1104.35    1.0
    13   3     2023-01-04    25.00    1.0
    14   3     2023-01-05     0.00    0.0
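
The output above holds real datetimes and float hours, whereas the question asks for yyyymmdd values. A possible way to wrap the same MultiIndex idea in a reusable function and convert the dates back to integers is sketched below. The function name fill_missing_dates is hypothetical, and passing fill_value=0 to reindex is a variation on the answer's fillna(0), not part of the original answer.

```python
import pandas as pd


def fill_missing_dates(df, start_date, end_date):
    """Return a copy of df with one zero-filled row per (id, calendar day).

    Hypothetical helper: start_date and end_date are yyyymmdd values
    (int or str), matching the format used in the question.
    """
    cols = ["id", "date_yyyymmdd"]
    out = df.copy()
    out["date_yyyymmdd"] = pd.to_datetime(
        out["date_yyyymmdd"].astype(str), format="%Y%m%d"
    )
    days = pd.date_range(
        pd.to_datetime(str(start_date), format="%Y%m%d"),
        pd.to_datetime(str(end_date), format="%Y%m%d"),
        freq="D",
    )
    # Cartesian product of all ids and all days in the range.
    full_index = pd.MultiIndex.from_product([out["id"].unique(), days], names=cols)
    out = (
        out.set_index(cols)
        .reindex(full_index, fill_value=0)  # only the newly added rows get 0
        .sort_index()
        .reset_index()
    )
    # Convert the dates back to integer yyyymmdd, as in the question.
    out["date_yyyymmdd"] = out["date_yyyymmdd"].dt.strftime("%Y%m%d").astype(int)
    return out
```

Calling B = fill_missing_dates(A, 20230101, 20230105) on the question's data should give 15 rows. Two caveats: reindex raises an error if the same (id, date) pair appears more than once, and fill_value=0 leaves any NaN already present in the data untouched, unlike fillna(0).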

Answer 2

Score: 0

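Try reindexing each id's group against a common date range. Note that this version takes the range from the earliest and latest dates present in the data rather than from start_date and end_date: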

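    import pandas as pd

    df["date_yyyymmdd"] = pd.to_datetime(df["date_yyyymmdd"], format="%Y%m%d")
    # Range spans the earliest to the latest date found in the data
    r = pd.date_range(df["date_yyyymmdd"].min(), df["date_yyyymmdd"].max())
    df = (
        df.groupby("id", group_keys=False)
        .apply(
            # Reindex each group on the full range, then restore the id,
            # which is NaN on the newly created rows. This relies on the
            # "id" column being present inside each group (Python 3.8+ for :=).
            lambda x: (newdf := x.set_index("date_yyyymmdd").reindex(r)).assign(
                id=newdf["id"].ffill().bfill()
            )
        )
        .reset_index()  # the unnamed date index becomes a column called "index"
        .fillna(0)
    )
    df["id"] = df["id"].astype(int)
    print(df)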

Prints:

             index  id   amount  hours
    0   2023-01-01   1  1428.95   11.0
    1   2023-01-02   1     0.00    0.0
    2   2023-01-03   1  1791.29   13.0
    3   2023-01-04   1     0.00    0.0
    4   2023-01-05   1     0.00    0.0
    5   2023-01-01   2  2516.84   15.0
    6   2023-01-02   2     0.00    0.0
    7   2023-01-03   2     0.00    0.0
    8   2023-01-04   2     0.00    0.0
    9   2023-01-05   2  3046.08    5.0
    10  2023-01-01   3     0.00    0.0
    11  2023-01-02   3  7137.92   11.0
    12  2023-01-03   3  1104.35    1.0
    13  2023-01-04   3    25.00    1.0
    14  2023-01-05   3     0.00    0.0
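
Since the question asks for an explicit start_date and end_date, a per-id variant that takes both bounds as arguments might look like the sketch below. It iterates over the groups instead of using groupby.apply, so the id comes from the group key, not from a column inside the group. The function name fill_per_id is hypothetical and not part of the original answer.

```python
import pandas as pd


def fill_per_id(df, start_date, end_date):
    """Reindex each id's rows against an explicit date range (illustrative sketch)."""
    df = df.copy()
    df["date_yyyymmdd"] = pd.to_datetime(
        df["date_yyyymmdd"].astype(str), format="%Y%m%d"
    )
    # Naming the range keeps the column name after reset_index().
    days = pd.date_range(start_date, end_date, freq="D", name="date_yyyymmdd")

    pieces = []
    for id_value, group in df.groupby("id"):
        filled = (
            group.drop(columns="id")
            .set_index("date_yyyymmdd")
            .reindex(days)   # days missing for this id become NaN rows
            .fillna(0)       # zero-fill the value columns
            .reset_index()
        )
        filled.insert(0, "id", id_value)  # put the id back as the first column
        pieces.append(filled)

    return pd.concat(pieces, ignore_index=True)
```

For example, fill_per_id(df, "2023-01-01", "2023-01-05") should return the columns id, date_yyyymmdd, amount and hours, with five rows per id. As with the answer above, hours ends up as float because the intermediate NaN values force an upcast.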

Answer 3

Score: 0

One option is pyjanitor's complete function, which expands a DataFrame to every combination of the given columns:

    # pip install pyjanitor
    import janitor  # importing janitor registers .complete on DataFrames
    import pandas as pd

    df = pd.read_clipboard()  # reads the sample table copied from the question
    df['date_yyyymmdd'] = pd.to_datetime(df['date_yyyymmdd'], format='%Y%m%d')
    # create variable containing all possible dates
    dates = {"date_yyyymmdd": pd.date_range("2023-01-01", "2023-01-05", freq="D")}
    df.complete('id', dates, fill_value=0)

Output:

        id  date_yyyymmdd   amount  hours
    0    1     2023-01-01  1428.95     11
    1    1     2023-01-02     0.00      0
    2    1     2023-01-03  1791.29     13
    3    1     2023-01-04     0.00      0
    4    1     2023-01-05     0.00      0
    5    2     2023-01-01  2516.84     15
    6    2     2023-01-02     0.00      0
    7    2     2023-01-03     0.00      0
    8    2     2023-01-04     0.00      0
    9    2     2023-01-05  3046.08      5
    10   3     2023-01-01     0.00      0
    11   3     2023-01-02  7137.92     11
    12   3     2023-01-03  1104.35      1
    13   3     2023-01-04    25.00      1
    14   3     2023-01-05     0.00      0
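
If installing pyjanitor is not an option, the same "complete the combinations" idea can be approximated in plain pandas with a cross join followed by a left merge. The sketch below assumes pandas 1.2 or later (for how="cross") and the column names from the question. The function name complete_dates is hypothetical, and this is not how pyjanitor implements complete.

```python
import pandas as pd


def complete_dates(df, start_date, end_date):
    """Cross-join ids with all days, then left-merge the data (illustrative sketch)."""
    df = df.copy()
    df["date_yyyymmdd"] = pd.to_datetime(
        df["date_yyyymmdd"].astype(str), format="%Y%m%d"
    )
    ids = df[["id"]].drop_duplicates()
    days = pd.DataFrame(
        {"date_yyyymmdd": pd.date_range(start_date, end_date, freq="D")}
    )
    # Every (id, day) pair: the cross join is the Cartesian product.
    grid = ids.merge(days, how="cross")
    # Attach existing values; pairs without data get NaN.
    out = grid.merge(df, on=["id", "date_yyyymmdd"], how="left")
    value_cols = ["amount", "hours"]
    out[value_cols] = out[value_cols].fillna(0)
    return out.sort_values(["id", "date_yyyymmdd"], ignore_index=True)
```

Unlike the reindex-based answers, the merge does not fail when an (id, date) pair appears more than once in the input; such duplicates are simply carried through to the result.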

huangapple
  • Posted on August 5, 2023 at 02:30:36
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76838402.html