How to fill in missing dates using python for all ids
Question
I have a pandas dataframe A with ID, date_yyyymmdd, amount and hours as shown below. Not all calendar dates are populated.
id | date_yyyymmdd | amount | hours |
---|---|---|---|
1 | 20230101 | 1428.95 | 11 |
1 | 20230103 | 1791.29 | 13 |
2 | 20230101 | 2516.84 | 15 |
2 | 20230105 | 3046.08 | 5 |
3 | 20230102 | 7137.92 | 11 |
3 | 20230103 | 1104.35 | 1 |
3 | 20230104 | 25 | 1 |
I would like to fill in the missing calendar dates between two variables, start_date and end_date, and produce another dataframe B as shown below, with amount and hours set to 0 for those dates. In the example below the start date is 20230101 and the end date is 20230105. I found some code that uses the date as the index and fills in missing values, but I don't think it will work in my case because I want to fill in the dates for each id. How can I accomplish this? Thanks.
id | date_yyyymmdd | amount | hours |
---|---|---|---|
1 | 20230101 | 1428.95 | 11 |
1 | 20230102 | 0 | 0 |
1 | 20230103 | 1791.29 | 13 |
1 | 20230104 | 0 | 0 |
1 | 20230105 | 0 | 0 |
2 | 20230101 | 2516.84 | 15 |
2 | 20230102 | 0 | 0 |
2 | 20230103 | 0 | 0 |
2 | 20230104 | 0 | 0 |
2 | 20230105 | 3046.08 | 5 |
3 | 20230101 | 0 | 0 |
3 | 20230102 | 7137.92 | 11 |
3 | 20230103 | 1104.35 | 1 |
3 | 20230104 | 25 | 1 |
3 | 20230105 | 0 | 0 |
Answer 1
Score: 1
Here is one way: construct a new MultiIndex and use it to reindex your df.
import pandas as pd

cols = ['id', 'date_yyyymmdd']
start_date = '1/1/2023'
end_date = '1/5/2023'

# Parse the yyyymmdd values into real datetimes
df['date_yyyymmdd'] = pd.to_datetime(df['date_yyyymmdd'], format='%Y%m%d')

# Every (id, day) pair between start_date and end_date
full_index = pd.MultiIndex.from_product(
    [df['id'].unique(), pd.date_range(start_date, end_date, freq='D')],
    names=cols)

df = (df.set_index(cols)
        .reindex(full_index)
        .fillna(0)
        .sort_index()
        .reset_index())
Output:
id date_yyyymmdd amount hours
0 1 2023-01-01 1428.95 11.0
1 1 2023-01-02 0.00 0.0
2 1 2023-01-03 1791.29 13.0
3 1 2023-01-04 0.00 0.0
4 1 2023-01-05 0.00 0.0
5 2 2023-01-01 2516.84 15.0
6 2 2023-01-02 0.00 0.0
7 2 2023-01-03 0.00 0.0
8 2 2023-01-04 0.00 0.0
9 2 2023-01-05 3046.08 5.0
10 3 2023-01-01 0.00 0.0
11 3 2023-01-02 7137.92 11.0
12 3 2023-01-03 1104.35 1.0
13 3 2023-01-04 25.00 1.0
14 3 2023-01-05 0.00 0.0
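The output above differs from the requested frame B in two small ways: the dates are datetimes rather than yyyymmdd values, and hours has become a float. The following is only a sketch of how the same MultiIndex idea could be wrapped to get closer to B. The helper name `fill_missing_dates` is hypothetical, and the frame `A` is rebuilt from the question's sample data. It assumes each (id, date) pair appears at most once, since `reindex` raises on duplicate index labels. Passing `fill_value=0` to `reindex` only fills the newly created rows, which should let the integer hours column keep its dtype.

```python
import pandas as pd

# Sample frame A, rebuilt from the table in the question.
A = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3, 3],
    "date_yyyymmdd": [20230101, 20230103, 20230101, 20230105,
                      20230102, 20230103, 20230104],
    "amount": [1428.95, 1791.29, 2516.84, 3046.08, 7137.92, 1104.35, 25.0],
    "hours": [11, 13, 15, 5, 11, 1, 1],
})


def fill_missing_dates(df, start_date, end_date):
    """Hypothetical helper: one row per (id, calendar day), zeros where absent."""
    cols = ["id", "date_yyyymmdd"]
    out = df.copy()
    # Convert via str so both integer and string yyyymmdd inputs are handled.
    out["date_yyyymmdd"] = pd.to_datetime(
        out["date_yyyymmdd"].astype(str), format="%Y%m%d"
    )
    full_index = pd.MultiIndex.from_product(
        [out["id"].unique(), pd.date_range(start_date, end_date, freq="D")],
        names=cols,
    )
    out = (out.set_index(cols)
              .reindex(full_index, fill_value=0)  # fills only the new rows
              .sort_index()
              .reset_index())
    # Convert the dates back to yyyymmdd integers, as in the question.
    out["date_yyyymmdd"] = out["date_yyyymmdd"].dt.strftime("%Y%m%d").astype(int)
    return out


B = fill_missing_dates(A, "2023-01-01", "2023-01-05")
print(B)
```

If the original frame can already contain NaN values in amount or hours, those are left untouched by `fill_value`; a `fillna(0)` afterwards, as in the answer above, would cover them.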
Answer 2
Score: 0
Try:
import pandas as pd

df["date_yyyymmdd"] = pd.to_datetime(df["date_yyyymmdd"], format="%Y%m%d")
# Full daily range spanned by the data
r = pd.date_range(df["date_yyyymmdd"].min(), df["date_yyyymmdd"].max())

df = (
    df.groupby("id", group_keys=False)
    .apply(
        # Reindex each id's rows onto the full range, then restore the id on the new rows
        lambda x: (newdf := x.set_index("date_yyyymmdd").reindex(r)).assign(
            id=newdf["id"].ffill().bfill()
        )
    )
    .reset_index()
    .fillna(0)
)
df["id"] = df["id"].astype(int)
print(df)
Prints:
index id amount hours
0 2023-01-01 1 1428.95 11.0
1 2023-01-02 1 0.00 0.0
2 2023-01-03 1 1791.29 13.0
3 2023-01-04 1 0.00 0.0
4 2023-01-05 1 0.00 0.0
5 2023-01-01 2 2516.84 15.0
6 2023-01-02 2 0.00 0.0
7 2023-01-03 2 0.00 0.0
8 2023-01-04 2 0.00 0.0
9 2023-01-05 2 3046.08 5.0
10 2023-01-01 3 0.00 0.0
11 2023-01-02 3 7137.92 11.0
12 2023-01-03 3 1104.35 1.0
13 2023-01-04 3 25.00 1.0
14 2023-01-05 3 0.00 0.0
Answer 3
Score: 0
One option is with pyjanitor's complete function:
# pip install pyjanitor
import janitor
import pandas as pd
df = pd.read_clipboard()
df['date_yyyymmdd'] = pd.to_datetime(df['date_yyyymmdd'],format = 'ISO8601')
# create variable containing all possible dates
dates = {"date_yyyymmdd": pd.date_range("2023-01-01", "2023-01-05", freq="D")}
df.complete('id', dates, fill_value=0)
id date_yyyymmdd amount hours
0 1 2023-01-01 1428.95 11
1 1 2023-01-02 0.00 0
2 1 2023-01-03 1791.29 13
3 1 2023-01-04 0.00 0
4 1 2023-01-05 0.00 0
5 2 2023-01-01 2516.84 15
6 2 2023-01-02 0.00 0
7 2 2023-01-03 0.00 0
8 2 2023-01-04 0.00 0
9 2 2023-01-05 3046.08 5
10 3 2023-01-01 0.00 0
11 3 2023-01-02 7137.92 11
12 3 2023-01-03 1104.35 1
13 3 2023-01-04 25.00 1
14 3 2023-01-05 0.00 0