How to fill in missing dates for all ids using Python

Question

I have a pandas DataFrame A with the columns id, date_yyyymmdd, amount, and hours, as shown below. Not every calendar date has a row.

id date_yyyymmdd amount hours
1 20230101 1428.95 11
1 20230103 1791.29 13
2 20230101 2516.84 15
2 20230105 3046.08 5
3 20230102 7137.92 11
3 20230103 1104.35 1
3 20230104 25 1

I would like to fill in the missing calendar dates between two variables, start_date and end_date, and produce another DataFrame B, as shown below, with amount and hours set to 0 for the added dates. In this example, start_date is 20230101 and end_date is 20230105. I found code that uses the date as the index and fills in missing values, but I don't think it will work in my case because I need to fill in the dates for each id. How can I accomplish this? Thanks.

id date_yyyymmdd amount hours
1 20230101 1428.95 11
1 20230102 0 0
1 20230103 1791.29 13
1 20230104 0 0
1 20230105 0 0
2 20230101 2516.84 15
2 20230102 0 0
2 20230103 0 0
2 20230104 0 0
2 20230105 3046.08 5
3 20230101 0 0
3 20230102 7137.92 11
3 20230103 1104.35 1
3 20230104 25 1
3 20230105 0 0
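
For anyone who wants to reproduce the example, here is a minimal sketch that builds the sample data. It assumes date_yyyymmdd is stored as integers (the question does not say whether the values are integers or strings) and names the frame df, which is the name the answers use. The row count at the end is only a sanity check on the size B should have.

```python
import pandas as pd

# Sample frame A from the question. The integer dtype for date_yyyymmdd
# is an assumption; the original data may hold strings instead.
df = pd.DataFrame(
    {
        "id": [1, 1, 2, 2, 3, 3, 3],
        "date_yyyymmdd": [
            20230101, 20230103, 20230101, 20230105,
            20230102, 20230103, 20230104,
        ],
        "amount": [1428.95, 1791.29, 2516.84, 3046.08, 7137.92, 1104.35, 25.0],
        "hours": [11, 13, 15, 5, 11, 1, 1],
    }
)

start_date, end_date = "20230101", "20230105"

# B should hold one row per (id, calendar day) pair in the range.
n_days = len(pd.date_range(start_date, end_date, freq="D"))
expected_rows = df["id"].nunique() * n_days
print(expected_rows)
```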

Answer 1

Score: 1

Here is one way: construct a new MultiIndex containing every (id, date) combination and use it to reindex your df.

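import pandas as pd

cols = ['id', 'date_yyyymmdd']

start_date = '1/1/2023'
end_date = '1/5/2023'

# Parse the yyyymmdd values into real datetimes so they match pd.date_range
df['date_yyyymmdd'] = pd.to_datetime(df['date_yyyymmdd'], format='%Y%m%d')

# Every (id, date) pair that should exist in the result
full_index = pd.MultiIndex.from_product(
    [df['id'].unique(), pd.date_range(start_date, end_date, freq='D')],
    names=cols,
)

df = (df.set_index(cols)
      .reindex(full_index)
      .fillna(0)       # rows that were missing get amount = 0 and hours = 0
      .sort_index()
      .reset_index())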

Output:

    id date_yyyymmdd   amount  hours
0    1    2023-01-01  1428.95   11.0
1    1    2023-01-02     0.00    0.0
2    1    2023-01-03  1791.29   13.0
3    1    2023-01-04     0.00    0.0
4    1    2023-01-05     0.00    0.0
5    2    2023-01-01  2516.84   15.0
6    2    2023-01-02     0.00    0.0
7    2    2023-01-03     0.00    0.0
8    2    2023-01-04     0.00    0.0
9    2    2023-01-05  3046.08    5.0
10   3    2023-01-01     0.00    0.0
11   3    2023-01-02  7137.92   11.0
12   3    2023-01-03  1104.35    1.0
13   3    2023-01-04    25.00    1.0
14   3    2023-01-05     0.00    0.0
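
The output above holds real datetimes and float hours, while B in the question shows yyyymmdd values and whole-number hours. If that exact layout matters, the same reindex approach can be followed by two conversions. This is only a sketch: the names full_index and B are mine, and it assumes the original dates were integers.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 1, 2, 2, 3, 3, 3],
        "date_yyyymmdd": [
            20230101, 20230103, 20230101, 20230105,
            20230102, 20230103, 20230104,
        ],
        "amount": [1428.95, 1791.29, 2516.84, 3046.08, 7137.92, 1104.35, 25.0],
        "hours": [11, 13, 15, 5, 11, 1, 1],
    }
)

cols = ["id", "date_yyyymmdd"]
start_date, end_date = "2023-01-01", "2023-01-05"

df["date_yyyymmdd"] = pd.to_datetime(df["date_yyyymmdd"], format="%Y%m%d")

# Every (id, date) pair that should appear in the result.
full_index = pd.MultiIndex.from_product(
    [df["id"].unique(), pd.date_range(start_date, end_date, freq="D")],
    names=cols,
)

B = (
    df.set_index(cols)
    .reindex(full_index)
    .fillna(0)
    .sort_index()
    .reset_index()
    # fillna leaves hours as float because NaN forced an upcast; cast it back.
    .astype({"hours": "int64"})
)

# Turn the datetimes back into yyyymmdd integers, as in the question.
B["date_yyyymmdd"] = B["date_yyyymmdd"].dt.strftime("%Y%m%d").astype(int)
print(B)
```

If the dates were strings to begin with, dropping the final .astype(int) keeps them as strings.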

Answer 2

Score: 0

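Try the following. It reindexes each id group onto the full daily range between the earliest and latest dates in the data (not explicit start_date and end_date values), and the date column comes back named index after reset_index(). The := operator requires Python 3.8 or later: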

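import pandas as pd

df["date_yyyymmdd"] = pd.to_datetime(df["date_yyyymmdd"], format="%Y%m%d")
# Full daily range from the earliest to the latest date present in the data
r = pd.date_range(df["date_yyyymmdd"].min(), df["date_yyyymmdd"].max())

df = (
    df.groupby("id", group_keys=False)
    .apply(
        # Reindex each group's rows onto the full range, then restore the id
        # on the newly added rows, where it is NaN
        lambda x: (newdf := x.set_index("date_yyyymmdd").reindex(r)).assign(
            id=newdf["id"].ffill().bfill()
        )
    )
    .reset_index()
    .fillna(0)
)
# The NaN rows turned id into float; cast it back to int
df["id"] = df["id"].astype(int)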

print(df)

Prints:

        index  id   amount  hours
0  2023-01-01   1  1428.95   11.0
1  2023-01-02   1     0.00    0.0
2  2023-01-03   1  1791.29   13.0
3  2023-01-04   1     0.00    0.0
4  2023-01-05   1     0.00    0.0
5  2023-01-01   2  2516.84   15.0
6  2023-01-02   2     0.00    0.0
7  2023-01-03   2     0.00    0.0
8  2023-01-04   2     0.00    0.0
9  2023-01-05   2  3046.08    5.0
10 2023-01-01   3     0.00    0.0
11 2023-01-02   3  7137.92   11.0
12 2023-01-03   3  1104.35    1.0
13 2023-01-04   3    25.00    1.0
14 2023-01-05   3     0.00    0.0
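
A variation on the same per-group idea, sketched here under a few assumptions of my own: it takes explicit start_date and end_date values as the question asks, keeps the date_yyyymmdd column name, and loops over the groups instead of calling groupby.apply, since recent pandas releases deprecate passing the grouping column into apply. The names r, parts, and B are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 1, 2, 2, 3, 3, 3],
        "date_yyyymmdd": [
            20230101, 20230103, 20230101, 20230105,
            20230102, 20230103, 20230104,
        ],
        "amount": [1428.95, 1791.29, 2516.84, 3046.08, 7137.92, 1104.35, 25.0],
        "hours": [11, 13, 15, 5, 11, 1, 1],
    }
)
df["date_yyyymmdd"] = pd.to_datetime(df["date_yyyymmdd"], format="%Y%m%d")

start_date, end_date = "2023-01-01", "2023-01-05"
r = pd.date_range(start_date, end_date, freq="D")

parts = []
for key, group in df.groupby("id"):
    filled = (
        group.set_index("date_yyyymmdd")[["amount", "hours"]]
        .reindex(r, fill_value=0)      # added dates get 0 in both columns
        .rename_axis("date_yyyymmdd")  # keep the original column name
        .assign(id=key)                # put the group's id on every row
    )
    parts.append(filled)

B = pd.concat(parts).reset_index()[["id", "date_yyyymmdd", "amount", "hours"]]
print(B)
```

Because the id is assigned from the group key, no ffill/bfill step or cast back to int is needed.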

Answer 3

Score: 0

One option is with pyjanitor's complete function:

# pip install pyjanitor
import janitor  # registers the DataFrame.complete method
import pandas as pd

df = pd.read_clipboard()  # reads the table copied from the question
# format='ISO8601' requires pandas 2.0 or later
df['date_yyyymmdd'] = pd.to_datetime(df['date_yyyymmdd'], format='ISO8601')
# create variable containing all possible dates
dates = {"date_yyyymmdd": pd.date_range("2023-01-01", "2023-01-05", freq="D")}
df.complete('id', dates, fill_value=0)

Output:

    id date_yyyymmdd   amount  hours
0    1    2023-01-01  1428.95     11
1    1    2023-01-02     0.00      0
2    1    2023-01-03  1791.29     13
3    1    2023-01-04     0.00      0
4    1    2023-01-05     0.00      0
5    2    2023-01-01  2516.84     15
6    2    2023-01-02     0.00      0
7    2    2023-01-03     0.00      0
8    2    2023-01-04     0.00      0
9    2    2023-01-05  3046.08      5
10   3    2023-01-01     0.00      0
11   3    2023-01-02  7137.92     11
12   3    2023-01-03  1104.35      1
13   3    2023-01-04    25.00      1
14   3    2023-01-05     0.00      0
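
If adding a dependency is not an option, a similar "complete the grid" result can probably be had with plain pandas by cross-joining the ids with the dates and left-merging the data onto that grid. This is a sketch of an alternative technique, not what complete does internally; the names ids, dates, grid, and B are mine, and how="cross" needs pandas 1.2 or later.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 1, 2, 2, 3, 3, 3],
        "date_yyyymmdd": [
            20230101, 20230103, 20230101, 20230105,
            20230102, 20230103, 20230104,
        ],
        "amount": [1428.95, 1791.29, 2516.84, 3046.08, 7137.92, 1104.35, 25.0],
        "hours": [11, 13, 15, 5, 11, 1, 1],
    }
)
df["date_yyyymmdd"] = pd.to_datetime(df["date_yyyymmdd"], format="%Y%m%d")

# One row per id and one row per calendar day in the range.
ids = df[["id"]].drop_duplicates()
dates = pd.DataFrame(
    {"date_yyyymmdd": pd.date_range("2023-01-01", "2023-01-05", freq="D")}
)

# Cross join: every id paired with every date.
grid = ids.merge(dates, how="cross")

# Left-merge the real data onto the grid, then zero-fill the gaps.
B = (
    grid.merge(df, on=["id", "date_yyyymmdd"], how="left")
    .fillna({"amount": 0, "hours": 0})
    .astype({"hours": "int64"})
)
print(B)
```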

huangapple
  • Posted on August 5, 2023 at 02:30:36
  • If you repost this article, please keep the link to it: https://go.coder-hub.com/76838402.html