August 5, 2023 02:30:36

How to fill in missing dates using python for all ids

Question

I have a pandas dataframe A with ID, date_yyyymmdd, amount and hours as shown below. Not all calendar dates are populated.

id	date_yyyymmdd	amount	hours
1	20230101	1428.95	11
1	20230103	1791.29	13
2	20230101	2516.84	15
2	20230105	3046.08	5
3	20230102	7137.92	11
3	20230103	1104.35	1
3	20230104	25	1

I would like to fill in the missing calendar dates between two variables, start_date and end_date, and produce another dataframe B, shown below, with amount and hours set to 0 for those dates. In the example below the start date is 20230101 and the end date is 20230105. I found some code that uses the date as the index and fills in missing values, but I don't think it will work in my case because I want to fill in the dates for each id. How can I accomplish this? Thanks.

id	date_yyyymmdd	amount	hours
1	20230101	1428.95	11
1	20230102	0	0
1	20230103	1791.29	13
1	20230104	0	0
1	20230105	0	0
2	20230101	2516.84	15
2	20230102	0	0
2	20230103	0	0
2	20230104	0	0
2	20230105	3046.08	5
3	20230101	0	0
3	20230102	7137.92	11
3	20230103	1104.35	1
3	20230104	25	1
3	20230105	0	0
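
For anyone who wants to reproduce this, the sample frame A can be built directly instead of copied from the table. This is only a sketch: it assumes date_yyyymmdd is stored as integers (the question does not say whether the column holds integers or strings) and that start_date and end_date are yyyymmdd strings.

```python
import pandas as pd

# Sample data copied from the question's first table.
# Assumption: date_yyyymmdd holds integers such as 20230101.
A = pd.DataFrame(
    {
        "id": [1, 1, 2, 2, 3, 3, 3],
        "date_yyyymmdd": [
            20230101, 20230103,
            20230101, 20230105,
            20230102, 20230103, 20230104,
        ],
        "amount": [1428.95, 1791.29, 2516.84, 3046.08, 7137.92, 1104.35, 25.0],
        "hours": [11, 13, 15, 5, 11, 1, 1],
    }
)

# Bounds of the calendar range to fill (assumed to be yyyymmdd strings).
start_date = "20230101"
end_date = "20230105"
```

The answers below refer to this frame as df.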

Answer 1

Score: 1

Here is one way: construct a new MultiIndex and use it to reindex your df.

import pandas as pd

cols = ['id', 'date_yyyymmdd']
start_date = '1/1/2023'
end_date = '1/5/2023'
df['date_yyyymmdd'] = pd.to_datetime(df['date_yyyymmdd'], format='%Y%m%d')
# Build every (id, date) pair, reindex onto it, and zero-fill the new rows
df = (df.set_index(cols)
      .reindex(pd.MultiIndex.from_product(
          [df['id'].unique(), pd.date_range(start_date, end_date, freq='D')],
          names=cols))
      .fillna(0)
      .sort_index()
      .reset_index())

Output:

    id date_yyyymmdd   amount  hours
0    1    2023-01-01  1428.95   11.0
1    1    2023-01-02     0.00    0.0
2    1    2023-01-03  1791.29   13.0
3    1    2023-01-04     0.00    0.0
4    1    2023-01-05     0.00    0.0
5    2    2023-01-01  2516.84   15.0
6    2    2023-01-02     0.00    0.0
7    2    2023-01-03     0.00    0.0
8    2    2023-01-04     0.00    0.0
9    2    2023-01-05  3046.08    5.0
10   3    2023-01-01     0.00    0.0
11   3    2023-01-02  7137.92   11.0
12   3    2023-01-03  1104.35    1.0
13   3    2023-01-04    25.00    1.0
14   3    2023-01-05     0.00    0.0
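
The output above has datetime values and float hours, whereas the requested frame B keeps yyyymmdd values and integer hours. Below is a minimal sketch of the same reindex approach wrapped in a function that converts the dates back. The function name fill_missing_dates is made up for illustration, and the sketch assumes each (id, date) pair occurs at most once, since reindex does not accept duplicate index entries. It passes fill_value=0 to reindex instead of calling fillna(0) afterwards, so only the newly created rows are filled and any NaN already in the data is left alone.

```python
import pandas as pd


def fill_missing_dates(df, start_date, end_date):
    """Return one row per (id, calendar day), zero-filling days absent from df.

    Hypothetical helper: start_date and end_date are yyyymmdd values
    (strings or integers), matching the question's example.
    """
    cols = ["id", "date_yyyymmdd"]
    out = df.copy()
    # astype(str) lets one format string handle integer or string input.
    out["date_yyyymmdd"] = pd.to_datetime(
        out["date_yyyymmdd"].astype(str), format="%Y%m%d"
    )
    days = pd.date_range(
        pd.to_datetime(str(start_date), format="%Y%m%d"),
        pd.to_datetime(str(end_date), format="%Y%m%d"),
        freq="D",
    )
    full_index = pd.MultiIndex.from_product([out["id"].unique(), days], names=cols)
    out = (
        out.set_index(cols)
        .reindex(full_index, fill_value=0)  # only newly created rows get 0
        .sort_index()
        .reset_index()
    )
    # Convert back to integer yyyymmdd so the result looks like the requested B.
    out["date_yyyymmdd"] = out["date_yyyymmdd"].dt.strftime("%Y%m%d").astype(int)
    return out


# Usage with the question's sample data.
A = pd.DataFrame(
    {
        "id": [1, 1, 2, 2, 3, 3, 3],
        "date_yyyymmdd": [20230101, 20230103, 20230101, 20230105,
                          20230102, 20230103, 20230104],
        "amount": [1428.95, 1791.29, 2516.84, 3046.08, 7137.92, 1104.35, 25.0],
        "hours": [11, 13, 15, 5, 11, 1, 1],
    }
)
B = fill_missing_dates(A, "20230101", "20230105")
print(B)
```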

Answer 2

Score: 0

Try:

import pandas as pd

df["date_yyyymmdd"] = pd.to_datetime(df["date_yyyymmdd"], format="%Y%m%d")
# Date range spanning the earliest to the latest date present in the data
r = pd.date_range(df["date_yyyymmdd"].min(), df["date_yyyymmdd"].max())
df = (
    df.groupby("id", group_keys=False)
    .apply(
        lambda x: (newdf := x.set_index("date_yyyymmdd").reindex(r)).assign(
            id=newdf["id"].ffill().bfill()
        )
    )
    .reset_index()
    .fillna(0)
)
df["id"] = df["id"].astype(int)
print(df)

Prints:

        index  id   amount  hours
0  2023-01-01   1  1428.95   11.0
1  2023-01-02   1     0.00    0.0
2  2023-01-03   1  1791.29   13.0
3  2023-01-04   1     0.00    0.0
4  2023-01-05   1     0.00    0.0
5  2023-01-01   2  2516.84   15.0
6  2023-01-02   2     0.00    0.0
7  2023-01-03   2     0.00    0.0
8  2023-01-04   2     0.00    0.0
9  2023-01-05   2  3046.08    5.0
10 2023-01-01   3     0.00    0.0
11 2023-01-02   3  7137.92   11.0
12 2023-01-03   3  1104.35    1.0
13 2023-01-04   3    25.00    1.0
14 2023-01-05   3     0.00    0.0
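
Note that the range r above is taken from the earliest and latest dates present in the data, so it only matches start_date and end_date when some id has a row on each of those days. The sketch below is a variant rather than the answer's own code: it passes the bounds explicitly, names the date column in the result, and uses a plain loop over the groups in place of groupby.apply, whose treatment of the grouping column has varied across pandas versions. The variable names pieces, filled and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 1, 2, 2, 3, 3, 3],
        "date_yyyymmdd": [20230101, 20230103, 20230101, 20230105,
                          20230102, 20230103, 20230104],
        "amount": [1428.95, 1791.29, 2516.84, 3046.08, 7137.92, 1104.35, 25.0],
        "hours": [11, 13, 15, 5, 11, 1, 1],
    }
)
df["date_yyyymmdd"] = pd.to_datetime(df["date_yyyymmdd"].astype(str), format="%Y%m%d")

# Explicit bounds instead of the min/max dates observed in the data.
r = pd.date_range("2023-01-01", "2023-01-05", freq="D")

pieces = []
for id_value, group in df.groupby("id"):
    filled = (
        group.drop(columns="id")
        .set_index("date_yyyymmdd")
        .reindex(r, fill_value=0)      # add the missing days for this id
        .rename_axis("date_yyyymmdd")  # keep the column name after reset_index
        .reset_index()
    )
    filled.insert(0, "id", id_value)   # restore the id on every row
    pieces.append(filled)

result = pd.concat(pieces, ignore_index=True)
print(result)
```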

Answer 3

Score: 0

One option is with pyjanitor's complete function:

# pip install pyjanitor
import janitor  # importing janitor registers the .complete method on DataFrame
import pandas as pd
df = pd.read_clipboard()
# format='ISO8601' requires pandas >= 2.0
df['date_yyyymmdd'] = pd.to_datetime(df['date_yyyymmdd'], format='ISO8601')
# create variable containing all possible dates
dates = {"date_yyyymmdd": pd.date_range("2023-01-01", "2023-01-05", freq="D")}
df.complete('id', dates, fill_value=0)

Output:

    id date_yyyymmdd   amount  hours
0    1    2023-01-01  1428.95     11
1    1    2023-01-02     0.00      0
2    1    2023-01-03  1791.29     13
3    1    2023-01-04     0.00      0
4    1    2023-01-05     0.00      0
5    2    2023-01-01  2516.84     15
6    2    2023-01-02     0.00      0
7    2    2023-01-03     0.00      0
8    2    2023-01-04     0.00      0
9    2    2023-01-05  3046.08      5
10   3    2023-01-01     0.00      0
11   3    2023-01-02  7137.92     11
12   3    2023-01-03  1104.35      1
13   3    2023-01-04    25.00      1
14   3    2023-01-05     0.00      0
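
If installing pyjanitor is not an option, a similar result can be obtained with pandas alone. The sketch below is not pyjanitor's implementation. It builds the full id-by-date grid with a cross join (merge(how="cross"), available since pandas 1.2) and left-joins the original rows onto it. The names ids, dates, grid and completed are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 1, 2, 2, 3, 3, 3],
        "date_yyyymmdd": [20230101, 20230103, 20230101, 20230105,
                          20230102, 20230103, 20230104],
        "amount": [1428.95, 1791.29, 2516.84, 3046.08, 7137.92, 1104.35, 25.0],
        "hours": [11, 13, 15, 5, 11, 1, 1],
    }
)
df["date_yyyymmdd"] = pd.to_datetime(df["date_yyyymmdd"].astype(str), format="%Y%m%d")

# One row per distinct id, and one row per calendar day in the range.
ids = df[["id"]].drop_duplicates()
dates = pd.DataFrame(
    {
        # Cast to the column's dtype so both merge keys share one datetime unit.
        "date_yyyymmdd": pd.date_range("2023-01-01", "2023-01-05", freq="D")
        .astype(df["date_yyyymmdd"].dtype)
    }
)

# Cross join gives every (id, date) pair; the left join brings in known values.
grid = ids.merge(dates, how="cross")
completed = (
    grid.merge(df, on=["id", "date_yyyymmdd"], how="left")
    .fillna({"amount": 0, "hours": 0})
    .sort_values(["id", "date_yyyymmdd"], ignore_index=True)
)
# The left join introduces NaN, which turns hours into floats; cast it back.
completed["hours"] = completed["hours"].astype(int)
print(completed)
```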
