重采样数据框以添加缺失的日期。

huangapple go评论101阅读模式
英文:

Resample dataframe to add missing dates

问题

我有一个包含多个字符串列、一个日期列和一个整数值列的数据框。

我想要为每组文本列“ffill”缺失的日期。缺失的日期是数据框中最小日期到最大日期的所有日期。我认为使用示例更容易理解。

示例输入:

  1. group rtype location hardware date value
  2. my-group type-s NY DTop 2020-08-05 10
  3. my-group type-s NY DTop 2020-08-07 20
  4. my-group type-s NY DTop 2020-08-10 30
  5. my-group type-s NY Tower 2020-08-01 40
  6. my-group type-s NY Tower 2020-08-07 50
  7. ot-group type-t NY LTop 2020-08-08 90

此数据框的最小和最大日期:(起始日期)2020-08-01 -(结束日期)2020-08-10

示例输出:

  1. group rtype location hardware date value
  2. my-group type-s NY DTop 2020-08-01 0
  3. my-group type-s NY DTop 2020-08-02 0
  4. my-group type-s NY DTop 2020-08-03 0
  5. my-group type-s NY DTop 2020-08-04 0
  6. my-group type-s NY DTop 2020-08-05 10
  7. my-group type-s NY DTop 2020-08-06 10
  8. my-group type-s NY DTop 2020-08-07 20
  9. my-group type-s NY DTop 2020-08-08 20
  10. my-group type-s NY DTop 2020-08-09 20
  11. my-group type-s NY DTop 2020-08-10 30
  12. my-group type-s NY Tower 2020-08-01 40
  13. my-group type-s NY Tower 2020-08-02 40
  14. my-group type-s NY Tower 2020-08-03 40
  15. my-group type-s NY Tower 2020-08-04 40
  16. my-group type-s NY Tower 2020-08-05 40
  17. my-group type-s NY Tower 2020-08-06 40
  18. my-group type-s NY Tower 2020-08-07 50
  19. my-group type-s NY Tower 2020-08-08 50
  20. my-group type-s NY Tower 2020-08-09 50
  21. my-group type-s NY Tower 2020-08-10 50
  22. ot-group type-t NY LTop 2020-08-01 0
  23. ot-group type-t NY LTop 2020-08-02 0
  24. ot-group type-t NY LTop 2020-08-03 0
  25. ot-group type-t NY LTop 2020-08-04 0
  26. ot-group type-t NY LTop 2020-08-05 0
  27. ot-group type-t NY LTop 2020-08-06 0
  28. ot-group type-t NY LTop 2020-08-07 0
  29. ot-group type-t NY LTop 2020-08-08 90
  30. ot-group type-t NY LTop 2020-08-09 90
  31. ot-group type-t NY LTop 2020-08-10 90

在这个示例中,我将位置保持不变,以避免额外的长输出。

我可以使用pd.date_range()来获取所需的日期。

我尝试使用多索引的resample,但遇到了错误(类似于此问题)。

我尝试了此答案中提到的方法,但似乎不起作用。

以下是我使用的代码:

  1. import pandas as pd
  2. df = pd.read_csv('data.csv')
  3. df.set_index('date', inplace=True)
  4. date_range = pd.date_range(df.index.min(), df.index.max(), freq='D')
  5. print(len(date_range), date_range)
  6. def reindex_by_date(df):
  7. return df.reindex(date_range).ffill()
  8. df = df.groupby(['group','rtype','location','hardware']).apply(reindex_by_date).reset_index([0,1,2,3], drop=True)
  9. print(df.to_string())

此代码的输出:

  1. 10 DatetimeIndex(['2020-08-01', '2020-08-02', '2020-08-03', '2020-08-04',
  2. '2020-08-05', '2020-08-06', '2020-08-07', '2020-08-08',
  3. '2020-08-09', '2020-08-10'],
  4. dtype='datetime64[ns]', freq='D')
  5. group rtype location hardware value
  6. 2020-08-01 NaN NaN NaN NaN NaN
  7. 2020-08-02 NaN NaN NaN NaN NaN
  8. 2020-08-03 NaN NaN NaN NaN NaN
  9. 2020-08-04 NaN NaN NaN NaN NaN
  10. 2020-08-05 NaN NaN NaN NaN NaN
  11. 2020-08-06 NaN NaN NaN NaN NaN
  12. 2020-08-07 NaN NaN NaN NaN NaN
  13. 2020-08-08 NaN NaN NaN NaN NaN
  14. 2020-08-09 NaN NaN NaN NaN NaN
  15. 2020-08-10 NaN NaN NaN NaN NaN
  16. 2020-08-01 NaN NaN NaN NaN NaN
  17. 2020-08-02 NaN NaN NaN NaN NaN
  18. 2020-08-03 NaN NaN NaN NaN NaN
  19. 2020-08-04 NaN NaN
  20. <details>
  21. <summary>英文:</summary>
  22. I have a dataframe with multiple string columns, one date column and one int value column.
  23. I want to `ffill` the missing dates for each group of text columns. The missing dates are all dates from the min date to max date in the dataframe. I think this is better explained using an example.
  24. Sample Input:

group rtype location hardware date value
my-group type-s NY DTop 2020-08-05 10
my-group type-s NY DTop 2020-08-07 20
my-group type-s NY DTop 2020-08-10 30
my-group type-s NY Tower 2020-08-01 40
my-group type-s NY Tower 2020-08-07 50
ot-group type-t NY LTop 2020-08-08 90

  1. Min and Max date for this dataframe: (start_date) 2020-08-01 - (end_date) 2020-08-10
  2. Sample Output:

group rtype location hardware date value
my-group type-s NY DTop 2020-08-01 0
my-group type-s NY DTop 2020-08-02 0
my-group type-s NY DTop 2020-08-03 0
my-group type-s NY DTop 2020-08-04 0
my-group type-s NY DTop 2020-08-05 10
my-group type-s NY DTop 2020-08-06 10
my-group type-s NY DTop 2020-08-07 20
my-group type-s NY DTop 2020-08-08 20
my-group type-s NY DTop 2020-08-09 20
my-group type-s NY DTop 2020-08-10 30
my-group type-s NY Tower 2020-08-01 40
my-group type-s NY Tower 2020-08-02 40
my-group type-s NY Tower 2020-08-03 40
my-group type-s NY Tower 2020-08-04 40
my-group type-s NY Tower 2020-08-05 40
my-group type-s NY Tower 2020-08-06 40
my-group type-s NY Tower 2020-08-07 50
my-group type-s NY Tower 2020-08-08 50
my-group type-s NY Tower 2020-08-09 50
my-group type-s NY Tower 2020-08-10 50
ot-group type-t NY LTop 2020-08-01 0
ot-group type-t NY LTop 2020-08-02 0
ot-group type-t NY LTop 2020-08-03 0
ot-group type-t NY LTop 2020-08-04 0
ot-group type-t NY LTop 2020-08-05 0
ot-group type-t NY LTop 2020-08-06 0
ot-group type-t NY LTop 2020-08-07 0
ot-group type-t NY LTop 2020-08-08 90
ot-group type-t NY LTop 2020-08-09 90
ot-group type-t NY LTop 2020-08-10 90

  1. In this example, I kept the location fixed to avoid an extra long output.
  2. I am able to get the dates I want using `pd.date_range()`.
  3. I tried using `resample` with multiindex but I run into errors (similar to [this](https://stackoverflow.com/questions/15799162/resampling-within-a-pandas-multiindex)).
  4. I tried the approach mentioned in [this answer](https://stackoverflow.com/a/32275705/1872234) but it doesn&#39;t seem to work:
  5. My code using:

import pandas as pd

df = pd.read_csv('data.csv')
df.set_index('date', inplace=True)
date_range = pd.date_range(df.index.min(), df.index.max(), freq='D')

print(len(date_range), date_range)
def reindex_by_date(df):
return df.reindex(date_range).ffill()

df = df.groupby(['group','rtype','location','hardware']).apply(reindex_by_date).reset_index([0,1,2,3], drop=True)
print(df.to_string())

  1. Output of this code:

10 DatetimeIndex(['2020-08-01', '2020-08-02', '2020-08-03', '2020-08-04',
'2020-08-05', '2020-08-06', '2020-08-07', '2020-08-08',
'2020-08-09', '2020-08-10'],
dtype='datetime64[ns]', freq='D')
group rtype location hardware value
2020-08-01 NaN NaN NaN NaN NaN
2020-08-02 NaN NaN NaN NaN NaN
2020-08-03 NaN NaN NaN NaN NaN
2020-08-04 NaN NaN NaN NaN NaN
2020-08-05 NaN NaN NaN NaN NaN
2020-08-06 NaN NaN NaN NaN NaN
2020-08-07 NaN NaN NaN NaN NaN
2020-08-08 NaN NaN NaN NaN NaN
2020-08-09 NaN NaN NaN NaN NaN
2020-08-10 NaN NaN NaN NaN NaN
2020-08-01 NaN NaN NaN NaN NaN
2020-08-02 NaN NaN NaN NaN NaN
2020-08-03 NaN NaN NaN NaN NaN
2020-08-04 NaN NaN NaN NaN NaN
2020-08-05 NaN NaN NaN NaN NaN
2020-08-06 NaN NaN NaN NaN NaN
2020-08-07 NaN NaN NaN NaN NaN
2020-08-08 NaN NaN NaN NaN NaN
2020-08-09 NaN NaN NaN NaN NaN
2020-08-10 NaN NaN NaN NaN NaN
2020-08-01 NaN NaN NaN NaN NaN
2020-08-02 NaN NaN NaN NaN NaN
2020-08-03 NaN NaN NaN NaN NaN
2020-08-04 NaN NaN NaN NaN NaN
2020-08-05 NaN NaN NaN NaN NaN
2020-08-06 NaN NaN NaN NaN NaN
2020-08-07 NaN NaN NaN NaN NaN
2020-08-08 NaN NaN NaN NaN NaN
2020-08-09 NaN NaN NaN NaN NaN
2020-08-10 NaN NaN NaN NaN NaN

  1. Can someone help please?
  2. **EDIT:**
  3. After fixing the DatetimeIndex issue, and using `fillna(0)`:

df = pd.read_csv('data.csv', parse_dates=['date'])
df.set_index('date', inplace=True)
date_range = pd.date_range(df.index.min(), df.index.max(), freq='D')

print(len(date_range), date_range)
def reindex_by_date(df):
return df.reindex(date_range).ffill().fillna(0)

df = df.groupby(['group','rtype','location','hardware']).apply(reindex_by_date).reset_index([0,1,2,3], drop=True).reset_index().rename(columns={'index': 'date'})

print(df.to_string())

  1. Output:
  1. group rtype location hardware value

2020-08-01 0 0 0 0 0.0
2020-08-02 0 0 0 0 0.0
2020-08-03 0 0 0 0 0.0
2020-08-04 0 0 0 0 0.0
2020-08-05 my-group type-s NY DTop 10.0
2020-08-06 my-group type-s NY DTop 10.0
2020-08-07 my-group type-s NY DTop 20.0
2020-08-08 my-group type-s NY DTop 20.0
2020-08-09 my-group type-s NY DTop 20.0
2020-08-10 my-group type-s NY DTop 30.0
2020-08-01 my-group type-s NY Tower 40.0
2020-08-02 my-group type-s NY Tower 40.0
2020-08-03 my-group type-s NY Tower 40.0
2020-08-04 my-group type-s NY Tower 40.0
2020-08-05 my-group type-s NY Tower 40.0
2020-08-06 my-group type-s NY Tower 40.0
2020-08-07 my-group type-s NY Tower 50.0
2020-08-08 my-group type-s NY Tower 50.0
2020-08-09 my-group type-s NY Tower 50.0
2020-08-10 my-group type-s NY Tower 50.0
2020-08-01 0 0 0 0 0.0
2020-08-02 0 0 0 0 0.0
2020-08-03 0 0 0 0 0.0
2020-08-04 0 0 0 0 0.0
2020-08-05 0 0 0 0 0.0
2020-08-06 0 0 0 0 0.0
2020-08-07 0 0 0 0 0.0
2020-08-08 ot-group type-t NY LTop 90.0
2020-08-09 ot-group type-t NY LTop 90.0
2020-08-10 ot-group type-t NY LTop 90.0

  1. </details>
  2. # 答案1
  3. **得分**: 3
  4. 你忘记将`date`列转换为`DatetimeIndex`:
  5. ```python
  6. df = pd.read_csv('data.csv', parse_dates=['date'])
  7. # 在此处添加你的代码

另一种方法是使用pivot_table

  1. # 如果你使用了 pd.read_csv(..., parse_dates=['date']) 则不需要以下行
  2. df['date'] = pd.to_datetime(df['date'])
  3. dti = pd.date_range(df['date'].min(), df['date'].max(), freq='D')
  4. cols = ['group', 'rtype', 'location', 'hardware']
  5. out = (df.pivot_table(index='date', columns=cols, values='value')
  6. .reindex(dti.rename('date')).ffill().fillna(0)
  7. .unstack().rename('value').reset_index())

输出:

  1. >>> out
  2. group rtype location hardware date value
  3. 0 my-group type-s NY DTop 2020-08-01 0.0
  4. 1 my-group type-s NY DTop 2020-08-02 0.0
  5. 2 my-group type-s NY DTop 2020-08-03 0.0
  6. 3 my-group type-s NY DTop 2020-08-04 0.0
  7. 4 my-group type-s NY DTop 2020-08-05 10.0
  8. 5 my-group type-s NY DTop 2020-08-06 10.0
  9. 6 my-group type-s NY DTop 2020-08-07 20.0
  10. 7 my-group type-s NY DTop 2020-08-08 20.0
  11. 8 my-group type-s NY DTop 2020-08-09 20.0
  12. 9 my-group type-s NY DTop 2020-08-10 30.0
  13. 10 my-group type-s NY Tower 2020-08-01 40.0
  14. 11 my-group type-s NY Tower 2020-08-02 40.0
  15. 12 my-group type-s NY Tower 2020-08-03 40.0
  16. 13 my-group type-s NY Tower 2020-08-04 40.0
  17. 14 my-group type-s NY Tower 2020-08-05 40.0
  18. 15 my-group type-s NY Tower 2020-08-06 40.0
  19. 16 my-group type-s NY Tower 2020-08-07 50.0
  20. 17 my-group type-s NY Tower 2020-08-08 50.0
  21. 18 my-group type-s NY Tower 2020-08-09 50.0
  22. 19 my-group type-s NY Tower 2020-08-10 50.0
  23. 20 ot-group type-t NY LTop 2020-08-01 0.0
  24. 21 ot-group type-t NY LTop 2020-08-02 0.0
  25. 22 ot-group type-t NY LTop 2020-08-03 0.0
  26. 23 ot-group type-t NY LTop 2020-08-04 0.0
  27. 24 ot-group type-t NY LTop 2020-08-05 0.0
  28. 25 ot-group type-t NY LTop 2020-08-06 0.0
  29. 26 ot-group type-t NY LTop 2020-08-07 0.0
  30. 27 ot-group type-t NY LTop 2020-08-08 90.0
  31. 28 ot-group type-t NY LTop 2020-08-09 90.0
  32. 29 ot-group type-t NY LTop 2020-08-10 90.0

然而,只使用pivot_table而不使用unstack可能更好:

  1. >>> (df.pivot_table(index='date', columns=cols, values='value')
  2. .reindex(dti.rename('date')).ffill().fillna(0))
  3. group my-group ot-group
  4. rtype type-s type-t
  5. location NY NY
  6. hardware DTop Tower LTop
  7. date
  8. 2020-08-01 0.0 40.0 0.0
  9. 2020-08-02 0.0 40.0 0.0
  10. 2020-08-03 0.0 40.0 0.0
  11. 2020-08-04 0.0 40.0 0.0
  12. 2020-08-05 10.0 40.0 0.0
  13. 2020-08-06 10.0 40.0 0.0
  14. 2020-08-07 20.0 50.0 0.0
  15. 2020-08-08 20.0 50.0 90.0
  16. 2020-08-09 20.0 50.0 90.0
  17. 2020-08-10 30.0 50.0 90.0
英文:

You forgot to convert date column as DatetimeIndex:

  1. df = pd.read_csv(&#39;data.csv&#39;, parse_dates=[&#39;date&#39;])
  2. # your code here

Another way with pivot_table:

  1. # Not needed if you used pd.read_csv(..., parse_dates=[&#39;date&#39;])
  2. df[&#39;date&#39;] = pd.to_datetime(df[&#39;date&#39;])
  3. dti = pd.date_range(df[&#39;date&#39;].min(), df[&#39;date&#39;].max(), freq=&#39;D&#39;)
  4. cols = [&#39;group&#39;,&#39;rtype&#39;,&#39;location&#39;,&#39;hardware&#39;]
  5. out = (df.pivot_table(index=&#39;date&#39;, columns=cols, values=&#39;value&#39;)
  6. .reindex(dti.rename(&#39;date&#39;)).ffill().fillna(0)
  7. .unstack().rename(&#39;value&#39;).reset_index())

Output:

  1. &gt;&gt;&gt; out
  2. group rtype location hardware date value
  3. 0 my-group type-s NY DTop 2020-08-01 0.0
  4. 1 my-group type-s NY DTop 2020-08-02 0.0
  5. 2 my-group type-s NY DTop 2020-08-03 0.0
  6. 3 my-group type-s NY DTop 2020-08-04 0.0
  7. 4 my-group type-s NY DTop 2020-08-05 10.0
  8. 5 my-group type-s NY DTop 2020-08-06 10.0
  9. 6 my-group type-s NY DTop 2020-08-07 20.0
  10. 7 my-group type-s NY DTop 2020-08-08 20.0
  11. 8 my-group type-s NY DTop 2020-08-09 20.0
  12. 9 my-group type-s NY DTop 2020-08-10 30.0
  13. 10 my-group type-s NY Tower 2020-08-01 40.0
  14. 11 my-group type-s NY Tower 2020-08-02 40.0
  15. 12 my-group type-s NY Tower 2020-08-03 40.0
  16. 13 my-group type-s NY Tower 2020-08-04 40.0
  17. 14 my-group type-s NY Tower 2020-08-05 40.0
  18. 15 my-group type-s NY Tower 2020-08-06 40.0
  19. 16 my-group type-s NY Tower 2020-08-07 50.0
  20. 17 my-group type-s NY Tower 2020-08-08 50.0
  21. 18 my-group type-s NY Tower 2020-08-09 50.0
  22. 19 my-group type-s NY Tower 2020-08-10 50.0
  23. 20 ot-group type-t NY LTop 2020-08-01 0.0
  24. 21 ot-group type-t NY LTop 2020-08-02 0.0
  25. 22 ot-group type-t NY LTop 2020-08-03 0.0
  26. 23 ot-group type-t NY LTop 2020-08-04 0.0
  27. 24 ot-group type-t NY LTop 2020-08-05 0.0
  28. 25 ot-group type-t NY LTop 2020-08-06 0.0
  29. 26 ot-group type-t NY LTop 2020-08-07 0.0
  30. 27 ot-group type-t NY LTop 2020-08-08 90.0
  31. 28 ot-group type-t NY LTop 2020-08-09 90.0
  32. 29 ot-group type-t NY LTop 2020-08-10 90.0

However, it could be better to use only pivot_table without unstack:

  1. &gt;&gt;&gt; (df.pivot_table(index=&#39;date&#39;, columns=cols, values=&#39;value&#39;)
  2. .reindex(dti.rename(&#39;date&#39;)).ffill().fillna(0))
  3. group my-group ot-group
  4. rtype type-s type-t
  5. location NY NY
  6. hardware DTop Tower LTop
  7. date
  8. 2020-08-01 0.0 40.0 0.0
  9. 2020-08-02 0.0 40.0 0.0
  10. 2020-08-03 0.0 40.0 0.0
  11. 2020-08-04 0.0 40.0 0.0
  12. 2020-08-05 10.0 40.0 0.0
  13. 2020-08-06 10.0 40.0 0.0
  14. 2020-08-07 20.0 50.0 0.0
  15. 2020-08-08 20.0 50.0 90.0
  16. 2020-08-09 20.0 50.0 90.0
  17. 2020-08-10 30.0 50.0 90.0

答案2

得分: 0

以下是代码部分的翻译:

  1. # pip install pyjanitor
  2. import janitor
  3. import pandas as pd
  4. df = pd.read_clipboard(parse_dates=['date'])
  5. dti = pd.date_range(df['date'].min(), df['date'].max(), freq='D')
  6. cols = ['group', 'rtype', 'location', 'hardware']
  7. (df
  8. .complete(cols, {'date': dti})
  9. .assign(value = lambda f: f.value.ffill().fillna(0, downcast='infer'))
  10. )

请注意,这是代码的翻译,没有其他内容。

英文:

One option is with complete from pyjanitor:

  1. # pip install pyjanitor
  2. import janitor
  3. import pandas as pd
  4. df = pd.read_clipboard(parse_dates=[&#39;date&#39;])
  5. dti = pd.date_range(df[&#39;date&#39;].min(), df[&#39;date&#39;].max(), freq=&#39;D&#39;)
  6. cols = [&#39;group&#39;,&#39;rtype&#39;,&#39;location&#39;,&#39;hardware&#39;]
  7. (df
  8. .complete(cols, {&#39;date&#39;: dti})
  9. .assign(value = lambda f: f.value.ffill().fillna(0, downcast=&#39;infer&#39;))
  10. )
  11. group rtype location hardware date value
  12. 0 my-group type-s NY DTop 2020-08-01 0
  13. 1 my-group type-s NY DTop 2020-08-02 0
  14. 2 my-group type-s NY DTop 2020-08-03 0
  15. 3 my-group type-s NY DTop 2020-08-04 0
  16. 4 my-group type-s NY DTop 2020-08-05 10
  17. 5 my-group type-s NY DTop 2020-08-06 10
  18. 6 my-group type-s NY DTop 2020-08-07 20
  19. 7 my-group type-s NY DTop 2020-08-08 20
  20. 8 my-group type-s NY DTop 2020-08-09 20
  21. 9 my-group type-s NY DTop 2020-08-10 30
  22. 10 my-group type-s NY Tower 2020-08-01 40
  23. 11 my-group type-s NY Tower 2020-08-02 40
  24. 12 my-group type-s NY Tower 2020-08-03 40
  25. 13 my-group type-s NY Tower 2020-08-04 40
  26. 14 my-group type-s NY Tower 2020-08-05 40
  27. 15 my-group type-s NY Tower 2020-08-06 40
  28. 16 my-group type-s NY Tower 2020-08-07 50
  29. 17 my-group type-s NY Tower 2020-08-08 50
  30. 18 my-group type-s NY Tower 2020-08-09 50
  31. 19 my-group type-s NY Tower 2020-08-10 50
  32. 20 ot-group type-t NY LTop 2020-08-01 50
  33. 21 ot-group type-t NY LTop 2020-08-02 50
  34. 22 ot-group type-t NY LTop 2020-08-03 50
  35. 23 ot-group type-t NY LTop 2020-08-04 50
  36. 24 ot-group type-t NY LTop 2020-08-05 50
  37. 25 ot-group type-t NY LTop 2020-08-06 50
  38. 26 ot-group type-t NY LTop 2020-08-07 50
  39. 27 ot-group type-t NY LTop 2020-08-08 90
  40. 28 ot-group type-t NY LTop 2020-08-09 90
  41. 29 ot-group type-t NY LTop 2020-08-10 90

@corralien's suggestion of a pivot table without an unstack might be a better one, it all depends on your end goal

huangapple
  • 本文由 发表于 2023年5月25日 12:17:48
  • 转载请务必保留本文链接:https://go.coder-hub.com/76328903.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定