How do you fill missing dates in a Polars dataframe (python)?
Question
I can't find an equivalent in the Polars library. What I want to do is fill in the missing dates between two dates in a large dataframe. It has to be Polars because of the size of the data (> 100 million rows).
Below is the code I use with pandas. How can I do the same thing in Polars?
import janitor  # pyjanitor, provides DataFrame.complete
import pandas as pd
from datetime import datetime, timedelta

def missing_date_filler(d):
    df = d.copy()
    time_back = 1  # Look back in days
    td = pd.to_datetime(datetime.now().strftime("%Y-%m-%d"))
    helper = timedelta(days=time_back)
    max_date = (td - helper).strftime("%Y-%m-%d")  # Today's date minus 1 day
    # Full date range from the earliest date up until yesterday
    df_date = dict(Date=pd.date_range(df.Date.min(), max_date, freq='1D'))
    # Fill in the missing dates
    df = df.complete(['Col_A', 'Col_B'], df_date).sort_values("Date")
    return df
Answer 1
Score: 3
It sounds like you're looking for .upsample().
Note that you can use the by parameter to perform the operation on a per-group basis (in recent Polars releases this parameter is named group_by).
import polars as pl
from datetime import datetime

df = pl.DataFrame({
    "date": [datetime(2023, 1, 2), datetime(2023, 1, 5)],
    "value": [1, 2]
})

>>> df
shape: (2, 2)
┌─────────────────────┬───────┐
│ date                ┆ value │
│ ---                 ┆ ---   │
│ datetime[μs]        ┆ i64   │
╞═════════════════════╪═══════╡
│ 2023-01-02 00:00:00 ┆ 1     │
│ 2023-01-05 00:00:00 ┆ 2     │
└─────────────────────┴───────┘

>>> df.upsample(time_column="date", every="1d")
shape: (4, 2)
┌─────────────────────┬───────┐
│ date                ┆ value │
│ ---                 ┆ ---   │
│ datetime[μs]        ┆ i64   │
╞═════════════════════╪═══════╡
│ 2023-01-02 00:00:00 ┆ 1     │
│ 2023-01-03 00:00:00 ┆ null  │
│ 2023-01-04 00:00:00 ┆ null  │
│ 2023-01-05 00:00:00 ┆ 2     │
└─────────────────────┴───────┘