英文:
How to get difference of months replicating Excel behavior
问题
以下是翻译好的部分:
我想要**减去**两个日期列并获取它们的**差异**,以**月**为单位,而不是天,但一直没有成功。
数据:
```python
import polars as pl
from datetime import datetime
test_df = pl.DataFrame(
{
"dt_end": [datetime(2022, 1, 1).date(), datetime(2022, 1, 2).date()],
"dt_start": [datetime(2021, 5, 7).date(), datetime(2020, 7, 8).date()],
}
)
这会给出天数的差异:
test_df.with_columns(
diff_months = pl.col('dt_end') - pl.col('dt_start')
)
输出:
dt_end dt_start diff_months
date date duration[ms]
2022-01-01 2021-05-07 239d
2022-01-02 2020-07-08 543d
我尝试了下面的代码来获取月份,但是不起作用:
test_df.with_columns(
diff_months = pl.duration(months = pl.col('dt_end') - pl.col('dt_start'))
)
更新
我还尝试了下面的代码,但它也给我提供了与Excel不同的值。
test_df.with_columns(
diff_months = (pl.col('dt_end').dt.year() - pl.col('dt_start').dt.year()) * 12 +
(pl.col('dt_end').dt.month().cast(pl.Int32) - pl.col('dt_start').dt.month().cast(pl.Int32))
)
输出:
dt_end dt_start diff_months
date date i32
2022-01-01 2021-05-07 8
2022-01-02 2020-07-08 18
更新 2:
我认为这是以30.42作为每月平均天数的手动方法:
test_df.with_columns(
diff_months = pl.col('dt_end') - pl.col('dt_start')
).with_columns(
diff_months = (pl.col('diff_months').dt.days()/30.42).floor().cast(pl.Int8)
)
输出
dt_end dt_start diff_months
date date i8
2022-01-01 2021-05-07 7
2022-01-02 2020-07-08 17
更新 3:
实际数据的快照,带有期望的 Excel 值参考:
# 数据
snapshot_df = pl.DataFrame(
{
"Reporting Month": [datetime(2000, 7, 20).date(), datetime(2000, 8, 20).date(), datetime(2000, 9, 20).date()],
"Origination date": [datetime(1999, 12, 19).date(), datetime(1999, 12, 19).date(), datetime(1999, 12, 19).date()],
"Excel_Reference_Age": [7,8,9]
}
)
比较
snapshot_df.with_columns(
diff_months = pl.col('Reporting Month') - pl.col('Origination date')
).with_columns(
diff_months = (pl.col('diff_months').dt.days()/30.42).floor().cast(pl.Int16),
diff_months_1 = (pl.col('Reporting Month').dt.year() - pl.col('Origination date').dt.year()) * 12 +
(pl.col('Reporting Month').dt.month().cast(pl.Int32) - pl.col('Origination date').dt.month().cast(pl.Int32)) ,
diff_months_2 = pl.date_range(pl.col("Origination date"), pl.col("Reporting Month"), "1mo").list.lengths()
).select('Excel_Reference_Age','diff_months','diff_months_1','diff_months_2)
输出
Excel_Reference_Age diff_months diff_months_1 diff_months_2
i64 i16 i32 u32
7 7 7 8
8 8 8 9
9 9 9 10
<details>
<summary>英文:</summary>
I would like to **subtract** `two date columns` and get their **difference** in **months** `unit` instead of days but haven't been able to.
Data:
import polars as pl
from datetime import datetime
test_df = pl.DataFrame(
{
"dt_end": [datetime(2022, 1, 1).date(), datetime(2022, 1, 2).date()],
"dt_start": [datetime(2021, 5, 7).date(), datetime(2020, 7, 8).date()],
}
)
This gives difference in Days
test_df.with_columns(
diff_months = pl.col('dt_end') - pl.col('dt_start')
)
output:
dt_end dt_start diff_months
date date duration[ms]
2022-01-01 2021-05-07 239d
2022-01-02 2020-07-08 543d
I have tried below code for **months** but that **doesn't work**
test_df.with_columns(
diff_months = pl.duration(months = pl.col('dt_end') - pl.col('dt_start'))
)
**UPDATE**
Have also tried below code but that also gives me a different value from the excel.
test_df.with_columns(
diff_months = (pl.col('dt_end').dt.year() - pl.col('dt_start').dt.year()) * 12 +
(pl.col('dt_end').dt.month().cast(pl.Int32) - pl.col('dt_start').dt.month().cast(pl.Int32))
)
output:
dt_end dt_start diff_months
date date i32
2022-01-01 2021-05-07 8
2022-01-02 2020-07-08 18
**UPDATE 2:**
I think this is the **manual** way of doing it by using **30.42** as the **avg** number of days in month
test_df.with_columns(
diff_months = pl.col('dt_end') - pl.col('dt_start')
).with_columns(
diff_months = (pl.col('diff_months').dt.days()/30.42).floor().cast(pl.Int8)
)
output
dt_end dt_start diff_months
date date i8
2022-01-01 2021-05-07 7
2022-01-02 2020-07-08 17
**UPDATE 3:** Snapshot of actual data with desired Reference excel value
Data
snapshot_df = pl.DataFrame(
{
"Reporting Month": [datetime(2000, 7, 20).date(), datetime(2000, 8, 20).date(), datetime(2000, 9, 20).date()],
"Origination date": [datetime(1999, 12, 19).date(), datetime(1999, 12, 19).date(), datetime(1999, 12, 19).date()],
"Excel_Reference_Age": [7,8,9]
}
)
comparison
snapshot_df.with_columns(
diff_months = pl.col('Reporting Month') - pl.col('Origination date')
).with_columns(
diff_months = (pl.col('diff_months').dt.days()/30.42).floor().cast(pl.Int16),
diff_months_1 = (pl.col('Reporting Month').dt.year() - pl.col('Origination date').dt.year()) * 12 +
(pl.col('Reporting Month').dt.month().cast(pl.Int32) - pl.col('Origination date').dt.month().cast(pl.Int32)) ,
diff_months_2 = pl.date_range(pl.col("Origination date"), pl.col("Reporting Month"), "1mo").list.lengths()
).select('Excel_Reference_Age','diff_months','diff_months_1','diff_months_2')
output
Excel_Reference_Age diff_months diff_months_1 diff_months_2
i64 i16 i32 u32
7 7 7 8
8 8 8 9
9 9 9 10
</details>
# 答案1
**得分**: 1
你的更新#1没问题,只需在`dt.days`对于`end`尚未达到`start`的情况下也进行+1调整:
```python
test_df.with_columns(
diff_months=12 * (pl.col('dt_end').dt.year() - pl.col('dt_start').dt.year())
+ (
pl.col('dt_end').dt.month().cast(pl.Int32)
- pl.col('dt_start').dt.month().cast(pl.Int32)
)
- (pl.col('dt_end').dt.day() < pl.col('dt_start').dt.day()).cast(pl.Int32)
)
英文:
Your update #1 is just fine, you just also need to adjust by 1 if dt.days
for end
has not reached start
:
test_df.with_columns(
diff_months=12 * (pl.col('dt_end').dt.year() - pl.col('dt_start').dt.year())
+ (
pl.col('dt_end').dt.month().cast(pl.Int32)
- pl.col('dt_start').dt.month().cast(pl.Int32)
)
- (pl.col('dt_end').dt.day() < pl.col('dt_start').dt.day()).cast(pl.Int32)
)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论