英文:
Full left outer join with less/more than or equal conditions
问题
我有2个数据框,我想执行一个简单的全外连接左连接,条件是小于或等于。
d1 = {'object': ['A1', 'B2', 'C1'],
'start_date': ['2019-01-19', '2009-07-03', '2021-11-23'],
'end_date': ['2019-03-15', '2009-10-28', '2024-12-22']
}
df1 = pandas.DataFrame(data=d1)
另一个数据框只是一个日历:
```python
df2 = pd.DataFrame({"date": pd.date_range('1990-01-01', '2050-12-31')})
df["first_day_of_month"] = df["date"].to_numpy().astype('datetime64[M]')
我想要的结果类似于以下SQL:
SELECT * FROM df1 LEFT JOIN df2
ON df1.start_date <= df2.date AND df1.end_date >= df2.date
WHERE df2.first_day_of_month = 1
因此,df1中的每一行都会与所有在df1.start_date和df2.end_date之间的月份的df2.first_day_of_month重复。
如何实现这个结果?
编辑:结果数据框将如下所示:
Object | start_date | end_date | df2.date |
---|---|---|---|
A1 | 2019-01-19 | 2019-03-15 | 2019-02-01 |
A1 | 2019-01-19 | 2019-03-15 | 2019-03-01 |
B2 | 2009-07-03 | 2009-10-28 | 2009-08-01 |
B2 | 2009-07-03 | 2009-10-28 | 2009-09-01 |
B2 | 2009-07-03 | 2009-10-28 | 2009-10-01 |
<details>
<summary>英文:</summary>
I have 2 dataframes that I want to perform a simple full outer left join with condition less than or equal.
d1 = {'object': ['A1', 'B2', 'C1'],
'start_date': ['2019-01-19', '2009-07-03', '2021-11-23'],
'end_date': ['2019-03-15', '2009-10-28', '2024-12-22']
}
df1 = pandas.DataFrame(data=d1)
The other dataframe is simply a calendar:
df2 = pd.DataFrame({"date": pd.date_range('1990-01-01', '2050-12-31')})
df["first_day_of_month"] = df["date"].to_numpy().astype('datetime64[M]')
I would like to get results that are similar with this type of sql:
SELECT * FROM df1 LEFT JOIN df2
ON df1.start_date <= df2.date AND df1.end_date >= df2.date
WHERE df2.first_day_of_month = 1
So every row in df1 repeated with all the ```df2.first_day_of_month```
of the months that are between ```df1.start_date``` and ```df2.end_date```.
How can I achieve this result?
Edit: the resulted dataframe would be like:
| Object | start_date | end_date | df2.date |
| -------- | -------------- |--------------|--------------|
| A1 | 2019-01-19 | 2019-03-15 | 2019-02-01 |
| A1 | 2019-01-19 | 2019-03-15 | 2019-03-01 |
| B2 | 2009-07-03 | 2009-10-28 | 2009-08-01 |
| B2 | 2009-07-03 | 2009-10-28 | 2009-09-01 |
| B2 | 2009-07-03 | 2009-10-28 | 2009-10-01 |
</details>
# 答案1
**得分**: 1
你可以使用 [`janitor`](https://pyjanitor-devs.github.io/pyjanitor) 的 [`conditional_join`](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join) 方法:
```python
# pip install janitor
import janitor
df1['start_date'] = pd.to_datetime(df1['start_date'])
df1['end_date'] = pd.to_datetime(df1['end_date'])
df2['date'] = pd.to_datetime(df2['date'])
out = df1.conditional_join(df2[df2['first_day_of_month'].dt.day.eq(1)],
('start_date', 'date', '<='),
('end_date', 'date', '>='),
how='left')
输出:
object start_date end_date date first_day_of_month
0 A1 2019-01-19 2022-12-15 2019-01-19 2019-01-01
1 A1 2019-01-19 2022-12-15 2019-01-20 2019-01-01
2 A1 2019-01-19 2022-12-15 2019-01-21 2019-01-01
3 A1 2019-01-19 2022-12-15 2019-01-22 2019-01-01
4 A1 2019-01-19 2022-12-15 2019-01-23 2019-01-01
... ... ... ... ... ...
5435 C1 2021-11-23 2024-12-22 2024-12-18 2024-12-01
5436 C1 2021-11-23 2024-12-22 2024-12-19 2024-12-01
5437 C1 2021-11-23 2024-12-22 2024-12-20 2024-12-01
5438 C1 2021-11-23 2024-12-22 2024-12-21 2024-12-01
5439 C1 2021-11-23 2024-12-22 2024-12-22 2024-12-01
[5440 rows x 5 columns]
英文:
You can use janitor
's conditional_join
:
# pip install janitor
import janitor
df1['start_date'] = pd.to_datetime(df1['start_date'])
df1['end_date'] = pd.to_datetime(df1['end_date'])
df2['date'] = pd.to_datetime(df2['date'])
out = df1.conditional_join(df2[df2['first_day_of_month'].dt.day.eq(1)],
('start_date', 'date', '<='),
('end_date', 'date', '>='),
how='left')
Output:
object start_date end_date date first_day_of_month
0 A1 2019-01-19 2022-12-15 2019-01-19 2019-01-01
1 A1 2019-01-19 2022-12-15 2019-01-20 2019-01-01
2 A1 2019-01-19 2022-12-15 2019-01-21 2019-01-01
3 A1 2019-01-19 2022-12-15 2019-01-22 2019-01-01
4 A1 2019-01-19 2022-12-15 2019-01-23 2019-01-01
... ... ... ... ... ...
5435 C1 2021-11-23 2024-12-22 2024-12-18 2024-12-01
5436 C1 2021-11-23 2024-12-22 2024-12-19 2024-12-01
5437 C1 2021-11-23 2024-12-22 2024-12-20 2024-12-01
5438 C1 2021-11-23 2024-12-22 2024-12-21 2024-12-01
5439 C1 2021-11-23 2024-12-22 2024-12-22 2024-12-01
[5440 rows x 5 columns]
答案2
得分: 0
以下是翻译好的代码部分:
import pandas as pd
import numpy as np
d1 = {
'object': ['A1', 'B2', 'C1'],
'start_date': ['2019-01-19', '2009-07-03', '2021-11-23'],
'end_date': ['2022-12-15', '2017-05-28', '2024-12-22']
}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame({"date": pd.date_range('1990-01-01', '2050-12-31')})
df2_filtered = df2[df2["date"].dt.day == 1]
# 创建一个用于连接的公共键
df1['key'] = 1
df2_filtered['key'] = 1
# 执行交叉连接
merged_df = pd.merge(df1, df2_filtered, on='key', how='outer')
# 删除键列
merged_df = merged_df.drop(columns='key')
# 应用条件
filtered_df = merged_df[
(merged_df['start_date'] <= merged_df['date']) & (merged_df['end_date'] >= merged_df['date'])
]
filtered_df
# 重置索引
filtered_df = filtered_df.reset_index(drop=True)
# 打印结果数据框
print(filtered_df)
希望这对你有所帮助。
英文:
Hello there you can use the following logic
import pandas as pd
import numpy as np
d1 = {
'object': ['A1', 'B2', 'C1'],
'start_date': ['2019-01-19', '2009-07-03', '2021-11-23'],
'end_date': ['2022-12-15', '2017-05-28', '2024-12-22']
}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame({"date": pd.date_range('1990-01-01', '2050-12-31')})
df2_filtered = df2[df2["date"].dt.day == 1]
# Create a common key for joining
df1['key'] = 1
df2_filtered['key'] = 1
# Perform the cross join
merged_df = pd.merge(df1, df2_filtered, on='key', how='outer')
# Drop the key column
merged_df = merged_df.drop(columns='key')
# Apply the conditions
filtered_df = merged_df[
(merged_df['start_date'] <= merged_df['date']) & (merged_df['end_date'] >= merged_df['date'])
]
filtered_df
# Reset the index
filtered_df = filtered_df.reset_index(drop=True)
# Print the resulting dataframe
print(filtered_df)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论