2023年5月25日 15:23:08go评论100阅读模式

英文:

Full left outer join with less/more than or equal conditions

问题

我有2个数据框，我想执行一个简单的全外连接左连接，条件是小于或等于。

d1 = {'object': ['A1', 'B2', 'C1'], 
      'start_date': ['2019-01-19', '2009-07-03', '2021-11-23'],
      'end_date': ['2019-03-15', '2009-10-28', '2024-12-22']
     }
df1 = pandas.DataFrame(data=d1)
另一个数据框只是一个日历：
```python
df2 = pd.DataFrame({"date": pd.date_range('1990-01-01', '2050-12-31')})
df["first_day_of_month"] = df["date"].to_numpy().astype('datetime64[M]')

我想要的结果类似于以下SQL：

SELECT * FROM df1 LEFT JOIN df2
ON df1.start_date <= df2.date AND df1.end_date >= df2.date
WHERE df2.first_day_of_month = 1

因此，df1中的每一行都会与所有在df1.start_date和df2.end_date之间的月份的df2.first_day_of_month重复。

如何实现这个结果？

编辑：结果数据框将如下所示：

Object	start_date	end_date	df2.date
A1	2019-01-19	2019-03-15	2019-02-01
A1	2019-01-19	2019-03-15	2019-03-01
B2	2009-07-03	2009-10-28	2009-08-01
B2	2009-07-03	2009-10-28	2009-09-01
B2	2009-07-03	2009-10-28	2009-10-01


<details>
<summary>英文:</summary>
I have 2 dataframes that I want to perform a simple full outer left join with condition less than or equal. 
 
    d1 = {&#39;object&#39;: [&#39;A1&#39;, &#39;B2&#39;, &#39;C1&#39;], 
          &#39;start_date&#39;: [&#39;2019-01-19&#39;, &#39;2009-07-03&#39;, &#39;2021-11-23&#39;],
          &#39;end_date&#39;: [&#39;2019-03-15&#39;, &#39;2009-10-28&#39;, &#39;2024-12-22&#39;]
         }
    df1 = pandas.DataFrame(data=d1)
 
The other dataframe is simply a calendar:
    df2 = pd.DataFrame({&quot;date&quot;: pd.date_range(&#39;1990-01-01&#39;, &#39;2050-12-31&#39;)})
    df[&quot;first_day_of_month&quot;] = df[&quot;date&quot;].to_numpy().astype(&#39;datetime64[M]&#39;)
I would like to get results that are similar with this type of sql:
    SELECT * FROM df1 LEFT JOIN df2
    ON df1.start_date &lt;= df2.date AND df1.end_date &gt;= df2.date
    WHERE df2.first_day_of_month = 1
    
So every row in df1 repeated with all the ```df2.first_day_of_month```
of the months that are between ```df1.start_date``` and ```df2.end_date```.
How can I achieve this result?
Edit: the resulted dataframe would be like:
| Object   | start_date     | end_date     | df2.date     |
| -------- | -------------- |--------------|--------------|
| A1       | 2019-01-19     | 2019-03-15   | 2019-02-01   |
| A1       | 2019-01-19     | 2019-03-15   | 2019-03-01   |
| B2       | 2009-07-03     | 2009-10-28   | 2009-08-01   |
| B2       | 2009-07-03     | 2009-10-28   | 2009-09-01   |
| B2       | 2009-07-03     | 2009-10-28   | 2009-10-01   |
</details>
# 答案1
**得分**: 1
你可以使用 [`janitor`](https://pyjanitor-devs.github.io/pyjanitor) 的 [`conditional_join`](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join) 方法：
```python
# pip install janitor
import janitor
df1['start_date'] = pd.to_datetime(df1['start_date'])
df1['end_date'] = pd.to_datetime(df1['end_date'])
df2['date'] = pd.to_datetime(df2['date'])
out = df1.conditional_join(df2[df2['first_day_of_month'].dt.day.eq(1)],
                           ('start_date', 'date', '<='),
                           ('end_date', 'date', '>='),
                           how='left')

输出：

     object start_date   end_date       date first_day_of_month
0        A1 2019-01-19 2022-12-15 2019-01-19         2019-01-01
1        A1 2019-01-19 2022-12-15 2019-01-20         2019-01-01
2        A1 2019-01-19 2022-12-15 2019-01-21         2019-01-01
3        A1 2019-01-19 2022-12-15 2019-01-22         2019-01-01
4        A1 2019-01-19 2022-12-15 2019-01-23         2019-01-01
...     ...        ...        ...        ...                ...
5435     C1 2021-11-23 2024-12-22 2024-12-18         2024-12-01
5436     C1 2021-11-23 2024-12-22 2024-12-19         2024-12-01
5437     C1 2021-11-23 2024-12-22 2024-12-20         2024-12-01
5438     C1 2021-11-23 2024-12-22 2024-12-21         2024-12-01
5439     C1 2021-11-23 2024-12-22 2024-12-22         2024-12-01
[5440 rows x 5 columns]

英文:

You can use janitor's conditional_join:

# pip install janitor
import janitor
df1[&#39;start_date&#39;] = pd.to_datetime(df1[&#39;start_date&#39;])
df1[&#39;end_date&#39;] = pd.to_datetime(df1[&#39;end_date&#39;])
df2[&#39;date&#39;] = pd.to_datetime(df2[&#39;date&#39;])
out = df1.conditional_join(df2[df2[&#39;first_day_of_month&#39;].dt.day.eq(1)],
(&#39;start_date&#39;, &#39;date&#39;, &#39;&lt;=&#39;),
(&#39;end_date&#39;, &#39;date&#39;, &#39;&gt;=&#39;),
how=&#39;left&#39;)

Output:

     object start_date   end_date       date first_day_of_month
0        A1 2019-01-19 2022-12-15 2019-01-19         2019-01-01
1        A1 2019-01-19 2022-12-15 2019-01-20         2019-01-01
2        A1 2019-01-19 2022-12-15 2019-01-21         2019-01-01
3        A1 2019-01-19 2022-12-15 2019-01-22         2019-01-01
4        A1 2019-01-19 2022-12-15 2019-01-23         2019-01-01
...     ...        ...        ...        ...                ...
5435     C1 2021-11-23 2024-12-22 2024-12-18         2024-12-01
5436     C1 2021-11-23 2024-12-22 2024-12-19         2024-12-01
5437     C1 2021-11-23 2024-12-22 2024-12-20         2024-12-01
5438     C1 2021-11-23 2024-12-22 2024-12-21         2024-12-01
5439     C1 2021-11-23 2024-12-22 2024-12-22         2024-12-01
[5440 rows x 5 columns]

答案2

得分: 0

以下是翻译好的代码部分：

import pandas as pd
import numpy as np
d1 = {
    'object': ['A1', 'B2', 'C1'],
    'start_date': ['2019-01-19', '2009-07-03', '2021-11-23'],
    'end_date': ['2022-12-15', '2017-05-28', '2024-12-22']
}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame({"date": pd.date_range('1990-01-01', '2050-12-31')})
df2_filtered = df2[df2["date"].dt.day == 1]
# 创建一个用于连接的公共键
df1['key'] = 1
df2_filtered['key'] = 1
# 执行交叉连接
merged_df = pd.merge(df1, df2_filtered, on='key', how='outer')
# 删除键列
merged_df = merged_df.drop(columns='key')
# 应用条件
filtered_df = merged_df[
    (merged_df['start_date'] <= merged_df['date']) & (merged_df['end_date'] >= merged_df['date'])
]
filtered_df
# 重置索引
filtered_df = filtered_df.reset_index(drop=True)
# 打印结果数据框
print(filtered_df)

希望这对你有所帮助。

英文:

Hello there you can use the following logic

import pandas as pd
import numpy as np
d1 = {
&#39;object&#39;: [&#39;A1&#39;, &#39;B2&#39;, &#39;C1&#39;],
&#39;start_date&#39;: [&#39;2019-01-19&#39;, &#39;2009-07-03&#39;, &#39;2021-11-23&#39;],
&#39;end_date&#39;: [&#39;2022-12-15&#39;, &#39;2017-05-28&#39;, &#39;2024-12-22&#39;]
}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame({&quot;date&quot;: pd.date_range(&#39;1990-01-01&#39;, &#39;2050-12-31&#39;)})
df2_filtered = df2[df2[&quot;date&quot;].dt.day == 1]
# Create a common key for joining
df1[&#39;key&#39;] = 1
df2_filtered[&#39;key&#39;] = 1
# Perform the cross join
merged_df = pd.merge(df1, df2_filtered, on=&#39;key&#39;, how=&#39;outer&#39;)
# Drop the key column
merged_df = merged_df.drop(columns=&#39;key&#39;)
# Apply the conditions
filtered_df = merged_df[
(merged_df[&#39;start_date&#39;] &lt;= merged_df[&#39;date&#39;]) &amp; (merged_df[&#39;end_date&#39;] &gt;= merged_df[&#39;date&#39;])
]
filtered_df
# Reset the index
filtered_df = filtered_df.reset_index(drop=True)
# Print the resulting dataframe
print(filtered_df)

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

全外连接与小于/大于或等于条件的左连接。

问题

答案2

如何在Matplotlib中为两个不同的数据源绘制多个动画。

drelu[z ≤ 0] = 0 的含义是什么？

最佳实践是更新包含列表和字典的字典列表中的字段。

在使用PyMongo时，我需要在使用后关闭MongoClient吗？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。