全外连接与小于/大于或等于条件的左连接。

huangapple go评论65阅读模式
英文:

Full left outer join with less/more than or equal conditions

问题

我有2个数据框,我想执行一个简单的全外连接左连接,条件是小于或等于。

d1 = {'object': ['A1', 'B2', 'C1'], 
      'start_date': ['2019-01-19', '2009-07-03', '2021-11-23'],
      'end_date': ['2019-03-15', '2009-10-28', '2024-12-22']
     }
df1 = pandas.DataFrame(data=d1)

另一个数据框只是一个日历

```python
df2 = pd.DataFrame({"date": pd.date_range('1990-01-01', '2050-12-31')})
df["first_day_of_month"] = df["date"].to_numpy().astype('datetime64[M]')

我想要的结果类似于以下SQL:

SELECT * FROM df1 LEFT JOIN df2
ON df1.start_date <= df2.date AND df1.end_date >= df2.date
WHERE df2.first_day_of_month = 1

因此,df1中的每一行都会与所有在df1.start_date和df2.end_date之间的月份的df2.first_day_of_month重复。

如何实现这个结果?

编辑:结果数据框将如下所示:

Object start_date end_date df2.date
A1 2019-01-19 2019-03-15 2019-02-01
A1 2019-01-19 2019-03-15 2019-03-01
B2 2009-07-03 2009-10-28 2009-08-01
B2 2009-07-03 2009-10-28 2009-09-01
B2 2009-07-03 2009-10-28 2009-10-01

<details>
<summary>英文:</summary>

I have 2 dataframes that I want to perform a simple full outer left join with condition less than or equal. 
 

    d1 = {&#39;object&#39;: [&#39;A1&#39;, &#39;B2&#39;, &#39;C1&#39;], 
          &#39;start_date&#39;: [&#39;2019-01-19&#39;, &#39;2009-07-03&#39;, &#39;2021-11-23&#39;],
          &#39;end_date&#39;: [&#39;2019-03-15&#39;, &#39;2009-10-28&#39;, &#39;2024-12-22&#39;]
         }
    df1 = pandas.DataFrame(data=d1)
 

The other dataframe is simply a calendar:

    df2 = pd.DataFrame({&quot;date&quot;: pd.date_range(&#39;1990-01-01&#39;, &#39;2050-12-31&#39;)})
    df[&quot;first_day_of_month&quot;] = df[&quot;date&quot;].to_numpy().astype(&#39;datetime64[M]&#39;)


I would like to get results that are similar with this type of sql:

    SELECT * FROM df1 LEFT JOIN df2
    ON df1.start_date &lt;= df2.date AND df1.end_date &gt;= df2.date
    WHERE df2.first_day_of_month = 1
    
So every row in df1 repeated with all the ```df2.first_day_of_month```
of the months that are between ```df1.start_date``` and ```df2.end_date```.

How can I achieve this result?

Edit: the resulted dataframe would be like:

| Object   | start_date     | end_date     | df2.date     |
| -------- | -------------- |--------------|--------------|
| A1       | 2019-01-19     | 2019-03-15   | 2019-02-01   |
| A1       | 2019-01-19     | 2019-03-15   | 2019-03-01   |
| B2       | 2009-07-03     | 2009-10-28   | 2009-08-01   |
| B2       | 2009-07-03     | 2009-10-28   | 2009-09-01   |
| B2       | 2009-07-03     | 2009-10-28   | 2009-10-01   |

</details>


# 答案1
**得分**: 1

你可以使用 [`janitor`](https://pyjanitor-devs.github.io/pyjanitor) 的 [`conditional_join`](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join) 方法:

```python
# pip install janitor
import janitor

df1['start_date'] = pd.to_datetime(df1['start_date'])
df1['end_date'] = pd.to_datetime(df1['end_date'])
df2['date'] = pd.to_datetime(df2['date'])

out = df1.conditional_join(df2[df2['first_day_of_month'].dt.day.eq(1)],
                           ('start_date', 'date', '<='),
                           ('end_date', 'date', '>='),
                           how='left')

输出:

     object start_date   end_date       date first_day_of_month
0        A1 2019-01-19 2022-12-15 2019-01-19         2019-01-01
1        A1 2019-01-19 2022-12-15 2019-01-20         2019-01-01
2        A1 2019-01-19 2022-12-15 2019-01-21         2019-01-01
3        A1 2019-01-19 2022-12-15 2019-01-22         2019-01-01
4        A1 2019-01-19 2022-12-15 2019-01-23         2019-01-01
...     ...        ...        ...        ...                ...
5435     C1 2021-11-23 2024-12-22 2024-12-18         2024-12-01
5436     C1 2021-11-23 2024-12-22 2024-12-19         2024-12-01
5437     C1 2021-11-23 2024-12-22 2024-12-20         2024-12-01
5438     C1 2021-11-23 2024-12-22 2024-12-21         2024-12-01
5439     C1 2021-11-23 2024-12-22 2024-12-22         2024-12-01

[5440 rows x 5 columns]
英文:

You can use janitor's conditional_join:

# pip install janitor
import janitor
df1[&#39;start_date&#39;] = pd.to_datetime(df1[&#39;start_date&#39;])
df1[&#39;end_date&#39;] = pd.to_datetime(df1[&#39;end_date&#39;])
df2[&#39;date&#39;] = pd.to_datetime(df2[&#39;date&#39;])
out = df1.conditional_join(df2[df2[&#39;first_day_of_month&#39;].dt.day.eq(1)],
(&#39;start_date&#39;, &#39;date&#39;, &#39;&lt;=&#39;),
(&#39;end_date&#39;, &#39;date&#39;, &#39;&gt;=&#39;),
how=&#39;left&#39;)

Output:

     object start_date   end_date       date first_day_of_month
0        A1 2019-01-19 2022-12-15 2019-01-19         2019-01-01
1        A1 2019-01-19 2022-12-15 2019-01-20         2019-01-01
2        A1 2019-01-19 2022-12-15 2019-01-21         2019-01-01
3        A1 2019-01-19 2022-12-15 2019-01-22         2019-01-01
4        A1 2019-01-19 2022-12-15 2019-01-23         2019-01-01
...     ...        ...        ...        ...                ...
5435     C1 2021-11-23 2024-12-22 2024-12-18         2024-12-01
5436     C1 2021-11-23 2024-12-22 2024-12-19         2024-12-01
5437     C1 2021-11-23 2024-12-22 2024-12-20         2024-12-01
5438     C1 2021-11-23 2024-12-22 2024-12-21         2024-12-01
5439     C1 2021-11-23 2024-12-22 2024-12-22         2024-12-01
[5440 rows x 5 columns]

答案2

得分: 0

以下是翻译好的代码部分:

import pandas as pd
import numpy as np

d1 = {
    'object': ['A1', 'B2', 'C1'],
    'start_date': ['2019-01-19', '2009-07-03', '2021-11-23'],
    'end_date': ['2022-12-15', '2017-05-28', '2024-12-22']
}

df1 = pd.DataFrame(data=d1)

df2 = pd.DataFrame({"date": pd.date_range('1990-01-01', '2050-12-31')})
df2_filtered = df2[df2["date"].dt.day == 1]

# 创建一个用于连接的公共键
df1['key'] = 1
df2_filtered['key'] = 1

# 执行交叉连接
merged_df = pd.merge(df1, df2_filtered, on='key', how='outer')

# 删除键列
merged_df = merged_df.drop(columns='key')

# 应用条件
filtered_df = merged_df[
    (merged_df['start_date'] <= merged_df['date']) & (merged_df['end_date'] >= merged_df['date'])
]

filtered_df

# 重置索引
filtered_df = filtered_df.reset_index(drop=True)

# 打印结果数据框
print(filtered_df)

希望这对你有所帮助。

英文:

Hello there you can use the following logic 全外连接与小于/大于或等于条件的左连接。

import pandas as pd
import numpy as np
d1 = {
&#39;object&#39;: [&#39;A1&#39;, &#39;B2&#39;, &#39;C1&#39;],
&#39;start_date&#39;: [&#39;2019-01-19&#39;, &#39;2009-07-03&#39;, &#39;2021-11-23&#39;],
&#39;end_date&#39;: [&#39;2022-12-15&#39;, &#39;2017-05-28&#39;, &#39;2024-12-22&#39;]
}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame({&quot;date&quot;: pd.date_range(&#39;1990-01-01&#39;, &#39;2050-12-31&#39;)})
df2_filtered = df2[df2[&quot;date&quot;].dt.day == 1]
# Create a common key for joining
df1[&#39;key&#39;] = 1
df2_filtered[&#39;key&#39;] = 1
# Perform the cross join
merged_df = pd.merge(df1, df2_filtered, on=&#39;key&#39;, how=&#39;outer&#39;)
# Drop the key column
merged_df = merged_df.drop(columns=&#39;key&#39;)
# Apply the conditions
filtered_df = merged_df[
(merged_df[&#39;start_date&#39;] &lt;= merged_df[&#39;date&#39;]) &amp; (merged_df[&#39;end_date&#39;] &gt;= merged_df[&#39;date&#39;])
]
filtered_df
# Reset the index
filtered_df = filtered_df.reset_index(drop=True)
# Print the resulting dataframe
print(filtered_df)

全外连接与小于/大于或等于条件的左连接。

huangapple
  • 本文由 发表于 2023年5月25日 15:23:08
  • 转载请务必保留本文链接:https://go.coder-hub.com/76329790.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定