英文:
Delete rows based on duplicates AND dates (pandas)
问题
我有一个数据框如下:
df = """
doc_id X1 Y1 Z1 Date
2 33075059515 30x RU 5 2014-01-15T00:00:00Z
3 53075235984 50x RU 6 2011-01-09T00:00:00Z
4 28865465787 30x RU NaN 2014-01-23T00:00:00Z
5 28865465787 30x SO 5 2021-06-03T12:00:00Z
6 28865465787 50x SO 4 2011-02-03T00:00:00Z
8 87652548931 50x SO NaN 2016-10-27T12:00:00Z
我想要删除一行,只有当它在列X1和Y1中是另一行的重复项,并且它在'日期'列中的日期与另一行的日期相隔不超过4周时。
希望结果如下:
df = """
doc_id X1 Y1 Z1 Date
2 33075059515 30x RU 5 2014-01-15T00:00:00Z
3 53075235984 50x RU 6 2011-01-09T00:00:00Z
5 28865465787 30x SO 5 2021-06-03T12:00:00Z
6 28865465787 50x SO 4 2011-02-03T00:00:00Z
8 87652548931 50x SO NaN 2016-10-27T12:00:00Z
删除X1和Y1列中的重复项很简单,但我无法弄清如何将其与日期的滚动4周窗口结合起来。
英文:
I have a dataframe as such:
df = """
doc_id X1 Y1 Z1 Date
2 33075059515 30x RU 5 2014-01-15T00:00:00Z
3 53075235984 50x RU 6 2011-01-09T00:00:00Z
4 28865465787 30x RU NaN 2014-01-23T00:00:00Z
5 28865465787 30x SO 5 2021-06-03T12:00:00Z
6 28865465787 50x SO 4 2011-02-03T00:00:00Z
8 87652548931 50x SO NaN 2016-10-27T12:00:00Z
I would like to delete a row only when it is a duplicate of another row in column X1 and Y1 AND when it has a date in the 'Date' column that is within 4 weeks of another row.
The hope is for it to look like this:
df = """
doc_id X1 Y1 Z1 Date
2 33075059515 30x RU 5 2014-01-15T00:00:00Z
3 53075235984 50x RU 6 2011-01-09T00:00:00Z
5 28865465787 30x SO 5 2021-06-03T12:00:00Z
6 28865465787 50x SO 4 2011-02-03T00:00:00Z
8 87652548931 50x SO NaN 2016-10-27T12:00:00Z
Deleting duplicates for column X1 and Y1 is simple, but I cannot figure out how to combine it with a rolling 4 week window for the date.
EDIT as per Mozway's request. This is a new example of the input:
doc_id X1 Y1 Z1 Date
1 68754534654 30x RU 5 2014-01-14T00:00:00Z
2 33075059515 30x RU 5 2014-01-15T00:00:00Z
3 53075235984 50x RU 6 2011-01-09T00:00:00Z
4 28865465787 30x RU NaN 2014-01-23T00:00:00Z
5 28865465787 30x SO 5 2021-06-03T12:00:00Z
6 28865465787 50x SO 4 2011-02-03T00:00:00Z
7 78764599568 50x SO NaN 2016-10-27T12:00:00Z
8 87652548931 50x SO NaN 2016-10-27T12:00:00Z
Here an example of the output I am getting which is not yet the desired output (the rows in bold should also have been removed):
doc_id X1 Y1 Z1 Date
1 68754534654 30x RU 5 2014-01-14T00:00:00Z
3 53075235984 50x RU 6 2011-01-09T00:00:00Z
**4 28865465787 30x RU NaN 2014-01-23T00:00:00Z**
5 28865465787 30x SO 5 2021-06-03T12:00:00Z
6 28865465787 50x SO 4 2011-02-03T00:00:00Z
7 78764599568 50x SO NaN 2016-10-27T12:00:00Z
**8 87652548931 50x SO NaN 2016-10-27T12:00:00Z**
</details>
# 答案1
**得分**: 2
使用[`merge_asof`](https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html)来识别两个键上哪些行在阈值内有匹配,然后进行[布尔索引](https://pandas.pydata.org/docs/user_guide/indexing.html#boolean-indexing):
```python
df2 = (df.assign(Date=pd.to_datetime(df['Date']))
.sort_values(by='Date').reset_index()
)
keep = (pd.merge_asof(df2, df2, on='Date', by=['X1', 'Y1'], suffixes=(None, '_'),
allow_exact_matches=False, tolerance=pd.Timedelta('4W'))
.set_index('index')
['index_'].isna()
)
out = df[keep]
输出:
doc_id X1 Y1 Z1 Date
2 33075059515 30x RU 5.0 2014-01-15 00:00:00+00:00
3 53075235984 50x RU 6.0 2011-01-09 00:00:00+00:00
5 28865465787 30x SO 5.0 2021-06-03 12:00:00+00:00
6 28865465787 50x SO 4.0 2011-02-03 00:00:00+00:00
8 87652548931 50x SO NaN 2016-10-27 12:00:00+00:00
中间的keep
:
index
3 True
6 True
2 True
4 False
8 True
5 True
Name: doc_id_, dtype: bool
英文:
Use a merge_asof
to identify which rows have a match on the two keys within the threshold, then boolean indexing:
df2 = (df.assign(Date=pd.to_datetime(df['Date']))
.sort_values(by='Date').reset_index()
)
keep = (pd.merge_asof(df2, df2, on='Date', by=['X1', 'Y1'], suffixes=(None, '_'),
allow_exact_matches=False, tolerance=pd.Timedelta('4W'))
.set_index('index')
['index_'].isna()
)
out = df[keep]
Output:
doc_id X1 Y1 Z1 Date
2 33075059515 30x RU 5.0 2014-01-15 00:00:00+00:00
3 53075235984 50x RU 6.0 2011-01-09 00:00:00+00:00
5 28865465787 30x SO 5.0 2021-06-03 12:00:00+00:00
6 28865465787 50x SO 4.0 2011-02-03 00:00:00+00:00
8 87652548931 50x SO NaN 2016-10-27 12:00:00+00:00
Intermediate keep
:
index
3 True
6 True
2 True
4 False
8 True
5 True
Name: doc_id_, dtype: bool
答案2
得分: 0
Mozway的答案在大部分情况下是有效的。我不得不更改代码的最后一行。总的来说,这是有效的代码:
df2 = (df.assign(Date=pd.to_datetime(df['Date']))
.sort_values(by='Date').reset_index()
)
keep = (pd.merge_asof(df2, df2, on='Date', by=['X1', 'Y1'], suffixes=(None, '_'), allow_exact_matches=False, tolerance= pd.Timedelta('4W')).set_index('index')['index_'].isna())
out = df2[keep.reindex(df2.index, fill_value=False)]
英文:
Mozway's answer worked for the most part. I had to change the final line of code. In total, this is the code that worked:
df2 = (df.assign(Date=pd.to_datetime(df['Date']))
.sort_values(by='Date').reset_index()
)
keep = (pd.merge_asof(df2, df2, on='Date', by=['X1', 'Y1'], suffixes=(None, ''), allow_exact_matches=False, tolerance= pd.Timedelta('4W')).set_index('index')['index'].isna())
out = df2[keep.reindex(df2.index, fill_value=False)]
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论