根据重复项和日期删除行(pandas)

huangapple go评论71阅读模式
英文:

Delete rows based on duplicates AND dates (pandas)

问题

我有一个数据框如下:

df = """
doc_id X1 Y1 Z1 Date
2 33075059515 30x RU 5 2014-01-15T00:00:00Z
3 53075235984 50x RU 6 2011-01-09T00:00:00Z
4 28865465787 30x RU NaN 2014-01-23T00:00:00Z
5 28865465787 30x SO 5 2021-06-03T12:00:00Z
6 28865465787 50x SO 4 2011-02-03T00:00:00Z
8 87652548931 50x SO NaN 2016-10-27T12:00:00Z


我想要删除一行,只有当它在列X1和Y1中是另一行的重复项,并且它在'日期'列中的日期与另一行的日期相隔不超过4周时。

希望结果如下:

df = """
doc_id X1 Y1 Z1 Date
2 33075059515 30x RU 5 2014-01-15T00:00:00Z
3 53075235984 50x RU 6 2011-01-09T00:00:00Z
5 28865465787 30x SO 5 2021-06-03T12:00:00Z
6 28865465787 50x SO 4 2011-02-03T00:00:00Z
8 87652548931 50x SO NaN 2016-10-27T12:00:00Z


删除X1和Y1列中的重复项很简单,但我无法弄清如何将其与日期的滚动4周窗口结合起来。
英文:

I have a dataframe as such:

df = """
    doc_id      X1    Y1  Z1        Date
2  33075059515  30x   RU  5         2014-01-15T00:00:00Z	
3  53075235984  50x   RU  6         2011-01-09T00:00:00Z
4  28865465787  30x   RU  NaN       2014-01-23T00:00:00Z	
5  28865465787  30x   SO  5         2021-06-03T12:00:00Z	
6  28865465787  50x   SO  4         2011-02-03T00:00:00Z
8  87652548931  50x   SO  NaN       2016-10-27T12:00:00Z	

I would like to delete a row only when it is a duplicate of another row in column X1 and Y1 AND when it has a date in the 'Date' column that is within 4 weeks of another row.

The hope is for it to look like this:

df = """
    doc_id      X1    Y1  Z1        Date
2  33075059515  30x   RU  5         2014-01-15T00:00:00Z	
3  53075235984  50x   RU  6         2011-01-09T00:00:00Z
5  28865465787  30x   SO  5         2021-06-03T12:00:00Z	
6  28865465787  50x   SO  4         2011-02-03T00:00:00Z
8  87652548931  50x   SO  NaN       2016-10-27T12:00:00Z	

Deleting duplicates for column X1 and Y1 is simple, but I cannot figure out how to combine it with a rolling 4 week window for the date.

EDIT as per Mozway's request. This is a new example of the input:

    doc_id      X1    Y1  Z1        Date
1  68754534654  30x   RU  5         2014-01-14T00:00:00Z
2  33075059515  30x   RU  5         2014-01-15T00:00:00Z	
3  53075235984  50x   RU  6         2011-01-09T00:00:00Z
4  28865465787  30x   RU  NaN       2014-01-23T00:00:00Z	
5  28865465787  30x   SO  5         2021-06-03T12:00:00Z	
6  28865465787  50x   SO  4         2011-02-03T00:00:00Z
7  78764599568  50x   SO  NaN       2016-10-27T12:00:00Z
8  87652548931  50x   SO  NaN       2016-10-27T12:00:00Z

Here an example of the output I am getting which is not yet the desired output (the rows in bold should also have been removed):

    doc_id      X1    Y1  Z1        Date
1  68754534654  30x   RU  5         2014-01-14T00:00:00Z	
3  53075235984  50x   RU  6         2011-01-09T00:00:00Z
**4  28865465787  30x   RU  NaN       2014-01-23T00:00:00Z**	
5  28865465787  30x   SO  5         2021-06-03T12:00:00Z
6  28865465787  50x   SO  4         2011-02-03T00:00:00Z	
7  78764599568  50x   SO  NaN       2016-10-27T12:00:00Z
**8  87652548931  50x   SO  NaN       2016-10-27T12:00:00Z**

</details>


# 答案1
**得分**: 2

使用[`merge_asof`](https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html)来识别两个键上哪些行在阈值内有匹配,然后进行[布尔索引](https://pandas.pydata.org/docs/user_guide/indexing.html#boolean-indexing):

```python
df2 = (df.assign(Date=pd.to_datetime(df['Date']))
         .sort_values(by='Date').reset_index()
      )

keep = (pd.merge_asof(df2, df2, on='Date', by=['X1', 'Y1'], suffixes=(None, '_'),
                      allow_exact_matches=False, tolerance=pd.Timedelta('4W'))
          .set_index('index')
          ['index_'].isna()
       )

out = df[keep]

输出:

        doc_id   X1  Y1   Z1                      Date
2  33075059515  30x  RU  5.0 2014-01-15 00:00:00+00:00
3  53075235984  50x  RU  6.0 2011-01-09 00:00:00+00:00
5  28865465787  30x  SO  5.0 2021-06-03 12:00:00+00:00
6  28865465787  50x  SO  4.0 2011-02-03 00:00:00+00:00
8  87652548931  50x  SO  NaN 2016-10-27 12:00:00+00:00

中间的keep:

index
3     True
6     True
2     True
4    False
8     True
5     True
Name: doc_id_, dtype: bool
英文:

Use a merge_asof to identify which rows have a match on the two keys within the threshold, then boolean indexing:

df2 = (df.assign(Date=pd.to_datetime(df[&#39;Date&#39;]))
         .sort_values(by=&#39;Date&#39;).reset_index()
      )

keep = (pd.merge_asof(df2, df2, on=&#39;Date&#39;, by=[&#39;X1&#39;, &#39;Y1&#39;], suffixes=(None, &#39;_&#39;),
                      allow_exact_matches=False, tolerance=pd.Timedelta(&#39;4W&#39;))
          .set_index(&#39;index&#39;)
          [&#39;index_&#39;].isna()
       )

out = df[keep]

Output:

        doc_id   X1  Y1   Z1                      Date
2  33075059515  30x  RU  5.0 2014-01-15 00:00:00+00:00
3  53075235984  50x  RU  6.0 2011-01-09 00:00:00+00:00
5  28865465787  30x  SO  5.0 2021-06-03 12:00:00+00:00
6  28865465787  50x  SO  4.0 2011-02-03 00:00:00+00:00
8  87652548931  50x  SO  NaN 2016-10-27 12:00:00+00:00

Intermediate keep:

index
3     True
6     True
2     True
4    False
8     True
5     True
Name: doc_id_, dtype: bool

答案2

得分: 0

Mozway的答案在大部分情况下是有效的。我不得不更改代码的最后一行。总的来说,这是有效的代码:

df2 = (df.assign(Date=pd.to_datetime(df['Date']))
         .sort_values(by='Date').reset_index()
      )

keep = (pd.merge_asof(df2, df2, on='Date', by=['X1', 'Y1'], suffixes=(None, '_'), allow_exact_matches=False, tolerance= pd.Timedelta('4W')).set_index('index')['index_'].isna())

out = df2[keep.reindex(df2.index, fill_value=False)]
英文:

Mozway's answer worked for the most part. I had to change the final line of code. In total, this is the code that worked:

df2 = (df.assign(Date=pd.to_datetime(df['Date']))
.sort_values(by='Date').reset_index()
)

keep = (pd.merge_asof(df2, df2, on='Date', by=['X1', 'Y1'], suffixes=(None, ''), allow_exact_matches=False, tolerance= pd.Timedelta('4W')).set_index('index')['index'].isna())

out = df2[keep.reindex(df2.index, fill_value=False)]

huangapple
  • 本文由 发表于 2023年3月9日 20:44:36
  • 转载请务必保留本文链接:https://go.coder-hub.com/75684807.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定