2023年3月9日 20:44:36go评论81阅读模式

英文:

Delete rows based on duplicates AND dates (pandas)

问题

我有一个数据框如下：

df = """
doc_id X1 Y1 Z1 Date
2 33075059515 30x RU 5 2014-01-15T00:00:00Z
3 53075235984 50x RU 6 2011-01-09T00:00:00Z
4 28865465787 30x RU NaN 2014-01-23T00:00:00Z
5 28865465787 30x SO 5 2021-06-03T12:00:00Z
6 28865465787 50x SO 4 2011-02-03T00:00:00Z
8 87652548931 50x SO NaN 2016-10-27T12:00:00Z


我想要删除一行，只有当它在列X1和Y1中是另一行的重复项，并且它在'日期'列中的日期与另一行的日期相隔不超过4周时。

希望结果如下：

df = """
doc_id X1 Y1 Z1 Date
2 33075059515 30x RU 5 2014-01-15T00:00:00Z
3 53075235984 50x RU 6 2011-01-09T00:00:00Z
5 28865465787 30x SO 5 2021-06-03T12:00:00Z
6 28865465787 50x SO 4 2011-02-03T00:00:00Z
8 87652548931 50x SO NaN 2016-10-27T12:00:00Z


删除X1和Y1列中的重复项很简单，但我无法弄清如何将其与日期的滚动4周窗口结合起来。

英文:

I have a dataframe as such:

df = &quot;&quot;&quot;
    doc_id      X1    Y1  Z1        Date
2  33075059515  30x   RU  5         2014-01-15T00:00:00Z	
3  53075235984  50x   RU  6         2011-01-09T00:00:00Z
4  28865465787  30x   RU  NaN       2014-01-23T00:00:00Z	
5  28865465787  30x   SO  5         2021-06-03T12:00:00Z	
6  28865465787  50x   SO  4         2011-02-03T00:00:00Z
8  87652548931  50x   SO  NaN       2016-10-27T12:00:00Z

I would like to delete a row only when it is a duplicate of another row in column X1 and Y1 AND when it has a date in the 'Date' column that is within 4 weeks of another row.

The hope is for it to look like this:

df = &quot;&quot;&quot;
    doc_id      X1    Y1  Z1        Date
2  33075059515  30x   RU  5         2014-01-15T00:00:00Z	
3  53075235984  50x   RU  6         2011-01-09T00:00:00Z
5  28865465787  30x   SO  5         2021-06-03T12:00:00Z	
6  28865465787  50x   SO  4         2011-02-03T00:00:00Z
8  87652548931  50x   SO  NaN       2016-10-27T12:00:00Z

Deleting duplicates for column X1 and Y1 is simple, but I cannot figure out how to combine it with a rolling 4 week window for the date.

EDIT as per Mozway's request. This is a new example of the input:

    doc_id      X1    Y1  Z1        Date
1  68754534654  30x   RU  5         2014-01-14T00:00:00Z
2  33075059515  30x   RU  5         2014-01-15T00:00:00Z	
3  53075235984  50x   RU  6         2011-01-09T00:00:00Z
4  28865465787  30x   RU  NaN       2014-01-23T00:00:00Z	
5  28865465787  30x   SO  5         2021-06-03T12:00:00Z	
6  28865465787  50x   SO  4         2011-02-03T00:00:00Z
7  78764599568  50x   SO  NaN       2016-10-27T12:00:00Z
8  87652548931  50x   SO  NaN       2016-10-27T12:00:00Z

Here an example of the output I am getting which is not yet the desired output (the rows in bold should also have been removed):

    doc_id      X1    Y1  Z1        Date
1  68754534654  30x   RU  5         2014-01-14T00:00:00Z	
3  53075235984  50x   RU  6         2011-01-09T00:00:00Z
**4  28865465787  30x   RU  NaN       2014-01-23T00:00:00Z**	
5  28865465787  30x   SO  5         2021-06-03T12:00:00Z
6  28865465787  50x   SO  4         2011-02-03T00:00:00Z	
7  78764599568  50x   SO  NaN       2016-10-27T12:00:00Z
**8  87652548931  50x   SO  NaN       2016-10-27T12:00:00Z**

</details>


# 答案1
**得分**: 2

使用[`merge_asof`](https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html)来识别两个键上哪些行在阈值内有匹配，然后进行[布尔索引](https://pandas.pydata.org/docs/user_guide/indexing.html#boolean-indexing)：

```python
df2 = (df.assign(Date=pd.to_datetime(df['Date']))
         .sort_values(by='Date').reset_index()
      )

keep = (pd.merge_asof(df2, df2, on='Date', by=['X1', 'Y1'], suffixes=(None, '_'),
                      allow_exact_matches=False, tolerance=pd.Timedelta('4W'))
          .set_index('index')
          ['index_'].isna()
       )

out = df[keep]

输出:

        doc_id   X1  Y1   Z1                      Date
2  33075059515  30x  RU  5.0 2014-01-15 00:00:00+00:00
3  53075235984  50x  RU  6.0 2011-01-09 00:00:00+00:00
5  28865465787  30x  SO  5.0 2021-06-03 12:00:00+00:00
6  28865465787  50x  SO  4.0 2011-02-03 00:00:00+00:00
8  87652548931  50x  SO  NaN 2016-10-27 12:00:00+00:00

中间的keep:

index
3     True
6     True
2     True
4    False
8     True
5     True
Name: doc_id_, dtype: bool

英文:

Use a merge_asof to identify which rows have a match on the two keys within the threshold, then boolean indexing:

df2 = (df.assign(Date=pd.to_datetime(df[&#39;Date&#39;]))
         .sort_values(by=&#39;Date&#39;).reset_index()
      )

keep = (pd.merge_asof(df2, df2, on=&#39;Date&#39;, by=[&#39;X1&#39;, &#39;Y1&#39;], suffixes=(None, &#39;_&#39;),
                      allow_exact_matches=False, tolerance=pd.Timedelta(&#39;4W&#39;))
          .set_index(&#39;index&#39;)
          [&#39;index_&#39;].isna()
       )

out = df[keep]

Output:

        doc_id   X1  Y1   Z1                      Date
2  33075059515  30x  RU  5.0 2014-01-15 00:00:00+00:00
3  53075235984  50x  RU  6.0 2011-01-09 00:00:00+00:00
5  28865465787  30x  SO  5.0 2021-06-03 12:00:00+00:00
6  28865465787  50x  SO  4.0 2011-02-03 00:00:00+00:00
8  87652548931  50x  SO  NaN 2016-10-27 12:00:00+00:00

Intermediate keep:

index
3     True
6     True
2     True
4    False
8     True
5     True
Name: doc_id_, dtype: bool

答案2

得分: 0

Mozway的答案在大部分情况下是有效的。我不得不更改代码的最后一行。总的来说，这是有效的代码：

df2 = (df.assign(Date=pd.to_datetime(df['Date']))
         .sort_values(by='Date').reset_index()
      )

keep = (pd.merge_asof(df2, df2, on='Date', by=['X1', 'Y1'], suffixes=(None, '_'), allow_exact_matches=False, tolerance= pd.Timedelta('4W')).set_index('index')['index_'].isna())

out = df2[keep.reindex(df2.index, fill_value=False)]

英文:

Mozway's answer worked for the most part. I had to change the final line of code. In total, this is the code that worked:

df2 = (df.assign(Date=pd.to_datetime(df['Date']))
.sort_values(by='Date').reset_index()
)

keep = (pd.merge_asof(df2, df2, on='Date', by=['X1', 'Y1'], suffixes=(None, ''), allow_exact_matches=False, tolerance= pd.Timedelta('4W')).set_index('index')['index'].isna())

out = df2[keep.reindex(df2.index, fill_value=False)]

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

根据重复项和日期删除行（pandas）

问题

答案2

从现有的数据框中找到唯一日期，并创建一个带有相应列值的新CSV。

如何在Google表格中按每周的平均值？

将一个 Pandas 的 groupby 对象保存到一个 CSV 文件中。

找到 SQL 中一个月中第 n 次出现的某一天

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论