How to remove duplicates within a time interval

Question

I have two pandas DataFrames, for example:

import pandas as pd

df1 = pd.DataFrame({
    'IN': ['2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01'],
    'OUT': ['2023-01-10', '2023-02-10', '2023-03-10', '2023-04-10'],
    'Ticker': ['AAPL', 'AAPL', 'GOOG', 'GOOG']
})

df2 = pd.DataFrame({
    'IN': ['2023-01-05', '2023-05-01', '2023-02-05', '2023-05-01'],
    'OUT': ['2023-01-15', '2023-05-15', '2023-02-15', '2023-05-15'],
    'Ticker': ['AAPL', 'GOOG', 'MSFT', 'XXXX']
})

The question is how to remove from df2 (or collect the index of, for a later drop) those records that are already covered by a df1 record for the same ticker, i.e. that fall within that record's IN-OUT interval (think of the df1 rows as open trades).

For example, the first trade/row in df1 is AAPL from 2023-01-01 to 2023-01-10, so the first trade in df2 must be removed: its interval, 2023-01-05 to 2023-01-15, begins while that AAPL trade is still open. The second trade/row in df2 must be kept.

Is there a simple way to do this without iterating over the rows?

I have tried something like:

df1 = df1[~((df1['Ticker'].isin(df2['Ticker'])) & (df1['IN'].between(df2['OUT'], df2['OUT'])))]

but it did not give the right result and, besides, it does not work when the two dataframes have different numbers of rows.
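A likely reason for both symptoms, shown as a minimal sketch: when `between` receives Series as bounds, it compares row by row against the identically labeled row of the other Series, so it never looks across all of df2. Using `df2['OUT']` for both bounds also reduces the test to an equality check. The variable names `mask` and `raised` are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'IN': ['2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01'],
    'OUT': ['2023-01-10', '2023-02-10', '2023-03-10', '2023-04-10'],
    'Ticker': ['AAPL', 'AAPL', 'GOOG', 'GOOG'],
})
df2 = pd.DataFrame({
    'IN': ['2023-01-05', '2023-05-01', '2023-02-05', '2023-05-01'],
    'OUT': ['2023-01-15', '2023-05-15', '2023-02-15', '2023-05-15'],
    'Ticker': ['AAPL', 'GOOG', 'MSFT', 'XXXX'],
})

# With Series bounds, between() is an element-wise comparison:
# (df1['IN'] >= df2['OUT']) & (df1['IN'] <= df2['OUT']), row i against row i.
mask = df1['IN'].between(df2['OUT'], df2['OUT'])
print(mask.any())  # is any df1 IN equal to the OUT of the same-position df2 row?

# Series with different index labels (e.g. different lengths) cannot be
# compared element-wise, which is why unequal row counts fail.
try:
    df1['IN'].between(df2['OUT'].iloc[:3], df2['OUT'].iloc[:3])
    raised = False
except ValueError:
    raised = True
print(raised)
```

Because the comparison is positional, the ticker test and the date test are also evaluated independently of each other, so a match on one df2 row can be combined with a date from an unrelated row.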

Answer 1

Score: 0

You can use merge to pair each row of df2 with the df1 rows that have the same ticker, then use query to select the rows you want to drop:

idx_to_drop = (df2.reset_index().merge(df1, on='Ticker')
                  .query('(IN_y > IN_x)')['index'].tolist())
out = df2.drop(idx_to_drop)

Output:

>>> out
          IN        OUT Ticker
1 2023-05-01 2023-05-15   GOOG
2 2023-02-05 2023-02-15   MSFT
3 2023-05-01 2023-05-15   XXXX

Intermediate step:

>>> df2.reset_index().merge(df1, on='Ticker')
   index       IN_x      OUT_x Ticker       IN_y      OUT_y
0      0 2023-01-05 2023-01-15   AAPL 2023-01-01 2023-01-10
1      0 2023-01-05 2023-01-15   AAPL 2023-02-01 2023-02-10
2      1 2023-05-01 2023-05-15   GOOG 2023-03-01 2023-03-10
3      1 2023-05-01 2023-05-15   GOOG 2023-04-01 2023-04-10
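
In the merged frame the `_x` columns come from df2 and the `_y` columns from df1, so the condition `IN_y > IN_x` compares start dates only. On this sample it selects index 0 because of the later AAPL trade (2023-02-01), not the overlapping one. The following is a sketch of an interval-based variant, assuming the intended rule is "drop a df2 trade whose IN date falls within the IN-OUT interval of a df1 trade for the same ticker"; that rule is an interpretation of the question, not something the answer states.

```python
import pandas as pd

df1 = pd.DataFrame({
    'IN': ['2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01'],
    'OUT': ['2023-01-10', '2023-02-10', '2023-03-10', '2023-04-10'],
    'Ticker': ['AAPL', 'AAPL', 'GOOG', 'GOOG'],
})
df2 = pd.DataFrame({
    'IN': ['2023-01-05', '2023-05-01', '2023-02-05', '2023-05-01'],
    'OUT': ['2023-01-15', '2023-05-15', '2023-02-15', '2023-05-15'],
    'Ticker': ['AAPL', 'GOOG', 'MSFT', 'XXXX'],
})

# Parse the date strings so the comparisons below are real date comparisons.
for frame in (df1, df2):
    for col in ('IN', 'OUT'):
        frame[col] = pd.to_datetime(frame[col])

# Pair every df2 row with every df1 row of the same ticker (inner join).
# reset_index() keeps df2's original index in a column named 'index'.
pairs = df2.reset_index().merge(df1, on='Ticker')

# Assumed rule: the df2 trade starts while the df1 trade is open.
overlapping = pairs.query('(IN_y <= IN_x) & (IN_x <= OUT_y)')
idx_to_drop = overlapping['index'].unique()

out = df2.drop(idx_to_drop)
print(out)
```

If any overlap between the two intervals should count, rather than only the start date, the condition could be changed to `(IN_x <= OUT_y) & (IN_y <= OUT_x)`. The merge works for dataframes of different lengths, since rows are matched by ticker and not by position.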

huangapple
  • Posted on June 29, 2023 at 17:14:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76579690.html