How to remove duplicates within a time interval

Question

I have two pandas DataFrames, for example:

    import pandas as pd

    # Existing trades (think of these as open trades)
    df1 = pd.DataFrame({
        'IN': ['2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01'],
        'OUT': ['2023-01-10', '2023-02-10', '2023-03-10', '2023-04-10'],
        'Ticker': ['AAPL', 'AAPL', 'GOOG', 'GOOG']
    })
    # Candidate trades to be filtered against df1
    df2 = pd.DataFrame({
        'IN': ['2023-01-05', '2023-05-01', '2023-02-05', '2023-05-01'],
        'OUT': ['2023-01-15', '2023-05-15', '2023-02-15', '2023-05-15'],
        'Ticker': ['AAPL', 'GOOG', 'MSFT', 'XXXX']
    })

The question is how to remove from df2 (or collect the index of, for a later drop) those records whose ticker already has a trade in df1 (think of these as open trades) during the IN-OUT interval.

For example, the first trade/row in df1 is AAPL from 2023-01-01 to 2023-01-10, so the first trade in df2 must be removed because its interval, 2023-01-05 to 2023-01-15, starts while that trade is still open. The second trade/row, however, must be kept.
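The rule described here reads like an interval-overlap test, although the question does not state it formally. As a minimal sketch under that assumption, two closed intervals overlap when each one starts no later than the other ends. The helper name `intervals_overlap` is hypothetical and used only for illustration.

```python
import pandas as pd

def intervals_overlap(in_a, out_a, in_b, out_b):
    """Return True if the closed intervals [in_a, out_a] and [in_b, out_b] overlap."""
    # Parse the date strings so the comparison is on real timestamps
    in_a, out_a, in_b, out_b = map(pd.Timestamp, (in_a, out_a, in_b, out_b))
    # Each interval must start no later than the other one ends
    return in_a <= out_b and in_b <= out_a

# First AAPL trade in df1 vs. the AAPL trade in df2
first_pair = intervals_overlap('2023-01-01', '2023-01-10', '2023-01-05', '2023-01-15')
# Second AAPL trade in df1 vs. the same df2 trade
second_pair = intervals_overlap('2023-02-01', '2023-02-10', '2023-01-05', '2023-01-15')
```

With the sample data, `first_pair` is true and `second_pair` is false, which is consistent with removing the first df2 row only because of the first df1 trade.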

Is there a simple way to do this without iterating?

I have tried something like:

    df1 = df1[~((df1['Ticker'].isin(df2['Ticker'])) & (df1['IN'].between(df2['OUT'], df2['OUT'])))]

but it did not give the right result, and it does not work at all when the two DataFrames have different numbers of rows.
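One likely reason for both symptoms is that `Series.between` with Series bounds compares row by row: element i of the column is checked against element i of each bound, not against every row of the other frame. The sketch below uses small made-up integer Series (not the trade data) to show this pairing, and what happens when the bounds have a different length.

```python
import pandas as pd

# Made-up values purely for illustration
values = pd.Series([1, 5, 9])
lower = pd.Series([0, 6, 8])
upper = pd.Series([2, 7, 10])

# Element i of `values` is compared only with element i of the bounds
rowwise = values.between(lower, upper).tolist()

# Bounds with a different length have a different index, so pandas
# cannot pair the rows and the comparison raises ValueError
try:
    values.between(pd.Series([0, 6]), pd.Series([2, 7]))
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
```

Here `rowwise` is `[True, False, True]`, so a match in a different row position would never be found. That is why a row-aligned mask cannot express "any matching trade in the other frame".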

Answer 1

Score: 0

You can use merge to pair each row of df2 with the df1 rows that share its ticker, then use query to select the rows you want to drop:

    idx_to_drop = (df2.reset_index().merge(df1, on='Ticker')
                      .query('(IN_y > IN_x)')['index'].tolist())
    out = df2.drop(idx_to_drop)

Output:

    >>> out
               IN         OUT Ticker
    1  2023-05-01  2023-05-15   GOOG
    2  2023-02-05  2023-02-15   MSFT
    3  2023-05-01  2023-05-15   XXXX

Intermediate step:

    >>> df2.reset_index().merge(df1, on='Ticker')
       index        IN_x       OUT_x Ticker        IN_y       OUT_y
    0      0  2023-01-05  2023-01-15   AAPL  2023-01-01  2023-01-10
    1      0  2023-01-05  2023-01-15   AAPL  2023-02-01  2023-02-10
    2      1  2023-05-01  2023-05-15   GOOG  2023-03-01  2023-03-10
    3      1  2023-05-01  2023-05-15   GOOG  2023-04-01  2023-04-10
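Note that `IN_y > IN_x` compares start dates only. On the sample data it selects index 0 through the second AAPL pair (2023-02-01 > 2023-01-05), not through the pair whose intervals actually overlap. A variant that tests overlap explicitly might look like the sketch below. It assumes "duplicate" means the same ticker with overlapping closed IN-OUT intervals, and the suffixes `_new` and `_open` are names chosen here for readability, not part of the answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    'IN': ['2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01'],
    'OUT': ['2023-01-10', '2023-02-10', '2023-03-10', '2023-04-10'],
    'Ticker': ['AAPL', 'AAPL', 'GOOG', 'GOOG']
})
df2 = pd.DataFrame({
    'IN': ['2023-01-05', '2023-05-01', '2023-02-05', '2023-05-01'],
    'OUT': ['2023-01-15', '2023-05-15', '2023-02-15', '2023-05-15'],
    'Ticker': ['AAPL', 'GOOG', 'MSFT', 'XXXX']
})

# Parse the date strings so comparisons use real timestamps
for frame in (df1, df2):
    for col in ('IN', 'OUT'):
        frame[col] = pd.to_datetime(frame[col])

# Pair every df2 row with every df1 row that has the same ticker;
# reset_index() keeps the original df2 index in a column named 'index'
pairs = df2.reset_index().merge(df1, on='Ticker', suffixes=('_new', '_open'))

# Assumed rule: closed intervals overlap when each starts before the other ends
overlaps = (pairs['IN_new'] <= pairs['OUT_open']) & (pairs['IN_open'] <= pairs['OUT_new'])

# A df2 row may match several df1 rows, so de-duplicate the index labels
idx_to_drop = pairs.loc[overlaps, 'index'].unique().tolist()
out = df2.drop(idx_to_drop)
```

On the sample data this drops only index 0, so `out` keeps rows 1, 2 and 3, the same result as shown above. The merge builds one row per same-ticker pair, so it handles frames with different numbers of rows, at the cost of extra memory when a ticker has many trades in both frames.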

huangapple
  • Posted on June 29, 2023, 17:14:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76579690.html