if first value is zero in one dataframe set previous values to 1 in another dataframe on condition

Question

I have two dataframes, df1 and df2, and I want to change the values in df2 based on a condition from df1.

df1:

       name       date  flag
    0   abc  4/11/2023     1
    1   xyz   2/8/2023     0

df2:

       name       date  flag
    0   xyz   2/6/2023     0
    1   xyz   2/7/2023     0
    2   xyz   2/8/2023     0
    3   xyz   2/9/2023     1
    4   xyz  2/10/2023     1
    5   xyz  2/11/2023     1
    6   xyz  2/12/2023     1
    7   xyz  2/13/2023     1

In df1, the flag for 'xyz' is 0 on 2/8/2023, so in df2 the flag should be set to 1 for the 'xyz' rows dated before that date.
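
For anyone who wants to reproduce the problem, the two frames above can be built as in the sketch below. The values are copied from the tables in the question; the explicit `format="%m/%d/%Y"` is an assumption that the dates are month/day/year, which matches the sample.

```python
import pandas as pd

# Sample data copied from the question; dates are month/day/year strings.
df1 = pd.DataFrame({
    "name": ["abc", "xyz"],
    "date": ["4/11/2023", "2/8/2023"],
    "flag": [1, 0],
})
df2 = pd.DataFrame({
    "name": ["xyz"] * 8,
    "date": ["2/6/2023", "2/7/2023", "2/8/2023", "2/9/2023",
             "2/10/2023", "2/11/2023", "2/12/2023", "2/13/2023"],
    "flag": [0, 0, 0, 1, 1, 1, 1, 1],
})

# Parse the strings so that comparisons are chronological rather than lexical
# (as strings, "2/10/2023" would sort before "2/6/2023").
df1["date"] = pd.to_datetime(df1["date"], format="%m/%d/%Y")
df2["date"] = pd.to_datetime(df2["date"], format="%m/%d/%Y")

# Rows of df2 that fall before the 'xyz' date in df1.
xyz_date = df1.loc[df1["name"] == "xyz", "date"].iloc[0]
earlier = df2[df2["date"] < xyz_date]
print(earlier)
```

Only the rows dated 2/6/2023 and 2/7/2023 fall before the df1 date, so those are the rows whose flag should change.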

Expected output: (image from the original post, not preserved here)

I am new to Python and would like to do this using pandas functions.

Answer 1

Score: 1

The exact logic is unclear, but you need to use merge_asof to determine, for each name, whether there is a match with a later date:

    import pandas as pd

    # ensure datetime
    df1['date'] = pd.to_datetime(df1['date'], dayfirst=False)
    df2['date'] = pd.to_datetime(df2['date'], dayfirst=False)

    # for each df2 row, look for a df1 row with the same name and a strictly later date
    out = (pd.merge_asof(df2.reset_index().sort_values(by='date'),
                         df1.sort_values(by='date'),
                         by='name', on='date', direction='forward',
                         allow_exact_matches=False
                         )
             # restore the original row order of df2
             .set_index('index').reindex(df2.index)
             # where a match exists (flag_y is not NaN), set the flag to 1
             .assign(flag=lambda d: d.pop('flag_x').mask(d.pop('flag_y').notna(), 1))
          )

Output:

          name       date  flag
    index
    0      xyz 2023-02-06     1
    1      xyz 2023-02-07     1
    2      xyz 2023-02-08     0
    3      xyz 2023-02-09     1
    4      xyz 2023-02-10     1
    5      xyz 2023-02-11     1
    6      xyz 2023-02-12     1
    7      xyz 2023-02-13     1

Intermediate before the assign:

          name       date  flag_x  flag_y
    index
    0      xyz 2023-02-06       0     0.0
    1      xyz 2023-02-07       0     0.0
    2      xyz 2023-02-08       0     NaN
    3      xyz 2023-02-09       1     NaN
    4      xyz 2023-02-10       1     NaN
    5      xyz 2023-02-11       1     NaN
    6      xyz 2023-02-12       1     NaN
    7      xyz 2023-02-13       1     NaN
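
The NaN on the row dated 2023-02-08 comes from the combination of `direction='forward'` and `allow_exact_matches=False`. The sketch below isolates that behaviour on a three-row frame; `left`, `right` and `matched` are names made up for this illustration, not part of the answer's code.

```python
import pandas as pd

# Three df2-like rows around the df1 date.
left = pd.DataFrame({
    "name": ["xyz", "xyz", "xyz"],
    "date": pd.to_datetime(["2023-02-07", "2023-02-08", "2023-02-09"]),
})
# A single df1-like row.
right = pd.DataFrame({
    "name": ["xyz"],
    "date": pd.to_datetime(["2023-02-08"]),
    "flag": [0],
})

# direction="forward": each left row looks for the nearest right row
# (same name) at a later date.
# allow_exact_matches=False: a right row with an equal date does not count,
# so only strictly later dates can match.
matched = pd.merge_asof(left, right, on="date", by="name",
                        direction="forward", allow_exact_matches=False)
print(matched)
```

Only the 2023-02-07 row finds a later df1 date, so it is the only one with a non-missing `flag`; the exact-date row and the later row are left as NaN.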

Note that you can use more complex logic if needed: the value in "flag_y" is the flag of the matching df1 row (here the row dated 2023-02-08, matched by indices 0 and 1).
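
As one possible reading of "more complex logic", the sketch below sets the flag to 1 only when the matching df1 row itself has flag 0, which is how the question phrases the condition. This is an assumption about the intended rule, and the small frames (including the extra 'abc' row in df2) are made up for the illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "name": ["abc", "xyz"],
    "date": pd.to_datetime(["2023-04-11", "2023-02-08"]),
    "flag": [1, 0],
})
# Hypothetical df2 with one 'abc' row added to show the difference.
df2 = pd.DataFrame({
    "name": ["abc", "xyz", "xyz", "xyz"],
    "date": pd.to_datetime(["2023-04-01", "2023-02-07",
                            "2023-02-08", "2023-02-09"]),
    "flag": [0, 0, 0, 1],
})

merged = (pd.merge_asof(df2.reset_index().sort_values(by="date"),
                        df1.sort_values(by="date"),
                        by="name", on="date", direction="forward",
                        allow_exact_matches=False,
                        suffixes=("", "_df1"))
            .set_index("index").reindex(df2.index))

# eq(0) is False for NaN, so unmatched rows are left alone; rows matched
# to a df1 row whose flag is 1 (here 'abc') are left alone as well.
hit = merged["flag_df1"].eq(0)
out = df2.assign(flag=df2["flag"].mask(hit, 1))
print(out)
```

With this rule the 'abc' row keeps its flag of 0 even though a later 'abc' date exists in df1, because that df1 row has flag 1.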

Only one date per name in df1, or only considering the max date

If df1 only has one date per name, then you can simplify to:

    df1['date'] = pd.to_datetime(df1['date'], dayfirst=False)
    df2['date'] = pd.to_datetime(df2['date'], dayfirst=False)
    # True where the df2 date is earlier than the df1 date for the same name
    m = df2['date'].lt(df2['name'].map(df1.set_index('name')['date']))
    df2.loc[m, 'flag'] = 1
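
Run end to end on the sample data from the question, the simplified approach looks like the sketch below. It assumes each name appears only once in df1, since `map` needs a uniquely indexed lookup Series; `cutoff` is a name introduced here for readability.

```python
import pandas as pd

df1 = pd.DataFrame({
    "name": ["abc", "xyz"],
    "date": pd.to_datetime(["2023-04-11", "2023-02-08"]),
    "flag": [1, 0],
})
df2 = pd.DataFrame({
    "name": ["xyz"] * 8,
    "date": pd.date_range("2023-02-06", periods=8, freq="D"),
    "flag": [0, 0, 0, 1, 1, 1, 1, 1],
})

# For every df2 row, look up the df1 date that belongs to its name.
cutoff = df2["name"].map(df1.set_index("name")["date"])

# Rows strictly before that date get their flag set to 1.
m = df2["date"].lt(cutoff)
df2.loc[m, "flag"] = 1

print(df2["flag"].tolist())  # [1, 1, 0, 1, 1, 1, 1, 1]
```

Names in df2 that do not occur in df1 map to NaT, the comparison is then False, and those rows are left unchanged.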

Or, if there are several dates and you only want to consider the max date per name:

    # compare each df2 date with the latest df1 date for the same name
    m = df2['date'].lt(df2['name'].map(df1.groupby('name')['date'].max()))
    df2.loc[m, 'flag'] = 1
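
A small sketch of the max-date variant follows. The data is hypothetical (the question's df1 has only one row per name): 'xyz' is given two dates in df1 so that the effect of `groupby(...).max()` is visible.

```python
import pandas as pd

# Hypothetical df1 with two dates for 'xyz'.
df1 = pd.DataFrame({
    "name": ["xyz", "xyz", "abc"],
    "date": pd.to_datetime(["2023-02-08", "2023-02-10", "2023-04-11"]),
    "flag": [0, 0, 1],
})
df2 = pd.DataFrame({
    "name": ["xyz"] * 4,
    "date": pd.to_datetime(["2023-02-07", "2023-02-09",
                            "2023-02-10", "2023-02-11"]),
    "flag": [0, 0, 0, 1],
})

# One date per name: the latest one found in df1.
latest = df1.groupby("name")["date"].max()

# Flag every df2 row that is strictly earlier than that latest date.
m = df2["date"].lt(df2["name"].map(latest))
df2.loc[m, "flag"] = 1

print(df2["flag"].tolist())  # [1, 1, 0, 1]
```

The row dated 2023-02-09 is flagged here because it precedes the latest 'xyz' date (2023-02-10), even though it comes after the earlier one.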

huangapple
  • Posted on July 3, 2023, 19:09:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76604175.html