Python (pandas) – check if value in one df is between ANY pair in another (unequal) df

Question

As a minimal example, consider the following two DataFrames (note that their sizes are not equal):

    df
       min_val  max_val
    0        0        4
    1        5        9
    2       10       14
    3       15       19
    4       20       24
    5       25       29

    df1
       val
    0    1
    1    6
    2    2
    3  NaN
    4   34
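
For anyone who wants to reproduce the example, the two frames above can be built roughly as follows. This is only a sketch: the dtypes are an assumption (the NaN forces `val` to float64, while the bounds stay integers).

```python
import numpy as np
import pandas as pd

# Interval bounds: one (min_val, max_val) pair per row.
df = pd.DataFrame({
    "min_val": [0, 5, 10, 15, 20, 25],
    "max_val": [4, 9, 14, 19, 24, 29],
})

# Values to look up; the missing entry makes this column float64.
df1 = pd.DataFrame({"val": [1, 6, 2, np.nan, 34]})
```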

I am trying to check whether each value in df1 falls within any pair in df. The output should be a new DataFrame containing the val column of df1, the pair within which the value was found, and an extra column with a name tag, say 'within' or 'not within'. So the output should look like:

       val  min_val  max_val     nameTag
    0    1        0        4      within
    1    6        5        9      within
    2    2        0        4      within
    3  NaN      NaN      NaN  not within
    4   34      NaN      NaN  not within

So far, every solution I have found compares the two frames row by row, so it misses val 2 in df1, which falls within the pair 0-4 in df (some posts that did not work for me: HERE and HERE).

Any pointers/advice/solutions will be much appreciated.
Thanks

Answer 1

Score: 3

I would use a merge_asof:

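    import numpy as np
    import pandas as pd

    # For each val, find the last pair whose min_val <= val (both sides must be sorted, without NaN keys)
    tmp = pd.merge_asof(df1.reset_index().sort_values(by='val').dropna(),
                        df.sort_values(by='min_val').astype(float),
                        left_on='val', right_on='min_val'
                       ).set_index('index').reindex(df1.index)
    # The value is inside that pair only if it does not exceed max_val
    df1['nameTag'] = np.where(tmp['val'].le(tmp['max_val']), 'within', 'not within')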
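
The question also asks for the matching pair in the output, which the snippet above does not add. A possible extension, sketched here under the assumption that the frames are built as in the question (the names `left`, `right`, `inside` and `out` are only illustrative), keeps `min_val` and `max_val` from the merge and blanks them where the value falls outside the candidate pair:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"min_val": [0, 5, 10, 15, 20, 25],
                   "max_val": [4, 9, 14, 19, 24, 29]})
df1 = pd.DataFrame({"val": [1, 6, 2, np.nan, 34]})

# merge_asof needs sorted, non-null keys of matching dtype on both sides.
left = df1.reset_index().dropna(subset=["val"]).sort_values("val")
right = df.astype(float).sort_values("min_val")

# Backward search (the default): last row of `right` with min_val <= val.
tmp = pd.merge_asof(left, right, left_on="val", right_on="min_val")

# Restore the original row order and bring back the row dropped for NaN.
tmp = tmp.set_index("index").reindex(df1.index)

# A candidate pair only counts if val does not exceed its max_val.
inside = tmp["val"].le(tmp["max_val"])

out = df1.copy()
out["min_val"] = tmp["min_val"].where(inside)
out["max_val"] = tmp["max_val"].where(inside)
out["nameTag"] = np.where(inside, "within", "not within")
print(out)
```

Rows for NaN and 34 end up with NaN in both bound columns and the tag 'not within', which matches the layout requested in the question.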

Or an IntervalIndex:

    # closed='both' makes min_val and max_val inclusive (the default is closed='right')
    s = pd.Series('within', pd.IntervalIndex.from_arrays(df['min_val'], df['max_val'], closed='both'))
    df1['nameTag'] = s.reindex(df1['val']).fillna('not within').to_numpy()

Output:

        val     nameTag
    0   1.0      within
    1   6.0      within
    2   2.0      within
    3   NaN  not within
    4  34.0  not within
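
If the matching pair is needed as well, the IntervalIndex can also be queried for positions with `get_indexer`. The following is a rough sketch rather than part of the original answer: it assumes the pairs do not overlap, treats both bounds as inclusive via `closed="both"`, and uses illustrative names (`intervals`, `pos`, `found`, `out`).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"min_val": [0, 5, 10, 15, 20, 25],
                   "max_val": [4, 9, 14, 19, 24, 29]})
df1 = pd.DataFrame({"val": [1, 6, 2, np.nan, 34]})

# closed="both" treats min_val and max_val as inclusive bounds.
intervals = pd.IntervalIndex.from_arrays(df["min_val"], df["max_val"], closed="both")

# Position of the pair containing each value, or -1 when no pair contains it.
pos = intervals.get_indexer(df1["val"].to_numpy())

# Guard missing values explicitly instead of relying on how the indexer treats NaN.
found = (pos >= 0) & df1["val"].notna().to_numpy()

out = df1.copy()
# Indexing with -1 would pick the last pair, so misses are masked with NaN.
out["min_val"] = np.where(found, df["min_val"].to_numpy()[pos], np.nan)
out["max_val"] = np.where(found, df["max_val"].to_numpy()[pos], np.nan)
out["nameTag"] = np.where(found, "within", "not within")
print(out)
```

`get_indexer` raises an error when the intervals overlap, so this variant only suits pairs that partition the range cleanly, as they do in the question.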

Posted by huangapple on 2023-05-22 19:59:00
Source: https://go.coder-hub.com/76305949.html