Pandas isin() not working properly with numerical values

Question

I have a pandas DataFrame in which one column (value_1) is all floats and another column (value_2) contains either a list of floats, None, or a single float value. I have ensured that all values are floats.

Ultimately, I want to use Series.isin() to check how many records of value_1 are in value_2, but it is not working for me. When I run the code below:

    df[~df['value_1'].isin(df['value_2'])]

it returns the following, which is not what I expected, since some values in value_1 are clearly in the value_2 lists:

        value_1                       value_2
    0   88870.0                     [88870.0]
    1  150700.0                          None
    2  225000.0          [225000.0, 225000.0]
    3  305000.0  [305606.0, 305000.0, 1067.5]
    4  392000.0                    [392000.0]
    5  198400.0                           396

What am I missing? Please help.
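For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame from the printed output above. The exact construction is an assumption (the question does not show how df was created). It also prints the Python type of each value_2 cell, which helps explain the result: isin() tests each value_1 entry against the value_2 cells as whole objects, and a float is not equal to a list that happens to contain it.

```python
import pandas as pd

# Sample data reconstructed from the question's printed output
# (assumption: the original frame was built some other way).
df = pd.DataFrame({
    "value_1": [88870.0, 150700.0, 225000.0, 305000.0, 392000.0, 198400.0],
    "value_2": [
        [88870.0],
        None,
        [225000.0, 225000.0],
        [305606.0, 305000.0, 1067.5],
        [392000.0],
        396,
    ],
})

# value_2 is an object column: each cell is a list, None, or a scalar.
cell_types = [type(cell).__name__ for cell in df["value_2"]]
print(df["value_2"].dtype)
print(cell_types)
```

Because the cells are mixed Python objects, the membership test has to be done row by row, which is what the answers below do.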

Answer 1

Score: 2

You can use boolean indexing with numpy.isin in a list comprehension, so that each value_1 is tested against the value_2 cell in the same row:

    import numpy as np

    # One boolean per row: is value_1 found in that row's value_2?
    out = df[[bool(np.isin(v1, v2)) for v1, v2 in zip(df['value_1'], df['value_2'])]]

Output:

        value_1                       value_2
    0   88870.0                     [88870.0]
    2  225000.0          [225000.0, 225000.0]
    3  305000.0  [305606.0, 305000.0, 1067.5]
    4  392000.0                    [392000.0]
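Since the question asks how many records match, the same row-wise test can be kept as a boolean Series and summed. This is only a sketch: the sample frame is reconstructed from the question's output, and the names mask, matched and unmatched are illustrative, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question (assumption).
df = pd.DataFrame({
    "value_1": [88870.0, 150700.0, 225000.0, 305000.0, 392000.0, 198400.0],
    "value_2": [
        [88870.0],
        None,
        [225000.0, 225000.0],
        [305606.0, 305000.0, 1067.5],
        [392000.0],
        396,
    ],
})

# Row-wise membership test, stored as a Series aligned with df's index.
mask = pd.Series(
    [bool(np.isin(v1, v2)) for v1, v2 in zip(df["value_1"], df["value_2"])],
    index=df.index,
)

matched = df[mask]      # rows where value_1 appears in value_2
unmatched = df[~mask]   # the rows the question tried to select with ~isin
n_matched = int(mask.sum())
print(n_matched)
```

Keeping the mask as a Series means the same result can be reused for both the count and the filtered frames.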

Answer 2

Score: 1

Use zip with a list comprehension to test whether each list does not contain the float. Rows whose value_2 is not a list are removed by passing False. Then filter with boolean indexing:

    import pandas as pd

    df = pd.DataFrame({'value_1': [88870.0, 150700.0, 392000.0],
                       'value_2': [[88870.0], None, [88870.0, 45.4]]})
    print(df)

Output:

        value_1          value_2
    0   88870.0        [88870.0]
    1  150700.0             None
    2  392000.0  [88870.0, 45.4]

    # True only when value_2 is a list that does not contain value_1
    mask = [a not in b if isinstance(b, list) else False
            for a, b in zip(df['value_1'], df['value_2'])]
    df1 = df[mask]
    print(df1)

Output:

        value_1          value_2
    2  392000.0  [88870.0, 45.4]

If you also need to test scalars, compare non-list cells directly:

    # Lists use a membership test; anything else (None, scalar) uses !=
    mask = [a not in b if isinstance(b, list) else a != b
            for a, b in zip(df['value_1'], df['value_2'])]
    df2 = df[mask]
    print(df2)

Output:

        value_1          value_2
    1  150700.0             None
    2  392000.0  [88870.0, 45.4]
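The three cases (list, None, scalar) can also be spelled out in a small helper, which may be easier to read and extend than the inline conditional. This is a sketch only: value_in_cell is a hypothetical name, and treating None as "not found" is an assumption that matches the output shown above.

```python
import pandas as pd

def value_in_cell(value, cell):
    """Hypothetical helper: True if `value` is found in `cell`."""
    if isinstance(cell, (list, tuple, set)):
        return value in cell       # container: membership test
    if cell is None:
        return False               # assumption: None never matches
    return bool(value == cell)     # scalar: plain equality

# Same sample frame as in the answer above.
df = pd.DataFrame({
    "value_1": [88870.0, 150700.0, 392000.0],
    "value_2": [[88870.0], None, [88870.0, 45.4]],
})

found = [value_in_cell(a, b) for a, b in zip(df["value_1"], df["value_2"])]
not_found = [not f for f in found]
df2 = df[not_found]
print(df2)
```

With this sample data the helper selects the same rows as the inline mask with a != b.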

Performance: pure Python should be faster here, but it is best to test on your real data:

    # 30k rows (3 sample rows repeated N times)
    N = 10000
    df = pd.DataFrame({'value_1': [88870.0, 150700.0, 392000.0] * N,
                       'value_2': [[88870.0], None, [88870.0, 45.4]] * N})
    print(df)

    In [51]: %timeit df[[a not in b if isinstance(b, list) else a != b for a, b in zip(df['value_1'], df['value_2'])]]
    18.8 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [52]: %timeit df[[not bool(np.isin(v1, v2)) for v1, v2 in zip(df['value_1'], df['value_2'])]]
    419 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
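The %timeit magic only works in IPython/Jupyter. Outside of it, a rough equivalent can be sketched with the standard-library timeit module. The function names below are illustrative, N is deliberately smaller than in the answer to keep the run short, and the timings will differ by machine and library version, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

N = 1_000  # assumption: smaller than the answer's N, just to keep this quick
df = pd.DataFrame({
    "value_1": [88870.0, 150700.0, 392000.0] * N,
    "value_2": [[88870.0], None, [88870.0, 45.4]] * N,
})

def pure_python_mask():
    # Membership test for lists, plain inequality for everything else.
    return [a not in b if isinstance(b, list) else a != b
            for a, b in zip(df["value_1"], df["value_2"])]

def numpy_mask():
    # Same logic expressed with numpy.isin, called once per row.
    return [not bool(np.isin(v1, v2))
            for v1, v2 in zip(df["value_1"], df["value_2"])]

# Sanity check: both approaches should select the same rows on this data.
same_result = pure_python_mask() == numpy_mask()

t_py = timeit.timeit(pure_python_mask, number=5)
t_np = timeit.timeit(numpy_mask, number=5)
print(f"pure Python: {t_py:.4f} s for 5 runs")
print(f"numpy.isin:  {t_np:.4f} s for 5 runs")
```

Checking that both masks agree before timing them avoids comparing the speed of two functions that do different things.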

huangapple
  • Published on February 8, 2023 at 15:13:49
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75382434.html