Pandas isin()在数字值上不正常工作。

huangapple go评论65阅读模式
英文:

Pandas isin() not working properly with numerical values

问题

我有一个pandas数据帧,其中一列全是浮点数,另一列包含浮点数列表、None或只是浮点数值。我已确保所有值都是浮点数。

最终,我想使用pd.isin()来检查value_2中有多少记录的value_1,但对我来说没有效果。当我运行下面的代码时:

df[~df['value_1'].isin(df['value_2'])]

下面是它返回的结果,这不是预期的,因为显然value_1中的某些值在value_2列表中:

   value_1     value_2
0     88870.0    [88870.0]	
1.    150700.0    None
2     225000.0   [225000.0, 225000.0]
3.    305000.0	 [305606.0, 305000.0, 1067.5]
4     392000.0   [392000.0]	
5     198400.0	  396

我漏掉了什么?请帮助。

英文:

I have a pandas dataframe where one column is all float, another column either contains list of floats, None, or just float values. I have ensured all values are floats.

Ultimately, I want to use pd.isin() to check how many records of value_1 are in value_2 but it is not working for me. When I ran this code below:

df[~df['value_1'].isin(df['value_2'])]

This below is what it returned which is not expected since clearly some values in value_1 are in the value_2 lists.:

0     88870.0    [88870.0]	
1.    150700.0    None
2     225000.0   [225000.0, 225000.0]
3.    305000.0	 [305606.0, 305000.0, 1067.5]
4     392000.0   [392000.0]	
5     198400.0	  396

What am I missing? Please help.

答案1

得分: 2

您可以在列表推导式中使用numpy.isin进行布尔索引

import numpy as np

out = df[[bool(np.isin(v1, v2)) for v1, v2 in zip(df['value_1'], df['value_2'])]]

输出:

    value_1                       value_2
0   88870.0                     [88870.0]
2  225000.0          [225000.0, 225000.0]
3  305000.0  [305606.0, 305000.0, 1067.5]
4  392000.0                    [392000.0]
英文:

You can use boolean indexing with numpy.isin in a list comprehension:

import numpy as np

out = df[[bool(np.isin(v1, v2)) for v1, v2 in zip(df['value_1'], df['value_2'])]]

Output:

    value_1                       value_2
0   88870.0                     [88870.0]
2  225000.0          [225000.0, 225000.0]
3  305000.0  [305606.0, 305000.0, 1067.5]
4  392000.0                    [392000.0]

答案2

得分: 1

使用zip和列表推导来测试列表是否不包含浮点数,如果不包含浮点数,则通过传递False来删除行,使用布尔索引进行筛选:

df = pd.DataFrame({'value_1':[88870.0,150700.0,392000.0],
                   'value_2':[[88870.0],None, [88870.0,45.4]]})

print(df)

输出:

    value_1          value_2
0   88870.0        [88870.0]
1  150700.0             None
2  392000.0  [88870.0, 45.4]

针对测试标量值的需求:

mask = [a not in b if isinstance(b, list) else a != b 
        for a, b in zip(df['value_1'], df['value_2'])]
df2 = df[mask]
print(df2)

输出:

    value_1          value_2
1  150700.0             None
2  392000.0  [88870.0, 45.4]

性能方面:纯Python应该更快,最好在真实数据中进行测试:

# 20k行
N = 10000
df = pd.DataFrame({'value_1':[88870.0,150700.0,392000.0] * N,
                   'value_2':[[88870.0],None, [88870.0,45.4]] * N})

# 在真实数据中进行性能测试
%timeit df[[a not in b if isinstance(b, list) else a != b  for a, b in zip(df['value_1'], df['value_2'])]]
%timeit df[[not bool(np.isin(v1, v2)) for v1, v2 in zip(df['value_1'], df['value_2'])]]

请注意,这是您提供的代码的翻译。

英文:

Use zip with list comprehension for test if lists not contains floats, if not lists are removed rows by passing False, filter in boolean indexing:

df = pd.DataFrame({'value_1':[88870.0,150700.0,392000.0],
                   'value_2':[[88870.0],None, [88870.0,45.4]]})

print (df)
    value_1          value_2
0   88870.0        [88870.0]
1  150700.0             None
2  392000.0  [88870.0, 45.4]

mask = [a not in b if isinstance(b, list) else False 
        for a, b in zip(df['value_1'], df['value_2'])]
df1 = df[mask]
print (df1)
    value_1          value_2
2  392000.0  [88870.0, 45.4]

If need also test scalars:

mask = [a not in b if isinstance(b, list) else a != b 
        for a, b in zip(df['value_1'], df['value_2'])]
df2 = df[mask]
print (df2)
    value_1          value_2
1  150700.0             None
2  392000.0  [88870.0, 45.4]

Performance: Pure python should be faster, best test in real data:

#20k rows
N = 10000
df = pd.DataFrame({'value_1':[88870.0,150700.0,392000.0] * N,
                   'value_2':[[88870.0],None, [88870.0,45.4]] * N})

print (df)


In [51]: %timeit df[[a not in b if isinstance(b, list) else a != b  for a, b in zip(df['value_1'], df['value_2'])]]
18.8 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [52]: %timeit df[[not bool(np.isin(v1, v2)) for v1, v2 in zip(df['value_1'], df['value_2'])]]
419 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

huangapple
  • 本文由 发表于 2023年2月8日 15:13:49
  • 转载请务必保留本文链接:https://go.coder-hub.com/75382434.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定