英文:
Pandas isin() not working properly with numerical values
问题
我有一个pandas数据帧,其中一列全是浮点数,另一列包含浮点数列表、None或只是浮点数值。我已确保所有值都是浮点数。
最终,我想使用pd.isin()
来检查value_2
中有多少记录的value_1
,但对我来说没有效果。当我运行下面的代码时:
df[~df['value_1'].isin(df['value_2'])]
下面是它返回的结果,这不是预期的,因为显然value_1
中的某些值在value_2
列表中:
value_1 value_2
0 88870.0 [88870.0]
1. 150700.0 None
2 225000.0 [225000.0, 225000.0]
3. 305000.0 [305606.0, 305000.0, 1067.5]
4 392000.0 [392000.0]
5 198400.0 396
我漏掉了什么?请帮助。
英文:
I have a pandas dataframe where one column is all float, another column either contains list of floats, None, or just float values. I have ensured all values are floats.
Ultimately, I want to use pd.isin()
to check how many records of value_1
are in value_2
but it is not working for me. When I ran this code below:
df[~df['value_1'].isin(df['value_2'])]
This below is what it returned which is not expected since clearly some values in value_1
are in the value_2
lists.:
0 88870.0 [88870.0]
1. 150700.0 None
2 225000.0 [225000.0, 225000.0]
3. 305000.0 [305606.0, 305000.0, 1067.5]
4 392000.0 [392000.0]
5 198400.0 396
What am I missing? Please help.
答案1
得分: 2
您可以在列表推导式中使用numpy.isin
进行布尔索引:
import numpy as np
out = df[[bool(np.isin(v1, v2)) for v1, v2 in zip(df['value_1'], df['value_2'])]]
输出:
value_1 value_2
0 88870.0 [88870.0]
2 225000.0 [225000.0, 225000.0]
3 305000.0 [305606.0, 305000.0, 1067.5]
4 392000.0 [392000.0]
英文:
You can use boolean indexing with numpy.isin
in a list comprehension:
import numpy as np
out = df[[bool(np.isin(v1, v2)) for v1, v2 in zip(df['value_1'], df['value_2'])]]
Output:
value_1 value_2
0 88870.0 [88870.0]
2 225000.0 [225000.0, 225000.0]
3 305000.0 [305606.0, 305000.0, 1067.5]
4 392000.0 [392000.0]
答案2
得分: 1
使用zip
和列表推导来测试列表是否不包含浮点数,如果不包含浮点数,则通过传递False
来删除行,使用布尔索引
进行筛选:
df = pd.DataFrame({'value_1':[88870.0,150700.0,392000.0],
'value_2':[[88870.0],None, [88870.0,45.4]]})
print(df)
输出:
value_1 value_2
0 88870.0 [88870.0]
1 150700.0 None
2 392000.0 [88870.0, 45.4]
针对测试标量值的需求:
mask = [a not in b if isinstance(b, list) else a != b
for a, b in zip(df['value_1'], df['value_2'])]
df2 = df[mask]
print(df2)
输出:
value_1 value_2
1 150700.0 None
2 392000.0 [88870.0, 45.4]
性能方面:纯Python应该更快,最好在真实数据中进行测试:
# 20k行
N = 10000
df = pd.DataFrame({'value_1':[88870.0,150700.0,392000.0] * N,
'value_2':[[88870.0],None, [88870.0,45.4]] * N})
# 在真实数据中进行性能测试
%timeit df[[a not in b if isinstance(b, list) else a != b for a, b in zip(df['value_1'], df['value_2'])]]
%timeit df[[not bool(np.isin(v1, v2)) for v1, v2 in zip(df['value_1'], df['value_2'])]]
请注意,这是您提供的代码的翻译。
英文:
Use zip
with list comprehension for test if lists not contains floats, if not lists are removed rows by passing False
, filter in boolean indexing
:
df = pd.DataFrame({'value_1':[88870.0,150700.0,392000.0],
'value_2':[[88870.0],None, [88870.0,45.4]]})
print (df)
value_1 value_2
0 88870.0 [88870.0]
1 150700.0 None
2 392000.0 [88870.0, 45.4]
mask = [a not in b if isinstance(b, list) else False
for a, b in zip(df['value_1'], df['value_2'])]
df1 = df[mask]
print (df1)
value_1 value_2
2 392000.0 [88870.0, 45.4]
If need also test scalars:
mask = [a not in b if isinstance(b, list) else a != b
for a, b in zip(df['value_1'], df['value_2'])]
df2 = df[mask]
print (df2)
value_1 value_2
1 150700.0 None
2 392000.0 [88870.0, 45.4]
Performance: Pure python should be faster, best test in real data:
#20k rows
N = 10000
df = pd.DataFrame({'value_1':[88870.0,150700.0,392000.0] * N,
'value_2':[[88870.0],None, [88870.0,45.4]] * N})
print (df)
In [51]: %timeit df[[a not in b if isinstance(b, list) else a != b for a, b in zip(df['value_1'], df['value_2'])]]
18.8 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %timeit df[[not bool(np.isin(v1, v2)) for v1, v2 in zip(df['value_1'], df['value_2'])]]
419 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论