February 8, 2023, 15:13:49

Pandas isin() not working properly with numerical values

Question

I have a pandas DataFrame in which one column, value_1, is all floats, and another column, value_2, contains either a list of floats, None, or a single float value. I have ensured that all values are floats.

Ultimately, I want to use isin() to check how many values of value_1 are in value_2, but it is not working for me. When I run the code below:

df[~df['value_1'].isin(df['value_2'])]

This is what it returned, which is not what I expected, since some values in value_1 are clearly in the value_2 lists:

    value_1                       value_2
0   88870.0                     [88870.0]
1  150700.0                          None
2  225000.0          [225000.0, 225000.0]
3  305000.0  [305606.0, 305000.0, 1067.5]
4  392000.0                    [392000.0]
5  198400.0                           396

What am I missing? Please help.

Answer 1

Score: 2

You can use boolean indexing with numpy.isin in a list comprehension:

import numpy as np

# keep the rows where value_1 is found in that row's value_2
out = df[[bool(np.isin(v1, v2)) for v1, v2 in zip(df['value_1'], df['value_2'])]]

Output:

    value_1                       value_2
0   88870.0                     [88870.0]
2  225000.0          [225000.0, 225000.0]
3  305000.0  [305606.0, 305000.0, 1067.5]
4  392000.0                    [392000.0]
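
For context on why the original isin() call matches nothing: Series.isin(values) tests each element of value_1 against the cells of value_2 as a whole, and those cells are lists, None, or a lone number, never the floats inside the lists. Below is a plain-Python analogy of that behaviour. It is a sketch of the semantics only, not of pandas internals, and the variables are ordinary lists that mirror the first rows of the question's columns:

```python
# Plain-Python analogy; value_1 and value_2 mirror the question's columns.
value_1 = [88870.0, 150700.0, 225000.0]
value_2 = [[88870.0], None, [225000.0, 225000.0]]

# Column-wide membership: each float is compared with whole cells of value_2.
# A float never equals a list or None, so nothing matches.
column_wide = [v in value_2 for v in value_1]

# Row-wise membership: look inside each row's own list instead.
row_wise = [isinstance(cell, list) and v in cell
            for v, cell in zip(value_1, value_2)]

print(column_wide)  # [False, False, False]
print(row_wise)     # [True, False, True]
```

This is why the row-wise comparison with zip is needed: the check has to look inside each row's own cell.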

Answer 2

Score: 1

Use zip with a list comprehension to test whether each list does not contain the float. Rows whose value_2 is not a list are dropped by passing False. Then filter with boolean indexing:

import pandas as pd

df = pd.DataFrame({'value_1': [88870.0, 150700.0, 392000.0],
                   'value_2': [[88870.0], None, [88870.0, 45.4]]})
print(df)

Output:

    value_1          value_2
0   88870.0        [88870.0]
1  150700.0             None
2  392000.0  [88870.0, 45.4]

# True where value_2 is a list that does not contain value_1;
# non-list cells (None, scalars) give False, so those rows are dropped
mask = [a not in b if isinstance(b, list) else False
        for a, b in zip(df['value_1'], df['value_2'])]
df1 = df[mask]
print(df1)

Output:

    value_1          value_2
2  392000.0  [88870.0, 45.4]

If you also need to test scalars:

# lists: True when value_1 is not in the list;
# None or scalars: True when value_1 differs from value_2
mask = [a not in b if isinstance(b, list) else a != b
        for a, b in zip(df['value_1'], df['value_2'])]
df2 = df[mask]
print(df2)

Output:

    value_1          value_2
1  150700.0             None
2  392000.0  [88870.0, 45.4]
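
The per-row test for lists and scalars can also be wrapped in a small helper so the two cases are spelled out. This is only a sketch, not part of the original answer: the names is_missing_from and build_mask are hypothetical, and the example runs on plain lists so it does not need pandas.

```python
def is_missing_from(value, cell):
    """Return True when `value` is not found in `cell` (hypothetical helper).

    `cell` may be a list of floats, a single number, or None.
    """
    if isinstance(cell, list):
        return value not in cell
    # None or a scalar: fall back to a plain inequality test
    return value != cell


def build_mask(values, cells):
    """Build a row-wise boolean mask from two equally long iterables."""
    return [is_missing_from(v, c) for v, c in zip(values, cells)]


# Plain lists standing in for the two DataFrame columns
value_1 = [88870.0, 150700.0, 392000.0]
value_2 = [[88870.0], None, [88870.0, 45.4]]
mask = build_mask(value_1, value_2)
print(mask)  # [False, True, True]
```

With a DataFrame like the one in this answer, df[build_mask(df['value_1'], df['value_2'])] should select the same rows as the list comprehension, since iterating a Series yields its values.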

Performance: pure Python should be faster here, but it is best to test on your real data:

# 30k rows (3 rows repeated N times)
N = 10000
df = pd.DataFrame({'value_1': [88870.0, 150700.0, 392000.0] * N,
                   'value_2': [[88870.0], None, [88870.0, 45.4]] * N})
print(df)

In [51]: %timeit df[[a not in b if isinstance(b, list) else a != b for a, b in zip(df['value_1'], df['value_2'])]]
18.8 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [52]: %timeit df[[not bool(np.isin(v1, v2)) for v1, v2 in zip(df['value_1'], df['value_2'])]]
419 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
