Comparing values of columns with a list in a pandas dataframe

Question


I have two columns in my dataframe, one with a number and one with a list of numbers. I want to create a third column that is True or False depending on whether the number in the first column appears in the first two elements of the corresponding list in the second column.

Below is what I tried:

import pandas as pd
import numpy as np
df = pd.DataFrame({'number': [1, 2, 3], 'list_of_numbers': [[1, 3, 2, 5], [6, 7, 8, 2, 10], [13, 12, 13, 14, 3]]})
df['check'] = np.isin(df['number'], [x[0:2] for x in df['list_of_numbers']])

I was expecting an output of [True, False, False], but what I got is [True, False, True]. My guess is that the comparison is always done against the first value in list_of_numbers, which is [1, 3, 2, 5], and that this produces the output above.

What am I doing wrong? Thanks in advance.

Answer 1

Score: 2


You need to use a loop here:

df['check'] = [a in b[0:2] for a,b in zip(df['number'], df['list_of_numbers'])]

Output:

   number      list_of_numbers  check
0       1         [1, 3, 2, 5]   True
1       2     [6, 7, 8, 2, 10]  False
2       3  [13, 12, 13, 14, 3]  False
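
If a vectorized form is preferred over the list comprehension, the same per-row check can be sketched with NumPy broadcasting. This is only an illustration, and it assumes every list has at least two elements so that the slices stack into a regular array. The names `first_two` and `numbers` are made up for this example and do not come from the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'number': [1, 2, 3],
                   'list_of_numbers': [[1, 3, 2, 5],
                                       [6, 7, 8, 2, 10],
                                       [13, 12, 13, 14, 3]]})

# Assumption: every list has at least two elements, so the slices
# stack into a regular (n_rows, 2) array.
first_two = np.array([x[:2] for x in df['list_of_numbers']])
numbers = df['number'].to_numpy()

# Compare each number only with the two values from its own row,
# then reduce across the two columns.
df['check'] = (first_two == numbers[:, None]).any(axis=1)
print(df['check'].tolist())
```

For small frames the list comprehension above is likely just as fast and easier to read. The broadcasting version mainly helps when the slices have a fixed width.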

Why your approach failed

np.isin flattens the test_elements array before use, so each number is not tested against its own list but against the concatenation of all the sliced lists.

Demonstration:

import pandas as pd
import numpy as np
df = pd.DataFrame({'number': [1, 1, 3], # we changed the 2 to a 1
                   'list_of_numbers': [[1, 7, 2, 5],  # we removed the 3
                                       [6, 7, 8, 2, 10],
                                       [13, 12, 13, 14, 3]]})

df['check'] = np.isin(df['number'], [x[0:2] for x in df['list_of_numbers']])
print(df)
   number      list_of_numbers  check
0       1         [1, 7, 2, 5]   True
1       1     [6, 7, 8, 2, 10]   True # True because the first list contains a 1
2       3  [13, 12, 13, 14, 3]  False # now False because the 3 in the first list is gone
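
The flattening in the demonstration above can also be made visible directly. The following is a minimal sketch that uses the sliced values from that demonstration. The variable names `slices`, `pool`, `flat_result` and `nested_result` are illustrative only.

```python
import numpy as np

# First two elements of each list from the demonstration.
slices = [[1, 7], [6, 7], [13, 12]]
numbers = np.array([1, 1, 3])

# Flatten the nested slices by hand into one pool of values.
pool = np.asarray(slices).ravel()
print(pool.tolist())

# Passing the nested slices gives the same result as passing the
# flat pool, which shows that row boundaries are ignored.
flat_result = np.isin(numbers, pool)
nested_result = np.isin(numbers, slices)
print(flat_result.tolist())
print(np.array_equal(flat_result, nested_result))
```

Because the row structure is lost, np.isin cannot express "compare row i only with list i", which is why a per-row loop is needed.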

huangapple
  • Posted on March 9, 2023, 22:35:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75686022.html