Pandas compare values of multiple columns

Question

I want to find out whether any of the values in columns mark1, mark2, mark3, mark4 and mark5 are the same within each row of the DataFrame below (a comparison across columns), and store the result as True or False.

    import pandas as pd
    df = pd.DataFrame(data=[[7, 2, 3, 7, 7], [3, 4, 3, 2, 7], [1, 6, 5, 2, 7], [5, 5, 6, 3, 1]],
                      columns=["mark1", "mark2", "mark3", "mark4", "mark5"])

Ideal output:

       mark1  mark2  mark3  mark4  mark5  result
    0      7      2      3      7      7    True
    1      3      4      3      2      7    True
    2      1      6      5      2      7   False
    3      5      5      6      3      1    True

So I came up with a function that uses a nested for loop to compare the values in each row, but it does not work:
AttributeError: 'Series' object has no attribute 'columns'
What's the correct way? I want to avoid nested for loops by all means.

    def compare_col(df):
        check = 0
        for i in range(len(df.columns.tolist())+1):
            for j in range(1, len(df.columns.tolist())+1):
                if df.iloc[i, i] == df.iloc[j, i]:
                    check += 1
        if check >= 1:
            return True
        else:
            return False

    # With axis=1, apply passes each row as a Series, so df.columns above raises the AttributeError
    df['result'] = df.apply(lambda x: compare_col(x[['mark1', 'mark2', 'mark3', 'mark4', 'mark5']]), axis=1)

Answer 1

Score: 2

If the number of unique items in a row (a Series) differs from its total size, the row contains duplicated values.

    df['result'] = df.apply(lambda x: x.unique().size != x.size, axis=1)

       mark1  mark2  mark3  mark4  mark5  result
    0      7      2      3      7      7    True
    1      3      4      3      2      7    True
    2      1      6      5      2      7   False
    3      5      5      6      3      1    True
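
For reference, here is a self-contained sketch of the same idea applied to the sample data. Selecting the five mark columns explicitly is an assumption added here (the name `mark_cols` is illustrative), so that re-running the line does not pull an existing `result` column into the comparison.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[7, 2, 3, 7, 7], [3, 4, 3, 2, 7], [1, 6, 5, 2, 7], [5, 5, 6, 3, 1]],
    columns=["mark1", "mark2", "mark3", "mark4", "mark5"],
)

# Illustrative name: restrict the check to the mark columns only.
mark_cols = ["mark1", "mark2", "mark3", "mark4", "mark5"]

# A row has duplicates when it holds fewer unique values than entries.
df["result"] = df[mark_cols].apply(
    lambda row: row.unique().size != row.size, axis=1
)

print(df["result"].tolist())  # [True, True, False, True]
```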

Answer 2

Score: 1

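    # .iloc makes the integer lookups positional; plain row[col] with an integer is treated as a label on newer pandas
    df['result'] = df.apply(lambda row: any(row.iloc[col] == row.iloc[col2] for col in range(len(df.columns)) for col2 in range(col+1, len(df.columns))), axis=1)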

The problem I faced here was that you can't simply compare row[col] == row[col+1], because that only checks adjacent columns; you need two loops to cover every possible pair of columns.
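
The two loops visit every pair of columns exactly once. A minimal sketch of the same pairwise check, written with `itertools.combinations` from the standard library, might look like the following. This is an alternative spelling rather than the answerer's code, and `has_equal_pair` is an illustrative name.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame(
    data=[[7, 2, 3, 7, 7], [3, 4, 3, 2, 7], [1, 6, 5, 2, 7], [5, 5, 6, 3, 1]],
    columns=["mark1", "mark2", "mark3", "mark4", "mark5"],
)


def has_equal_pair(row):
    """Return True if any two entries of the row are equal."""
    # combinations(row, 2) yields each unordered pair of values exactly once.
    return any(a == b for a, b in combinations(row, 2))


# The right-hand side is evaluated before 'result' is added to df.
df["result"] = df.apply(has_equal_pair, axis=1)

print(df["result"].tolist())  # [True, True, False, True]
```

This still does a quadratic number of comparisons per row, so it is a readability change, not a speed-up.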

Answer 3

Score: 0

There is no need for apply or a loop: compare the output of nunique with the number of columns:

    df['result'] = df.nunique(axis=1).ne(df.shape[1])

Output:

       mark1  mark2  mark3  mark4  mark5  result
    0      7      2      3      7      7    True
    1      3      4      3      2      7    True
    2      1      6      5      2      7   False
    3      5      5      6      3      1    True
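
To make the intermediate step visible, the sketch below prints the per-row unique counts before comparing them with the number of columns. Capturing the column list up front (`mark_cols`, an illustrative name) is an assumption added here so that the count does not change once `result` has been added.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[7, 2, 3, 7, 7], [3, 4, 3, 2, 7], [1, 6, 5, 2, 7], [5, 5, 6, 3, 1]],
    columns=["mark1", "mark2", "mark3", "mark4", "mark5"],
)

# Illustrative name: the columns to compare, captured before 'result' exists.
mark_cols = list(df.columns)

# Number of distinct values in each row.
unique_counts = df[mark_cols].nunique(axis=1)
print(unique_counts.tolist())  # [3, 4, 5, 4]

# Fewer distinct values than columns means at least one repeated value.
df["result"] = unique_counts.ne(len(mark_cols))
print(df["result"].tolist())  # [True, True, False, True]
```

Note that `nunique` skips missing values by default (`dropna=True`), so a row containing NaN would report fewer unique values than columns. The sample data has no missing values, so this does not affect the result here.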

If you want a more efficient method, and assuming a reasonable number of columns (fewer than a thousand) and numeric data, you could use numpy to sort the values in each row, compute the differences between neighbouring values, and check whether any difference is 0:

    import numpy as np
    df['result'] = (np.diff(np.sort(df), axis=1) == 0).any(axis=1)

Output:

       mark1  mark2  mark3  mark4  mark5  result
    0      7      2      3      7      7    True
    1      3      4      3      2      7    True
    2      1      6      5      2      7   False
    3      5      5      6      3      1    True
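
The sort-then-diff idea is easier to follow with the intermediate arrays written out. The sketch below splits the one-liner into steps on the sample data; the variable names (`sorted_rows`, `gaps`, `has_dup`) are illustrative and not part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=[[7, 2, 3, 7, 7], [3, 4, 3, 2, 7], [1, 6, 5, 2, 7], [5, 5, 6, 3, 1]],
    columns=["mark1", "mark2", "mark3", "mark4", "mark5"],
)

# Sort each row so that equal values end up next to each other.
sorted_rows = np.sort(df.to_numpy(), axis=1)
print(sorted_rows)
# [[2 3 7 7 7]
#  [2 3 3 4 7]
#  [1 2 5 6 7]
#  [1 3 5 5 6]]

# Difference between each value and its left neighbour within a row.
gaps = np.diff(sorted_rows, axis=1)
print(gaps)
# [[1 4 0 0]
#  [1 0 1 3]
#  [1 3 1 1]
#  [2 2 0 1]]

# A gap of 0 means two neighbouring sorted values are equal.
has_dup = (gaps == 0).any(axis=1)
print(has_dup)  # [ True  True False  True]

df["result"] = has_dup
```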

huangapple
  • Posted on June 29, 2023, 00:12:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76574991.html