英文:
Pandas compare values of multiple columns
问题
import pandas as pd
df = pd.DataFrame(data=[[7, 2, 3, 7, 7], [3, 4, 3, 2, 7], [1, 6, 5, 2, 7], [5, 5, 6, 3, 1]],
columns=["mark1", "mark2", "mark3", "mark4", "mark5"])
def compare_col(column):
return len(column) != len(set(column))
df['result'] = df.apply(lambda x: compare_col(x[['mark1', 'mark2', 'mark3', 'mark4', 'mark5']]), axis=1)
Ideal output:
mark1 mark2 mark3 mark4 mark5 result
0 7 2 3 7 7 True
1 3 4 3 2 7 True
2 1 6 5 2 7 False
3 5 5 6 3 1 True
I've corrected the code to achieve your desired output without using a nested for loop. The compare_col
function now checks if there are any duplicate values within a column and returns True if duplicates are found, and False otherwise.
英文:
I want to find out if any of the value in columns mark1, mark2, mark3, mark4 and mark5 are the same, column-wise comparison from a dataframe below, and list result as True or False
import pandas as pd
df = pd.DataFrame(data=[[7, 2, 3, 7, 7], [3, 4, 3, 2, 7], [1, 6, 5, 2, 7], [5, 5, 6, 3, 1]],
columns=["mark1", "mark2", 'mark3', 'mark4', 'mark5'])
Ideal output:
mark1 mark2 mark3 mark4 mark5 result
0 7 2 3 7 7 True
1 3 4 3 2 7 True
2 1 6 5 2 7 False
3 5 5 6 3 1 True
So I came up with a func using nested forloop to compare each value in a column, does not work.
AttributeError: 'Series' object has no attribute 'columns'
What's the correct way? Avoid nested forloop by all means.
def compare_col(df):
check = 0
for i in range(len(df.columns.tolist())+1):
for j in range(1, len(df.columns.tolist())+1):
if df.iloc[i, i] == df.iloc[j, i]:
check += 1
if check >= 1:
return True
else:
return False
df['result'] = df.apply(lambda x: compare_col(x[['mark1', 'mark2', 'mark3', 'mark4', 'mark5]]), axis=1)
</details>
# 答案1
**得分**: 2
两者的唯一项数与总大小之间的差异指示存在重复值。
```python
df['result'] = df.apply(lambda x: x.unique().size != x.size, axis=1)
mark1 mark2 mark3 mark4 mark5 result
0 7 2 3 7 7 True
1 3 4 3 2 7 True
2 1 6 5 2 7 False
3 5 5 6 3 1 True
英文:
The difference between the number of unique items of a series and its total size points to a presence of duplicated values.
df['result'] = df.apply(lambda x: x.unique().size != x.size, axis=1)
mark1 mark2 mark3 mark4 mark5 result
0 7 2 3 7 7 True
1 3 4 3 2 7 True
2 1 6 5 2 7 False
3 5 5 6 3 1 True
答案2
得分: 1
df['result'] = df.apply(lambda row: any(row[col] == row[col2] for col in range(len(df.columns)) for col2 in range(col+1, len(df.columns))), axis=1)
我在这里遇到的问题是,你不能简单地执行row[col] == row[col+1]
,因为那只会检查后续的列,所以你需要两个循环来检查所有可能的值匹配。
英文:
df['result'] = df.apply(lambda row: any(row[col] == row[col2] for col in range(len(df.columns)) for col2 in range(col+1, len(df.columns))), axis=1)
The problem I faced here was that you simply cant do row[col] == row[col+1]
as that will check only subsequent ones, so you need 2 loops to check all possible value matches.
答案3
得分: 0
不需要使用 apply
或循环,将 nunique
的输出与列数进行比较:
df['result'] = df.nunique(axis=1).ne(df.shape[1])
输出:
mark1 mark2 mark3 mark4 mark5 result
0 7 2 3 7 7 True
1 3 4 3 2 7 True
2 1 6 5 2 7 False
3 5 5 6 3 1 True
如果您想要更高效的方法,假设列数合理(少于一千列)并且数值也合理,您可以使用 [tag:numpy] 来对值进行 sort
,计算 diff
,并检查是否有任何值等于 0
:
import numpy as np
df['result'] = (np.diff(np.sort(df), axis=1) == 0).any(axis=1)
输出:
mark1 mark2 mark3 mark4 mark5 result
0 7 2 3 7 7 True
1 3 4 3 2 7 True
2 1 6 5 2 7 False
3 5 5 6 3 1 True
英文:
No need to use apply
or a loop, compare the output of nunique
to the number of columns:
df['result'] = df.nunique(axis=1).ne(df.shape[1])
Output:
mark1 mark2 mark3 mark4 mark5 result
0 7 2 3 7 7 True
1 3 4 3 2 7 True
2 1 6 5 2 7 False
3 5 5 6 3 1 True
If you want a more efficient method and assuming a reasonable number of columns (less than a thousand) and numbers, you could use [tag:numpy] to sort
the values, compute the diff
and check whether any value is 0
:
import numpy as np
df['result'] = (np.diff(np.sort(df), axis=1)==0).any(axis=1)
Output:
mark1 mark2 mark3 mark4 mark5 result
0 7 2 3 7 7 True
1 3 4 3 2 7 True
2 1 6 5 2 7 False
3 5 5 6 3 1 True
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论