2023年6月29日 00:12:48go评论118阅读模式

英文:

Pandas compare values of multiple columns

问题

import pandas as pd
df = pd.DataFrame(data=[[7, 2, 3, 7, 7], [3, 4, 3, 2, 7], [1, 6, 5, 2, 7], [5, 5, 6, 3, 1]],
                  columns=["mark1", "mark2", "mark3", "mark4", "mark5"])
def compare_col(column):
    return len(column) != len(set(column))
df['result'] = df.apply(lambda x: compare_col(x[['mark1', 'mark2', 'mark3', 'mark4', 'mark5']]), axis=1)

Ideal output:

   mark1  mark2  mark3  mark4  mark5  result
0      7      2      3      7      7    True
1      3      4      3      2      7    True
2      1      6      5      2      7   False
3      5      5      6      3      1    True

I've corrected the code to achieve your desired output without using a nested for loop. The compare_col function now checks if there are any duplicate values within a column and returns True if duplicates are found, and False otherwise.

英文:

I want to find out if any of the value in columns mark1, mark2, mark3, mark4 and mark5 are the same, column-wise comparison from a dataframe below, and list result as True or False

import pandas as pd
df = pd.DataFrame(data=[[7, 2, 3, 7, 7], [3, 4, 3, 2, 7], [1, 6, 5, 2, 7], [5, 5, 6, 3, 1]],
                  columns=[&quot;mark1&quot;, &quot;mark2&quot;, &#39;mark3&#39;, &#39;mark4&#39;, &#39;mark5&#39;])

Ideal output:

   mark1  mark2  mark3  mark4  mark5          result
0      7      2      3      7      7            True
1      3      4      3      2      7            True
2      1      6      5      2      7           False
3      5      5      6      3      1            True

So I came up with a func using nested forloop to compare each value in a column, does not work.
AttributeError: 'Series' object has no attribute 'columns'
What's the correct way? Avoid nested forloop by all means.

def compare_col(df):
	check = 0
	for i in range(len(df.columns.tolist())+1):
		for j in range(1, len(df.columns.tolist())+1):
			if df.iloc[i, i] == df.iloc[j, i]:
				check += 1
	if check &gt;= 1:
		return True
	else:
		return False
df[&#39;result&#39;] = df.apply(lambda x: compare_col(x[[&#39;mark1&#39;, &#39;mark2&#39;, &#39;mark3&#39;, &#39;mark4&#39;, &#39;mark5]]), axis=1)
</details>
# 答案1
**得分**: 2
两者的唯一项数与总大小之间的差异指示存在重复值。
```python
df['result'] = df.apply(lambda x: x.unique().size != x.size, axis=1)

   mark1  mark2  mark3  mark4  mark5  result
0      7      2      3      7      7    True
1      3      4      3      2      7    True
2      1      6      5      2      7   False
3      5      5      6      3      1    True

英文:

The difference between the number of unique items of a series and its total size points to a presence of duplicated values.

df[&#39;result&#39;] = df.apply(lambda x: x.unique().size != x.size, axis=1)

   mark1  mark2  mark3  mark4  mark5  result
0      7      2      3      7      7    True
1      3      4      3      2      7    True
2      1      6      5      2      7   False
3      5      5      6      3      1    True

答案2

得分: 1

df['result'] = df.apply(lambda row: any(row[col] == row[col2] for col in range(len(df.columns)) for col2 in range(col+1, len(df.columns))), axis=1)

我在这里遇到的问题是，你不能简单地执行row[col] == row[col+1]，因为那只会检查后续的列，所以你需要两个循环来检查所有可能的值匹配。

英文:

df[&#39;result&#39;] = df.apply(lambda row: any(row[col] == row[col2] for col in range(len(df.columns)) for col2 in range(col+1, len(df.columns))), axis=1)

The problem I faced here was that you simply cant do row[col] == row[col+1] as that will check only subsequent ones, so you need 2 loops to check all possible value matches.

答案3

得分: 0

不需要使用 apply 或循环，将 nunique 的输出与列数进行比较：

df['result'] = df.nunique(axis=1).ne(df.shape[1])

输出：

   mark1  mark2  mark3  mark4  mark5  result
0      7      2      3      7      7    True
1      3      4      3      2      7    True
2      1      6      5      2      7   False
3      5      5      6      3      1    True

如果您想要更高效的方法，假设列数合理（少于一千列）并且数值也合理，您可以使用 [tag:numpy] 来对值进行 sort，计算 diff，并检查是否有任何值等于 0：

import numpy as np
df['result'] = (np.diff(np.sort(df), axis=1) == 0).any(axis=1)

输出：

   mark1  mark2  mark3  mark4  mark5  result
0      7      2      3      7      7    True
1      3      4      3      2      7    True
2      1      6      5      2      7   False
3      5      5      6      3      1    True

英文:

No need to use apply or a loop, compare the output of nunique to the number of columns:

df[&#39;result&#39;] = df.nunique(axis=1).ne(df.shape[1])

Output:

   mark1  mark2  mark3  mark4  mark5  result
0      7      2      3      7      7    True
1      3      4      3      2      7    True
2      1      6      5      2      7   False
3      5      5      6      3      1    True

If you want a more efficient method and assuming a reasonable number of columns (less than a thousand) and numbers, you could use [tag:numpy] to sort the values, compute the diff and check whether any value is 0:

import numpy as np
df[&#39;result&#39;] = (np.diff(np.sort(df), axis=1)==0).any(axis=1)

Output:

   mark1  mark2  mark3  mark4  mark5  result
0      7      2      3      7      7    True
1      3      4      3      2      7    True
2      1      6      5      2      7   False
3      5      5      6      3      1    True

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Pandas比较多列的值

问题

答案2

答案3

SQLAlchemy 2.x：立即加载连接的集合查询

‘模块’ 对象在Python中不可调用

如何创建带有子模块的Python C++扩展，可以被导入。

如何在Django管理界面中将空值分配给对象属性？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。