How to find duplicate columns in Pandas?
Question
I need to dynamically remove duplicate columns in pandas, where all the values are the same across all records.
For example:
df:
Id ProductName ProductSize ProductSize ProductDesc Quantity SoldCount Sales
1 Shoes 9 9 Shoes 143 143 6374
2 Bag XL XL Bag 342 342 2839
3 Laptop 16INCH 16INCH Laptop 452 452 8293
4 Shoes 9 9 Shoes 143 143 3662
5 Laptop 14INCH 14INCH Laptop 452 452 7263
In the DataFrame above, you can see there are some duplicate columns with the exact same name, as well as duplicate values across all records under different column names. I am trying to remove those columns, keeping the first occurrence by default.
df_output:
Id ProductName ProductSize Quantity Sales
1 Shoes 9 143 6374
2 Bag XL 342 2839
3 Laptop 16INCH 452 8293
4 Shoes 9 143 3662
5 Laptop 14INCH 452 7263
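For anyone following along, the example DataFrame above can be reconstructed like this (a sketch; the duplicate "ProductSize" column name is passed in explicitly):

```python
import pandas as pd

# Rebuild the question's example; note the two literal "ProductSize" columns
data = [
    [1, "Shoes", "9", "9", "Shoes", 143, 143, 6374],
    [2, "Bag", "XL", "XL", "Bag", 342, 342, 2839],
    [3, "Laptop", "16INCH", "16INCH", "Laptop", 452, 452, 8293],
    [4, "Shoes", "9", "9", "Shoes", 143, 143, 3662],
    [5, "Laptop", "14INCH", "14INCH", "Laptop", 452, 452, 7263],
]
columns = ["Id", "ProductName", "ProductSize", "ProductSize",
           "ProductDesc", "Quantity", "SoldCount", "Sales"]
df = pd.DataFrame(data, columns=columns)
print(df.columns.tolist())
```

Duplicate column names are legal in pandas, which is why the example is built from a list of rows plus an explicit `columns` list rather than a dict.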
Answer 1 (Score: 1)
Approach 1 - uses transpose
Transpose the DataFrame so columns become rows, find the duplicate rows with the duplicated() method (keeping the first occurrence), and then select only the surviving columns from the original DataFrame. The result is assigned to df_output.
# Transpose the DataFrame to make columns into rows
transposed_df = df.transpose()
# Flag duplicate columns (excluding the first occurrence)
duplicate_columns = transposed_df.duplicated(keep='first')
# Select the surviving columns by position; a name-based lookup such as
# df[unique_columns] would re-select both copies of a repeated column name
df_output = df.loc[:, ~duplicate_columns.values].copy()
# Print the resulting DataFrame
print(df_output)
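Here is a self-contained run of the transpose idea on a slice of the question's data (a sketch; a positional boolean mask is used so the repeated "ProductSize" name is handled correctly):

```python
import pandas as pd

data = [
    [1, "Shoes", "9", "9", "Shoes", 143, 143, 6374],
    [2, "Bag", "XL", "XL", "Bag", 342, 342, 2839],
    [3, "Laptop", "16INCH", "16INCH", "Laptop", 452, 452, 8293],
]
columns = ["Id", "ProductName", "ProductSize", "ProductSize",
           "ProductDesc", "Quantity", "SoldCount", "Sales"]
df = pd.DataFrame(data, columns=columns)

# Transpose so each original column becomes a row, then flag duplicate rows
dup_mask = df.transpose().duplicated(keep="first")
# Select the surviving columns by position
df_output = df.loc[:, ~dup_mask.values]
print(df_output.columns.tolist())
```

ProductDesc and SoldCount are dropped because their values duplicate ProductName and Quantity respectively, and the second ProductSize column is dropped as an exact name-and-value duplicate.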
What if the Ids contain duplicates?
In this variant, the index is reset first with df.reset_index(inplace=True) so that the Id index becomes a regular column. After the duplicate columns are removed, Id is set as the index again with df_output.set_index('Id', inplace=True). Resetting and reassigning the index ensures that rows with duplicate Ids are preserved in the result.
# Reset the index to convert the Id index into a regular column
df.reset_index(inplace=True)
# Transpose the DataFrame to make columns into rows
transposed_df = df.transpose()
# Flag duplicate columns (excluding the first occurrence)
duplicate_columns = transposed_df.duplicated(keep='first')
# Select the surviving columns by position (handles repeated column names)
df_output = df.loc[:, ~duplicate_columns.values].copy()
# Set the Id column as the index again
df_output.set_index('Id', inplace=True)
print(df_output)
Approach 2 - uses the nunique() method to identify columns with only one unique value
Note that this drops constant columns (a single value repeated across all rows) rather than columns whose values duplicate another column's.
# Get the count of unique values per column
value_counts = df.nunique()
# Keep only the columns with more than one unique value
unique_columns = value_counts[value_counts > 1].index
df_output = df[unique_columns].copy()
# Print the resulting DataFrame
print(df_output)
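A small, self-contained illustration of what the nunique() filter actually removes (a sketch; the constant "Status" column is invented for this demonstration and is not part of the question's data):

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3],
    "ProductName": ["Shoes", "Bag", "Laptop"],
    "Status": ["active", "active", "active"],  # invented constant column
})

# Count distinct values per column and keep columns with more than one
value_counts = df.nunique()
unique_columns = value_counts[value_counts > 1].index
df_output = df[unique_columns].copy()
print(df_output.columns.tolist())
```

Only Status is dropped here: it has a single unique value, while Id and ProductName vary across rows.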
What if the Ids are duplicated?
After keeping only the unique columns, rows with duplicate Ids are removed using df_output[~df_output['Id'].duplicated()], and the Id column is then set as the index with df_output.set_index('Id', inplace=True). This way, duplicate Ids can be handled while duplicate columns are removed based on the uniqueness of their values.
# Get the count of unique values per column
value_counts = df.nunique()
# Keep only the columns with more than one unique value
unique_columns = value_counts[value_counts > 1].index
df_output = df[unique_columns].copy()
# Remove rows with duplicate Ids, keeping the first occurrence
df_output = df_output[~df_output['Id'].duplicated()]
# Set the Id column as the index
df_output.set_index('Id', inplace=True)
print(df_output)
Answer 2 (Score: 0)
Use transpose together with drop_duplicates:
df.T.drop_duplicates().T
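A self-contained check of the one-liner on a slice of the question's data (a sketch; note that the double transpose converts all columns to object dtype, so numeric columns may need to be restored afterwards with e.g. infer_objects()):

```python
import pandas as pd

data = [
    [1, "Shoes", "9", "9", "Shoes", 143, 143, 6374],
    [2, "Bag", "XL", "XL", "Bag", 342, 342, 2839],
]
columns = ["Id", "ProductName", "ProductSize", "ProductSize",
           "ProductDesc", "Quantity", "SoldCount", "Sales"]
df = pd.DataFrame(data, columns=columns)

# Transpose, drop duplicate rows (i.e. duplicate columns), transpose back
df_output = df.T.drop_duplicates().T
print(df_output.columns.tolist())
```

drop_duplicates compares row values regardless of the (here duplicated) index labels, so both same-name and different-name duplicate columns are removed, keeping the first occurrence.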