How to find duplicate columns in Pandas?
Question
I need to dynamically remove duplicate columns in pandas, where all the values are the same across all records.
For example:
df:
Id ProductName ProductSize ProductSize ProductDesc Quantity SoldCount Sales
1 Shoes 9 9 Shoes 143 143 6374
2 Bag XL XL Bag 342 342 2839
3 Laptop 16INCH 16INCH Laptop 452 452 8293
4 Shoes 9 9 Shoes 143 143 3662
5 Laptop 14INCH 14INCH Laptop 452 452 7263
In the DataFrame above, you can see there are some duplicate columns with the exact same name, as well as duplicate values across all records under different column names. I am trying to remove those columns, keeping the first occurrence by default.
df_output:
Id ProductName ProductSize Quantity Sales
1 Shoes 9 143 6374
2 Bag XL 342 2839
3 Laptop 16INCH 452 8293
4 Shoes 9 143 3662
5 Laptop 14INCH 452 7263
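For anyone following along, the example DataFrame above can be reconstructed like this (a sketch; the duplicate "ProductSize" column name is passed in explicitly):

```python
import pandas as pd

# Rebuild the question's example; note the two literal "ProductSize" columns
data = [
    [1, "Shoes", "9", "9", "Shoes", 143, 143, 6374],
    [2, "Bag", "XL", "XL", "Bag", 342, 342, 2839],
    [3, "Laptop", "16INCH", "16INCH", "Laptop", 452, 452, 8293],
    [4, "Shoes", "9", "9", "Shoes", 143, 143, 3662],
    [5, "Laptop", "14INCH", "14INCH", "Laptop", 452, 452, 7263],
]
columns = ["Id", "ProductName", "ProductSize", "ProductSize",
           "ProductDesc", "Quantity", "SoldCount", "Sales"]
df = pd.DataFrame(data, columns=columns)
print(df.columns.tolist())
```

Duplicate column names are legal in pandas, which is why the example is built from a list of rows plus an explicit `columns` list rather than a dict.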
Answer 1 (Score: 1)
Approach 1 - uses transpose
Transpose the DataFrame so columns become rows, find the duplicate rows with the duplicated() method (keeping the first occurrence), and then select only the surviving columns from the original DataFrame. The result is assigned to df_output.
# Transpose the DataFrame to make columns into rows
transposed_df = df.transpose()
# Flag duplicate columns (excluding the first occurrence)
duplicate_columns = transposed_df.duplicated(keep='first')
# Select the surviving columns by position; a name-based lookup such as
# df[unique_columns] would re-select both copies of a repeated column name
df_output = df.loc[:, ~duplicate_columns.values].copy()
# Print the resulting DataFrame
print(df_output)
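Here is a self-contained run of the transpose idea on a slice of the question's data (a sketch; a positional boolean mask is used so the repeated "ProductSize" name is handled correctly):

```python
import pandas as pd

data = [
    [1, "Shoes", "9", "9", "Shoes", 143, 143, 6374],
    [2, "Bag", "XL", "XL", "Bag", 342, 342, 2839],
    [3, "Laptop", "16INCH", "16INCH", "Laptop", 452, 452, 8293],
]
columns = ["Id", "ProductName", "ProductSize", "ProductSize",
           "ProductDesc", "Quantity", "SoldCount", "Sales"]
df = pd.DataFrame(data, columns=columns)

# Transpose so each original column becomes a row, then flag duplicate rows
dup_mask = df.transpose().duplicated(keep="first")
# Select the surviving columns by position
df_output = df.loc[:, ~dup_mask.values]
print(df_output.columns.tolist())
```

ProductDesc and SoldCount are dropped because their values duplicate ProductName and Quantity respectively, and the second ProductSize column is dropped as an exact name-and-value duplicate.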
What if the Ids contain duplicates?
In this variant, the index is reset first with df.reset_index(inplace=True) so that the Id index becomes a regular column. After the duplicate columns are removed, Id is set as the index again with df_output.set_index('Id', inplace=True). Resetting and reassigning the index ensures that rows with duplicate Ids are preserved in the result.
# Reset the index to convert the Id index into a regular column
df.reset_index(inplace=True)
# Transpose the DataFrame to make columns into rows
transposed_df = df.transpose()
# Flag duplicate columns (excluding the first occurrence)
duplicate_columns = transposed_df.duplicated(keep='first')
# Select the surviving columns by position (handles repeated column names)
df_output = df.loc[:, ~duplicate_columns.values].copy()
# Set the Id column as the index again
df_output.set_index('Id', inplace=True)
print(df_output)
Approach 2 - uses the nunique() method to identify columns with only one unique value
Note that this drops constant columns (a single value repeated across all rows) rather than columns whose values duplicate another column's.
# Get the count of unique values per column
value_counts = df.nunique()
# Keep only the columns with more than one unique value
unique_columns = value_counts[value_counts > 1].index
df_output = df[unique_columns].copy()
# Print the resulting DataFrame
print(df_output)
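A small, self-contained illustration of what the nunique() filter actually removes (a sketch; the constant "Status" column is invented for this demonstration and is not part of the question's data):

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3],
    "ProductName": ["Shoes", "Bag", "Laptop"],
    "Status": ["active", "active", "active"],  # invented constant column
})

# Count distinct values per column and keep columns with more than one
value_counts = df.nunique()
unique_columns = value_counts[value_counts > 1].index
df_output = df[unique_columns].copy()
print(df_output.columns.tolist())
```

Only Status is dropped here: it has a single unique value, while Id and ProductName vary across rows.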
What if the Ids are duplicated?
After keeping only the unique columns, rows with duplicate Ids are removed using df_output[~df_output['Id'].duplicated()], and the Id column is then set as the index with df_output.set_index('Id', inplace=True). This way, duplicate Ids can be handled while duplicate columns are removed based on the uniqueness of their values.
# Get the count of unique values per column
value_counts = df.nunique()
# Keep only the columns with more than one unique value
unique_columns = value_counts[value_counts > 1].index
df_output = df[unique_columns].copy()
# Remove rows with duplicate Ids, keeping the first occurrence
df_output = df_output[~df_output['Id'].duplicated()]
# Set the Id column as the index
df_output.set_index('Id', inplace=True)
print(df_output)
Answer 2 (Score: 0)
Use transpose together with drop_duplicates:
df.T.drop_duplicates().T
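A self-contained check of the one-liner on a slice of the question's data (a sketch; note that the double transpose converts all columns to object dtype, so numeric columns may need to be restored afterwards with e.g. infer_objects()):

```python
import pandas as pd

data = [
    [1, "Shoes", "9", "9", "Shoes", 143, 143, 6374],
    [2, "Bag", "XL", "XL", "Bag", 342, 342, 2839],
]
columns = ["Id", "ProductName", "ProductSize", "ProductSize",
           "ProductDesc", "Quantity", "SoldCount", "Sales"]
df = pd.DataFrame(data, columns=columns)

# Transpose, drop duplicate rows (i.e. duplicate columns), transpose back
df_output = df.T.drop_duplicates().T
print(df_output.columns.tolist())
```

drop_duplicates compares row values regardless of the (here duplicated) index labels, so both same-name and different-name duplicate columns are removed, keeping the first occurrence.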