How to find duplicate columns in Pandas?

Question


I need to dynamically remove duplicate columns in pandas, i.e. columns whose values are identical across all records.

For example:

df:

Id  ProductName  ProductSize ProductSize  ProductDesc  Quantity SoldCount  Sales
1   Shoes        9            9           Shoes         143     143         6374
2   Bag          XL           XL          Bag           342     342         2839
3   Laptop       16INCH       16INCH      Laptop        452     452         8293
4   Shoes        9            9           Shoes         143     143         3662
5   Laptop       14INCH       14INCH      Laptop        452     452         7263

In the table above you can see that some columns are duplicated: two share the exact same name, and others repeat the same values across all records under a different name. I am trying to remove those columns, keeping the first occurrence by default.

df_output:

Id  ProductName  ProductSize Quantity Sales
1   Shoes        9           143     6374
2   Bag          XL          342     2839
3   Laptop       16INCH      452     8293
4   Shoes        9           143     3662
5   Laptop       14INCH      452     7263

Answer 1

Score: 1


Approach 1 - using transpose
Transpose the DataFrame so that columns become rows, then flag the duplicate rows (i.e. duplicate columns) with the duplicated() method, keeping only the first occurrence. Finally, select the remaining columns and assign the result to df_output.

# Transpose the DataFrame to make columns as rows
transposed_df = df.transpose()

# Find duplicate columns (excluding the first occurrence)
duplicate_columns = transposed_df.duplicated(keep='first')

# Transpose back, keeping only the unique rows (i.e. the columns to
# retain); selecting by label with df[...] would return both copies of
# a repeated column name such as ProductSize, so select on the
# transposed frame instead and restore the dtypes lost in the transpose
df_output = transposed_df[~duplicate_columns].transpose().infer_objects()

# Print the resulting DataFrame
print(df_output)
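As a quick end-to-end check, the steps above can be condensed into a boolean column mask applied to the sample data from the question (the frame below is rebuilt from that example):

```python
import pandas as pd

# Rebuild the sample frame from the question, including the two
# columns that share the name ProductSize
df = pd.DataFrame(
    [
        [1, "Shoes", "9", "9", "Shoes", 143, 143, 6374],
        [2, "Bag", "XL", "XL", "Bag", 342, 342, 2839],
        [3, "Laptop", "16INCH", "16INCH", "Laptop", 452, 452, 8293],
        [4, "Shoes", "9", "9", "Shoes", 143, 143, 3662],
        [5, "Laptop", "14INCH", "14INCH", "Laptop", 452, 452, 7263],
    ],
    columns=["Id", "ProductName", "ProductSize", "ProductSize",
             "ProductDesc", "Quantity", "SoldCount", "Sales"],
)

# Transposing turns each column into a row, so duplicated() flags
# every column whose values repeat an earlier column
mask = df.T.duplicated(keep="first")

# Select the surviving columns positionally, which copes with the
# repeated ProductSize label
df_output = df.loc[:, ~mask.values].copy()
print(list(df_output.columns))
# ['Id', 'ProductName', 'ProductSize', 'Quantity', 'Sales']
```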

What if the IDs contain duplicates?
In this variant, the index is reset at the beginning using df.reset_index(inplace=True) to convert the Id column into a regular column. After removing the duplicate columns, the Id column is set as the index again using df_output.set_index('Id', inplace=True).

By resetting and reassigning the index, you ensure that duplicate IDs are preserved in the resulting DataFrame.

# Reset the index to convert the Id column to a regular column
df.reset_index(inplace=True)

# Transpose the DataFrame to make columns as rows
transposed_df = df.transpose()

# Find duplicate columns (excluding the first occurrence)
duplicate_columns = transposed_df.duplicated(keep='first')

# Transpose back, keeping only the unique rows (i.e. the columns to
# retain); selecting by label with df[...] would return both copies of
# a repeated column name, so select on the transposed frame instead
# and restore the dtypes lost in the transpose
df_output = transposed_df[~duplicate_columns].transpose().infer_objects()

# Set the Id column as the index again
df_output.set_index('Id', inplace=True)

print(df_output)
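A minimal sketch of this variant, assuming a small hypothetical frame whose Id index contains a duplicate value:

```python
import pandas as pd

# Hypothetical frame with Id as the index, containing a duplicate Id
df = pd.DataFrame(
    [["Shoes", "9", "9"], ["Bag", "XL", "XL"], ["Shoes", "9", "9"]],
    columns=["ProductName", "ProductSize", "ProductSize"],
    index=pd.Index([1, 2, 1], name="Id"),
)

# Turn the Id index into a regular column so it takes part in the
# duplicate-column check
df = df.reset_index()

# Flag columns that duplicate an earlier one
duplicate_columns = df.T.duplicated(keep="first")

# Keep the non-duplicate columns, then restore Id as the index;
# the duplicate Id value 1 survives the round trip
df_output = df.loc[:, ~duplicate_columns.values].set_index("Id")
print(list(df_output.columns))
# ['ProductName', 'ProductSize']
```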

Approach 2 - utilizes the nunique() method to identify columns with only one unique value

# Get the counts of unique values per column
value_counts = df.apply(lambda x: x.nunique())

# Keep the columns that have more than one unique value
unique_columns = value_counts[value_counts > 1].index

# Keep only the unique columns
df_output = df[unique_columns].copy()

# Print the resulting DataFrame
print(df_output)
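Note that this criterion drops constant columns (a single repeated value) rather than columns that duplicate another column. A minimal sketch, using a hypothetical frame with a constant Currency column:

```python
import pandas as pd

# Hypothetical frame with a constant Currency column
df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Quantity": [143, 342, 452],
    "Currency": ["USD", "USD", "USD"],  # only one unique value
})

# Count the unique values per column
value_counts = df.apply(lambda x: x.nunique())

# Keep only columns with more than one unique value; the constant
# Currency column is dropped
df_output = df[value_counts[value_counts > 1].index].copy()
print(list(df_output.columns))
# ['Id', 'Quantity']
```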

What if the Ids are duplicated?
After keeping only the unique columns, the duplicate IDs are identified using df_output.index.duplicated(). The index is then reset to turn the ID column into a regular column, the rows with duplicate IDs are removed using df_output[~df_output['Id'].duplicated()], and finally the ID column is set as the index again using df_output.set_index('Id', inplace=True).

This way, you can handle duplicate IDs while removing duplicate columns based on the uniqueness of the values.

# Get the counts of unique values per column
value_counts = df.apply(lambda x: x.nunique())

# Keep the columns that have more than one unique value
unique_columns = value_counts[value_counts > 1].index

# Keep only the unique columns
df_output = df[unique_columns].copy()

# Identify duplicate IDs
duplicate_ids = df_output.index[df_output.index.duplicated()]

# Reset index for duplicate IDs
df_output.reset_index(inplace=True)

# Remove duplicate IDs from the DataFrame
df_output = df_output[~df_output['Id'].duplicated()]

# Set the ID column as the index again
df_output.set_index('Id', inplace=True)

print(df_output)

Answer 2

Score: 0


Use transpose together with drop_duplicates:

df.T.drop_duplicates().T
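For instance, on a small frame modeled on the question's data. One caveat: the double transpose upcasts a mixed-dtype frame to object, so numeric dtypes may need restoring afterwards, e.g. with infer_objects():

```python
import pandas as pd

# Small frame modeled on the question's data, with a repeated
# ProductSize column and a SoldCount column equal to Quantity
df = pd.DataFrame(
    [[1, "Shoes", "9", "9", 143, 143],
     [2, "Bag", "XL", "XL", 342, 342]],
    columns=["Id", "ProductName", "ProductSize", "ProductSize",
             "Quantity", "SoldCount"],
)

# Transpose, drop the duplicate rows (formerly columns), and
# transpose back; restore the dtypes lost in the round trip
df_output = df.T.drop_duplicates().T.infer_objects()
print(list(df_output.columns))
# ['Id', 'ProductName', 'ProductSize', 'Quantity']
```

Both the second ProductSize column (same name, same values) and SoldCount (different name, same values as Quantity) are removed in one step.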

huangapple
  • Published on 2023-05-14 19:25:49
  • Please retain this link when reposting: https://go.coder-hub.com/76247222.html