识别重复的Python列名并添加特定后缀的函数

huangapple go评论101阅读模式
英文:

Function to identify duplicate Python column names and add specific suffixes

问题

  1. def change_colnames(df, suffixes):
  2. new_columns = []
  3. seen_columns = {}
  4. for column in df.columns:
  5. match = re.match(r'^(.*?)(?:\.\d+)?$', column) # 匹配基本列名和可选后缀
  6. base_column = match.group(1) if match else column # 获取基本列名或保持原始列名
  7. if base_column in seen_columns:
  8. idx = seen_columns[base_column] # 获取基本列的索引
  9. new_column = f"{base_column} - {suffixes[idx]}" # 添加新后缀
  10. seen_columns[base_column] += 1 # 为下一次出现增加索引
  11. else:
  12. new_column = base_column
  13. seen_columns[base_column] = 0 # 使用索引0添加基本列
  14. new_columns.append(new_column)
  15. df.columns = new_columns
  16. return df
英文:

I have several dataframes with certain duplicate column names (they come from Excel files). My data looks a little something like this.

  1. original_df= pd.DataFrame({
  2. 'ID': [True, False, True],
  3. 'Revenue (USDm)': [1000, 2000, 1500],
  4. 'Location': ['London', 'New York', 'Paris'],
  5. 'Year': [2021, 2022, 2023],
  6. 'Sold Products': [10, 20, 30],
  7. 'Leased Products': [5, 10, 15],
  8. 'Investments': [7, 12, 8],
  9. 'Sold Products.1': [15, 25, 35],
  10. 'Leased Products.1': [8, 12, 16],
  11. 'Investments.1': [6, 9, 11],
  12. 'Sold Products.2': [5, 10, 15],
  13. 'Leased Products.2': [2, 5, 8],
  14. 'Investments.2': [3, 7, 4],
  15. 'QC Completed?': [True, True, False],
  16. })

When I read the df, pandas automatically adds the .1 and .2 suffixes to the duplicate column names. I tried to write a function that identifies the duplicates and adds a new set of suffixes from a list I provide, while removing the .1 and .2 where applicable.

The new suffixes list is suffixes = ['Vehicles','Electronics','Real Estate']

The output should look like this:

  1. desired_output = pd.DataFrame({
  2. 'ID': [True, False, True],
  3. 'Revenue (USDm)': [1000, 2000, 1500],
  4. 'Location': ['London', 'New York', 'Paris'],
  5. 'Year': [2021, 2022, 2023],
  6. 'Sold Products - Vehicles': [10, 20, 30],
  7. 'Leased Products - Vehicles': [5, 10, 15],
  8. 'Investments - Vehicles': [7, 12, 8],
  9. 'Sold Products - Electronics': [15, 25, 35],
  10. 'Leased Products - Electronics': [8, 12, 16],
  11. 'Investments - Electronics': [6, 9, 11],
  12. 'Sold Products - Real Estate': [5, 10, 15],
  13. 'Leased Products - Real Estate': [2, 5, 8],
  14. 'Investments - Real Estate': [3, 7, 4],
  15. 'QC Completed?': [True, True, False],
  16. })

The column names without any duplicates should remain the same but the columns which are duplicated get added the suffixes in order; If they also have the .1 and .2 suffixes, those get removed.

My function is below:

  1. def change_colnames(df, suffixes):
  2. new_columns = []
  3. seen_columns = {}
  4. for column in df.columns:
  5. match = re.match(r'^(.*?)(?:\.\d+)?$', column) # Match the base column name and optional suffix
  6. base_column = match.group(1) if match else column # Get the base column name or keep the original column name
  7. if base_column in seen_columns:
  8. idx = seen_columns[base_column] # Get the index of the base column
  9. new_column = f"{base_column} {suffixes[idx]}" # Append the new suffix
  10. seen_columns[base_column] += 1 # Increment the index for the next occurrence
  11. else:
  12. new_column = base_column
  13. seen_columns[base_column] = 0 # Add the base column with index 0
  14. new_columns.append(new_column)
  15. df.columns = new_columns
  16. return df

Unfortunately the first set of duplicate columns (those without the .1 and .2 suffixes) stays the same. The output I get is this:

  1. wrong_output = pd.DataFrame({
  2. 'ID': [True, False, True],
  3. 'Revenue (USDm)': [1000, 2000, 1500],
  4. 'Location': ['London', 'New York', 'Paris'],
  5. 'Year': [2021, 2022, 2023],
  6. 'Sold Products': [10, 20, 30],
  7. 'Leased Products': [5, 10, 15],
  8. 'Investments': [7, 12, 8],
  9. 'Sold Products - Vehicles': [15, 25, 35],
  10. 'Leased Products - Vehicles': [8, 12, 16],
  11. 'Investments - Vehicles': [6, 9, 11],
  12. 'Sold Products - Electronics': [5, 10, 15],
  13. 'Leased Products - Electronics': [2, 5, 8],
  14. 'Investments - Electronics': [3, 7, 4],
  15. 'QC Completed?': [True, True, False],
  16. })

Any idea how to fix it?

答案1

得分: 1

使用enumerate创建字典,并通过GroupBy.cumcount将计数器中的重复值映射:

  1. suffixes = ['Vehicles', 'Electronics', 'Real Estate']
  2. d = dict(enumerate(suffixes))
  3. s = original_df.columns.to_series()
  4. new = s.str.replace(r'\.\d+$', '', regex=True)
  5. mapped = (new.groupby(new).cumcount()
  6. .where(new.duplicated(keep=False)).map(d)
  7. .radd(' - ').fillna(''))
  8. original_df.columns = new + mapped

  1. print (original_df)
  2. ID Revenue (USDm) Location Year Sold Products - Vehicles \
  3. 0 True 1000 London 2021 10
  4. 1 False 2000 New York 2022 20
  5. 2 True 1500 Paris 2023 30
  6. Leased Products - Vehicles Investments - Vehicles \
  7. 0 5 7
  8. 1 10 12
  9. 2 15 8
  10. Sold Products - Electronics Leased Products - Electronics \
  11. 0 15 8
  12. 1 25 12
  13. 2 35 16
  14. Investments - Electronics Sold Products - Real Estate \
  15. 0 6 5
  16. 1 9 10
  17. 2 11 15
  18. Leased Products - Real Estate Investments - Real Estate QC Completed?
  19. 0 2 3 True
  20. 1 5 7 True
  21. 2 8 4 False
英文:

Create dictionary by enumerate and mapping duplicated values in counter by GroupBy.cumcount:

  1. suffixes = ['Vehicles','Electronics','Real Estate']
  2. d = dict(enumerate(suffixes))
  3. s = original_df.columns.to_series()
  4. new = s.str.replace(r'\.\d+$','', regex=True)
  5. mapped = (new.groupby(new).cumcount()
  6. .where(new.duplicated(keep=False)).map(d)
  7. .radd(' - ').fillna(''))
  8. original_df.columns = new + mapped

  1. print (original_df)
  2. ID Revenue (USDm) Location Year Sold Products - Vehicles \
  3. 0 True 1000 London 2021 10
  4. 1 False 2000 New York 2022 20
  5. 2 True 1500 Paris 2023 30
  6. Leased Products - Vehicles Investments - Vehicles \
  7. 0 5 7
  8. 1 10 12
  9. 2 15 8
  10. Sold Products - Electronics Leased Products - Electronics \
  11. 0 15 8
  12. 1 25 12
  13. 2 35 16
  14. Investments - Electronics Sold Products - Real Estate \
  15. 0 6 5
  16. 1 9 10
  17. 2 11 15
  18. Leased Products - Real Estate Investments - Real Estate QC Completed?
  19. 0 2 3 True
  20. 1 5 7 True
  21. 2 8 4 False

答案2

得分: 1

你可以使用正则表达式和 str.replace 进行处理,这里还有一个用于灵活性的自定义函数:

  1. import re
  2. # 识别重复列
  3. dup_cols = original_df.filter(regex=r'\.\d+$').columns
  4. # 获取重复列的基本名称
  5. base = dup_cols.str.replace(r'\.\d+$', '').unique()
  6. # ['Sold Products', 'Leased Products', 'Investments']
  7. # 创建模式
  8. pattern = fr"^({'|'.join(map(re.escape, base))})(\.\d+)?$"
  9. # '^(Sold\\ Products|Leased\\ Products|Investments)(\\.\\d+)?$'
  10. suffixes = ['Vehicles', 'Electronics', 'Real Estate']
  11. dic = dict(enumerate(suffixes))
  12. def f(m):
  13. suffix = m.group(2)
  14. if suffix:
  15. suffix = dic.get(int(suffix[1:]), '')
  16. else:
  17. suffix = dic[0]
  18. return m.group(1) + ' - ' + suffix
  19. # 基于模式进行替换
  20. original_df.columns = original_df.columns.str.replace(pattern, f, regex=True)

输出:

  1. ID Revenue (USDm) Location Year Sold Products - Vehicles Leased Products - Vehicles Investments - Vehicles Sold Products - Electronics Leased Products - Electronics \
  2. 0 True 1000 London 2021 10 5 7 15 8
  3. 1 False 2000 New York 2022 20 10 12 25 12
  4. 2 True 1500 Paris 2023 30 15 8 35 16
  5. Investments - Electronics Sold Products - Real Estate Leased Products - Real Estate Investments - Real Estate QC Completed?
  6. 0 6 5 2 3 True
  7. 1 9 10 5 7 True
  8. 2 11 15 8 4 False
英文:

You could use a regex for that and str.replace, here with a custom function for flexibility:

  1. import re
  2. # identify duplicated columns
  3. dup_cols = original_df.filter(regex=r'\.\d+$').columns
  4. # get the base names of the duplicates
  5. base = dup_cols.str.replace(r'\.\d+$', '').unique()
  6. # ['Sold Products', 'Leased Products', 'Investments']
  7. # craft a pattern
  8. pattern = fr"^({'|'.join(map(re.escape, base))})(\.\d+)?$"
  9. # '^(Sold\\ Products|Leased\\ Products|Investments)(\\.\\d+)?$'
  10. suffixes = ['Vehicles','Electronics','Real Estate']
  11. dic = dict(enumerate(suffixes))
  12. def f(m):
  13. suffix = m.group(2)
  14. if suffix:
  15. suffix = dic.get(int(suffix[1:]), '')
  16. else:
  17. suffix = dic[0]
  18. return m.group(1) + ' - ' + suffix
  19. # replace based on pattern
  20. original_df.columns = original_df.columns.str.replace(pattern, f, regex=True)

Output:

  1. ID Revenue (USDm) Location Year Sold Products - Vehicles Leased Products - Vehicles Investments - Vehicles Sold Products - Electronics Leased Products - Electronics \
  2. 0 True 1000 London 2021 10 5 7 15 8
  3. 1 False 2000 New York 2022 20 10 12 25 12
  4. 2 True 1500 Paris 2023 30 15 8 35 16
  5. Investments - Electronics Sold Products - Real Estate Leased Products - Real Estate Investments - Real Estate QC Completed?
  6. 0 6 5 2 3 True
  7. 1 9 10 5 7 True
  8. 2 11 15 8 4 False

huangapple
  • 本文由 发表于 2023年6月22日 17:47:01
  • 转载请务必保留本文链接:https://go.coder-hub.com/76530606.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定