June 22, 2023 17:47:01

Function to identify duplicate Python column names and add specific suffixes

Question

I have several dataframes with certain duplicate column names (they come from Excel files). My data looks a little something like this.

original_df = pd.DataFrame({
    'ID': [True, False, True],
    'Revenue (USDm)': [1000, 2000, 1500],
    'Location': ['London', 'New York', 'Paris'],
    'Year': [2021, 2022, 2023],
    'Sold Products': [10, 20, 30],
    'Leased Products': [5, 10, 15],
    'Investments': [7, 12, 8],
    'Sold Products.1': [15, 25, 35],
    'Leased Products.1': [8, 12, 16],
    'Investments.1': [6, 9, 11],
    'Sold Products.2': [5, 10, 15],
    'Leased Products.2': [2, 5, 8],
    'Investments.2': [3, 7, 4],
    'QC Completed?': [True, True, False],
})

When I read the df, pandas automatically adds the .1 and .2 suffixes to the duplicate column names. I tried to write a function that identifies the duplicates and adds a new set of suffixes from a list I provide, while removing the .1 and .2 where applicable.

The new suffixes list is suffixes = ['Vehicles','Electronics','Real Estate']

The output should look like this:

desired_output = pd.DataFrame({
    'ID': [True, False, True],
    'Revenue (USDm)': [1000, 2000, 1500],
    'Location': ['London', 'New York', 'Paris'],
    'Year': [2021, 2022, 2023],
    'Sold Products - Vehicles': [10, 20, 30],
    'Leased Products - Vehicles': [5, 10, 15],
    'Investments - Vehicles': [7, 12, 8],
    'Sold Products - Electronics': [15, 25, 35],
    'Leased Products - Electronics': [8, 12, 16],
    'Investments - Electronics': [6, 9, 11],
    'Sold Products - Real Estate': [5, 10, 15],
    'Leased Products - Real Estate': [2, 5, 8],
    'Investments - Real Estate': [3, 7, 4],
    'QC Completed?': [True, True, False],
})

Column names without duplicates should stay the same, while duplicated columns get the suffixes in order. If they also carry the .1 or .2 suffix, it gets removed.

My function is below:

import re

def change_colnames(df, suffixes):
    new_columns = []
    seen_columns = {}

    for column in df.columns:
        match = re.match(r'^(.*?)(?:\.\d+)?$', column)  # Match the base column name and optional suffix
        base_column = match.group(1) if match else column  # Get the base column name or keep the original column name

        if base_column in seen_columns:
            idx = seen_columns[base_column]  # Get the index of the base column
            new_column = f"{base_column} - {suffixes[idx]}"  # Append the new suffix
            seen_columns[base_column] += 1  # Increment the index for the next occurrence
        else:
            new_column = base_column
            seen_columns[base_column] = 0  # Add the base column with index 0

        new_columns.append(new_column)

    df.columns = new_columns
    return df

Unfortunately the first set of duplicate columns (those without the .1 and .2 suffixes) stays the same. The output I get is this:

wrong_output = pd.DataFrame({
    'ID': [True, False, True],
    'Revenue (USDm)': [1000, 2000, 1500],
    'Location': ['London', 'New York', 'Paris'],
    'Year': [2021, 2022, 2023],
    'Sold Products': [10, 20, 30],
    'Leased Products': [5, 10, 15],
    'Investments': [7, 12, 8],
    'Sold Products - Vehicles': [15, 25, 35],
    'Leased Products - Vehicles': [8, 12, 16],
    'Investments - Vehicles': [6, 9, 11],
    'Sold Products - Electronics': [5, 10, 15],
    'Leased Products - Electronics': [2, 5, 8],
    'Investments - Electronics': [3, 7, 4],
    'QC Completed?': [True, True, False],
})

Any idea how to fix it?

Answer 1

Score: 1

Build a dictionary from the suffixes with enumerate, then use GroupBy.cumcount to number each occurrence of a base name and map those counters to suffixes, for duplicated names only:

suffixes = ['Vehicles', 'Electronics', 'Real Estate']
d = dict(enumerate(suffixes))

s = original_df.columns.to_series()

new = s.str.replace(r'\.\d+$', '', regex=True)

mapped = (new.groupby(new).cumcount()
             .where(new.duplicated(keep=False)).map(d)
             .radd(' - ').fillna(''))

original_df.columns = new + mapped

print(original_df)
      ID  Revenue (USDm)  Location  Year  Sold Products - Vehicles  \
0   True            1000    London  2021                        10   
1  False            2000  New York  2022                        20   
2   True            1500     Paris  2023                        30   

   Leased Products - Vehicles  Investments - Vehicles  \
0                           5                       7   
1                          10                      12   
2                          15                       8   

   Sold Products - Electronics  Leased Products - Electronics  \
0                           15                              8   
1                           25                             12   
2                           35                             16   

   Investments - Electronics  Sold Products - Real Estate  \
0                          6                            5   
1                          9                           10   
2                         11                           15   

   Leased Products - Real Estate  Investments - Real Estate  QC Completed?  
0                              2                          3           True  
1                              5                          7           True  
2                              8                          4          False
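
To make the chain above easier to follow, here is a small sketch that names each intermediate step on a shortened column list. The names `base`, `occurrence`, and `is_dup` are illustrative and not part of the answer, and this variant selects the renamed entries with `Series.where` instead of the `radd`/`fillna` combination, so treat it as one possible reading of the idea rather than the answer's exact code.

```python
import pandas as pd

suffixes = ["Vehicles", "Electronics", "Real Estate"]
# Shortened, hypothetical column list in the same shape as the question's data
cols = pd.Index(["ID", "Sold Products", "Sold Products.1", "Sold Products.2", "Year"])

s = cols.to_series()
base = s.str.replace(r"\.\d+$", "", regex=True)  # strip the '.1', '.2' endings
occurrence = base.groupby(base).cumcount()       # 0, 0, 1, 2, 0
is_dup = base.duplicated(keep=False)             # False, True, True, True, False

# Look up a suffix for every occurrence number, then keep the suffixed
# name only where the base name really occurs more than once.
suffix = occurrence.map(dict(enumerate(suffixes)))
new_cols = base.where(~is_dup, base + " - " + suffix).tolist()
print(new_cols)
```

Because `duplicated(keep=False)` flags every member of a repeated group, the first "Sold Products" column is renamed along with the `.1` and `.2` columns, which is the part the loop in the question misses.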

Answer 2

Score: 1

You could use a regex for that and str.replace, here with a custom function for flexibility:

import re

# identify duplicated columns
dup_cols = original_df.filter(regex=r'\.\d+$').columns

# get the base names of the duplicates
# (regex=True is required on pandas >= 2.0, where the default is a literal match)
base = dup_cols.str.replace(r'\.\d+$', '', regex=True).unique()
# ['Sold Products', 'Leased Products', 'Investments']

# craft a pattern
pattern = fr"^({'|'.join(map(re.escape, base))})(\.\d+)?$"
# '^(Sold\\ Products|Leased\\ Products|Investments)(\\.\\d+)?$'
suffixes = ['Vehicles', 'Electronics', 'Real Estate']
dic = dict(enumerate(suffixes))

def f(m):
    suffix = m.group(2)
    if suffix:
        suffix = dic.get(int(suffix[1:]), '')
    else:
        suffix = dic[0]
    return m.group(1) + ' - ' + suffix

# replace based on pattern
original_df.columns = original_df.columns.str.replace(pattern, f, regex=True)

Output:

      ID  Revenue (USDm)  Location  Year  Sold Products - Vehicles  Leased Products - Vehicles  Investments - Vehicles  Sold Products - Electronics  Leased Products - Electronics  \
0   True            1000    London  2021                        10                           5                       7                           15                              8   
1  False            2000  New York  2022                        20                          10                      12                           25                             12   
2   True            1500     Paris  2023                        30                          15                       8                           35                             16   

   Investments - Electronics  Sold Products - Real Estate  Leased Products - Real Estate  Investments - Real Estate  QC Completed?  
0                          6                            5                              2                          3           True  
1                          9                           10                              5                          7           True  
2                         11                           15                              8                          4          False
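
For comparison, the loop from the question can also be repaired directly. It leaves the first occurrence untouched because, when that column is visited, its base name is not yet in `seen_columns`, so the `else` branch runs and no suffix is added. A minimal sketch of a two-pass fix follows, assuming the helper works on a plain list of names; `rename_duplicates` is a hypothetical name and not something from either answer.

```python
import re
from collections import Counter


def rename_duplicates(columns, suffixes):
    """Return new names; base names that occur more than once get ' - <suffix>'."""
    # Strip the '.1', '.2', ... endings that pandas appends to repeated headers.
    bases = [re.sub(r"\.\d+$", "", str(col)) for col in columns]
    # First pass: count how often each base name occurs.
    totals = Counter(bases)

    seen = Counter()
    renamed = []
    # Second pass: every occurrence of a repeated name takes the next suffix,
    # including the first occurrence.
    for base in bases:
        if totals[base] > 1:
            renamed.append(f"{base} - {suffixes[seen[base]]}")
            seen[base] += 1
        else:
            renamed.append(base)
    return renamed


cols = ["ID", "Sold Products", "Investments",
        "Sold Products.1", "Investments.1",
        "Sold Products.2", "Investments.2", "QC Completed?"]
suffixes = ["Vehicles", "Electronics", "Real Estate"]
print(rename_duplicates(cols, suffixes))
```

With a DataFrame this would be applied as `df.columns = rename_duplicates(df.columns, suffixes)`. Two caveats: the sketch raises an IndexError when a name repeats more often than there are suffixes, and a header that legitimately ends in a dot followed by digits would be stripped as well.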
