Pandas: Change subset of rows that contain duplicate values for a particular column based on values across all duplicates

Question

I'm new to pandas and trying to understand how to modify a subset of rows that share a value in a particular column, where the choice of which rows to change depends on a condition evaluated across those duplicates.

Say I have a (contrived) dataframe like so:

    Class      Length     Head Teacher   Premium Course                       
0   Maths      Medium     Mr. Bloggs     Yes
1   English    Short      Mr. Plum       Yes
2   English    Long       Mrs. Green     Yes
3   English    Medium     Mr. Top        Yes 
4   Science    Long       Mrs. Blue      Yes    
5   Science    Long       Mr. Red        Yes
6   ...

Wherever a class appears more than once, I want to replace the Head Teacher on all of its rows with the Head Teacher from the longest class, and remove the Premium Course value from every duplicate that is not the longest class. If the duplicate classes all have the same length, simply take the teacher from the first duplicate and keep the Premium Course value only on that first duplicate, i.e.:

    Class      Length     Head Teacher   Premium Course                       
0   Maths      Medium     Mr. Bloggs     Yes
1   English    Short      Mrs. Green     
2   English    Long       Mrs. Green     Yes
3   English    Medium     Mrs. Green      
4   Science    Long       Mrs. Blue      Yes    
5   Science    Long       Mrs. Blue
6   ... 

In plain Python I would typically use loops, conditional statements, etc., and build a new list in memory, but I'm trying to determine the best approach in pandas.

I've been looking at the duplicated and groupby functions but have been unable to land on a solution. Any advice or help would be appreciated. I'm trying to make the shift to thinking in a "vectorized" way.

Answer 1

Score: 1

Example Code

import pandas as pd
data1 = {'Class': ['Maths', 'English', 'English', 'English', 'Science', 'Science'], 
         'Length': ['Medium', 'Short', 'Long', 'Medium', 'Long', 'Long'], 
         'Head Teacher': ['Mr. Bloggs', 'Mr. Plum', 'Mrs. Green', 'Mr. Top', 'Mrs. Blue', 'Mr. Red'], 
         'Premium Course': ['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes']}
df = pd.DataFrame(data1)

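Step 1

Build a boolean condition that is True for the longest row of each class (idxmax returns the first such row on ties)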

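# Rank the lengths so they can be compared numerically
m = {'Short': 0, 'Medium': 1, 'Long': 2}
# True only at the index label of the longest row in each class
cond = df.groupby('Class')['Length'].transform(lambda x: x.index == x.map(m).idxmax())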

cond

0     True
1    False
2     True
3    False
4     True
5    False
Name: Length, dtype: bool
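
The same condition can also be built without a per-group lambda. The following is a minimal sketch, assuming the same data and the same mapping m as above; rank and longest_idx are names introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Class': ['Maths', 'English', 'English', 'English', 'Science', 'Science'],
    'Length': ['Medium', 'Short', 'Long', 'Medium', 'Long', 'Long'],
})
m = {'Short': 0, 'Medium': 1, 'Long': 2}

# Numeric rank of each length ('rank' is a name introduced for this sketch)
rank = df['Length'].map(m)
# Index label of the longest row per class; idxmax returns the first label on ties
longest_idx = rank.groupby(df['Class']).idxmax()
# Boolean Series aligned with df: True only on the chosen row of each class
cond = pd.Series(df.index.isin(longest_idx), index=df.index)

print(cond.tolist())
```

Because the result is built with isin, cond is a plain boolean Series regardless of the dtype of the Length column.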

Step 2

Edit the columns

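# Keep the teacher only where cond is True, then fill the gaps (the final bfill is not grouped, so this assumes each class's rows are adjacent)
df['Head Teacher'] = df['Head Teacher'].where(cond).groupby(df['Class']).ffill().bfill()
# Set Premium Course to NaN on every row that was not selected
df['Premium Course'] = df['Premium Course'].where(cond)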

df

     Class  Length Head Teacher Premium Course
0    Maths  Medium   Mr. Bloggs            Yes
1  English   Short   Mrs. Green            NaN
2  English    Long   Mrs. Green            Yes
3  English  Medium   Mrs. Green            NaN
4  Science    Long    Mrs. Blue            Yes
5  Science    Long    Mrs. Blue            NaN
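
The ffill().bfill() chain gives the right result here because the rows of each class sit next to each other. The sketch below is a hedged variant for the case where classes are interleaved. The row ordering is hypothetical, chosen only to show the difference, and longest_idx is a name introduced here. It broadcasts the teacher with transform('first') so values cannot cross class boundaries, and writes an empty string instead of NaN to match the blank cells in the question's desired output.

```python
import pandas as pd

# Same rows as in the question, but with the classes interleaved (hypothetical ordering)
df = pd.DataFrame({
    'Class': ['English', 'Science', 'English', 'Maths', 'Science', 'English'],
    'Length': ['Short', 'Long', 'Long', 'Medium', 'Long', 'Medium'],
    'Head Teacher': ['Mr. Plum', 'Mrs. Blue', 'Mrs. Green', 'Mr. Bloggs', 'Mr. Red', 'Mr. Top'],
    'Premium Course': ['Yes'] * 6,
})
m = {'Short': 0, 'Medium': 1, 'Long': 2}

# One index label per class: the longest row, first one on ties
longest_idx = df['Length'].map(m).groupby(df['Class']).idxmax()
cond = pd.Series(df.index.isin(longest_idx), index=df.index)

# Keep the teacher only on the chosen row, then broadcast it within each class
df['Head Teacher'] = df['Head Teacher'].where(cond).groupby(df['Class']).transform('first')
# Blank out Premium Course on all other rows (empty string rather than NaN)
df['Premium Course'] = df['Premium Course'].where(cond, '')

print(df)
```

With this ordering, a grouped ffill followed by an ungrouped bfill would fill row 0 (English) from row 1 (Science), which is why the grouped transform is used in this sketch.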

Answer 2

Score: 1

Convert the Length column to an ordered Categorical, so that a mask can be built with DataFrame.sort_values and DataFrame.duplicated; DataFrame.sort_index restores the original row order. Series.mask then sets the non-matching values to NaN, and GroupBy.transform with 'first' fills each class with its first non-NaN value:

df['Length'] = pd.Categorical(df['Length'], 
                              categories=['Long','Medium','Short'],
                              ordered=True)

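# A stable sort keeps the original order among equal lengths, so ties resolve to the first duplicate
mask = df.sort_values('Length', kind='stable').duplicated(['Class']).sort_index()
df['Head Teacher'] = df['Head Teacher'].mask(mask).groupby(df['Class']).transform('first')
df['Premium Course'] = df['Premium Course'].mask(mask)

print(df)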
     Class  Length Head Teacher Premium Course
0    Maths  Medium   Mr. Bloggs            Yes
1  English   Short   Mrs. Green            NaN
2  English    Long   Mrs. Green            Yes
3  English  Medium   Mrs. Green            NaN
4  Science    Long    Mrs. Blue            Yes
5  Science    Long    Mrs. Blue            NaN
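
The code above overwrites df['Length'] with a Categorical. If the column should stay as plain strings, a related approach is sketched below under that assumption. It is not the answer's own code: length_cat, order, and winners are names introduced here. It sorts a separate ordered categorical with a stable sort, keeps one winning row per class with drop_duplicates, and maps the winner's teacher back onto every row.

```python
import pandas as pd

df = pd.DataFrame({
    'Class': ['Maths', 'English', 'English', 'English', 'Science', 'Science'],
    'Length': ['Medium', 'Short', 'Long', 'Medium', 'Long', 'Long'],
    'Head Teacher': ['Mr. Bloggs', 'Mr. Plum', 'Mrs. Green', 'Mr. Top', 'Mrs. Blue', 'Mr. Red'],
    'Premium Course': ['Yes'] * 6,
})

# Ordered categorical kept in a separate Series so df['Length'] stays as strings
length_cat = pd.Series(
    pd.Categorical(df['Length'], categories=['Long', 'Medium', 'Short'], ordered=True),
    index=df.index,
)
# Row labels from longest to shortest; the stable sort preserves original order on ties
order = length_cat.sort_values(kind='stable').index
# One row per class: the longest, or the first duplicate when lengths are equal
winners = df.loc[order].drop_duplicates('Class')

# Look up each class's winning teacher and assign it to every row of that class
df['Head Teacher'] = df['Class'].map(winners.set_index('Class')['Head Teacher'])
# Keep Premium Course only on the winning rows (empty string elsewhere)
df['Premium Course'] = df['Premium Course'].where(df.index.isin(winners.index), '')

print(df)
```

Mapping through a per-class lookup avoids any fill step, at the cost of building the small intermediate frame winners.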

huangapple
  • Posted on July 11, 2023 at 12:30:15
  • Please keep this link when reposting: https://go.coder-hub.com/76658724.html
