Pandas: Change subset of rows that contain duplicate values for a particular column based on values across all duplicates
Question
I'm new to Pandas and trying to understand how to modify a subset of rows that have duplicate values for a particular column, with the decision of which rows to change being made based on a conditional check across those duplicates.
Say I have a (contrived) dataframe like so:
    Class      Length     Head Teacher   Premium Course                       
0   Maths      Medium     Mr. Bloggs     Yes
1   English    Short      Mr. Plum       Yes
2   English    Long       Mrs. Green     Yes
3   English    Medium     Mr. Top        Yes 
4   Science    Long       Mrs. Blue      Yes    
5   Science    Long       Mr. Red        Yes
6   ...
Wherever there are duplicate classes, I want to replace the Head Teacher across all the duplicates with the Head Teacher from the longest class, and remove the Premium Course value for all the duplicates that are not the longest class. If the duplicate classes are all the same length, then simply take the teacher from the first duplicate and keep the Premium Course value only on that first duplicate, i.e.:
    Class      Length     Head Teacher   Premium Course                       
0   Maths      Medium     Mr. Bloggs     Yes
1   English    Short      Mrs. Green     
2   English    Long       Mrs. Green     Yes
3   English    Medium     Mrs. Green      
4   Science    Long       Mrs. Blue      Yes    
5   Science    Long       Mrs. Blue
6   ... 
In Python I would typically use loops, conditional statements, etc. and build a new list in memory, but I'm trying to determine the best approach in pandas.
I've been looking at the duplicated and groupby functions but have been unable to land on a solution. Any advice would be appreciated. I'm trying to make the shift into thinking in a "vectorized" way.
Answer 1
Score: 1
Example Code
import pandas as pd
data1 = {'Class': ['Maths', 'English', 'English', 'English', 'Science', 'Science'], 
         'Length': ['Medium', 'Short', 'Long', 'Medium', 'Long', 'Long'], 
         'Head Teacher': ['Mr. Bloggs', 'Mr. Plum', 'Mrs. Green', 'Mr. Top', 'Mrs. Blue', 'Mr. Red'], 
         'Premium Course': ['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes']}
df = pd.DataFrame(data1)
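Step 1
Build the condition: True marks the row with the longest Length in each Class (the first such row when lengths tie)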
m = {'Short':0, 'Medium':1, 'Long':2}
cond = df.groupby('Class')['Length'].transform(lambda x: x.index == x.map(m).idxmax())
cond
0     True
1    False
2     True
3    False
4     True
5    False
Name: Length, dtype: bool
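To see what the one-liner in Step 1 is doing, it may help to split it into intermediate pieces. The sketch below is an equivalent formulation rather than the answerer's exact code; the names rank, best and cond_alt are introduced here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Class': ['Maths', 'English', 'English', 'English', 'Science', 'Science'],
    'Length': ['Medium', 'Short', 'Long', 'Medium', 'Long', 'Long'],
})

m = {'Short': 0, 'Medium': 1, 'Long': 2}

# Turn the Length labels into comparable numbers.
rank = df['Length'].map(m)

# For each Class, the index label of the row with the highest rank.
# idxmax returns the first label when several rows tie for the maximum.
best = rank.groupby(df['Class']).idxmax()

# True for exactly one row per Class: the one chosen above.
cond_alt = pd.Series(df.index.isin(best), index=df.index)
print(cond_alt)
```

Because idxmax picks the first label on ties, the two Science rows (both Long) resolve to row 4, which matches the "take the first duplicate" rule in the question.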
Step 2
Edit the columns
df['Head Teacher'] = df['Head Teacher'].where(cond).groupby(df['Class']).ffill().bfill()
df['Premium Course'] = df['Premium Course'].where(cond)
df
     Class  Length Head Teacher Premium Course
0    Maths  Medium   Mr. Bloggs            Yes
1  English   Short   Mrs. Green            NaN
2  English    Long   Mrs. Green            Yes
3  English  Medium   Mrs. Green            NaN
4  Science    Long    Mrs. Blue            Yes
5  Science    Long    Mrs. Blue            NaN
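One caveat worth checking against your own data: in Step 2 only the ffill is applied per group, while the trailing bfill runs over the whole column. That is harmless when all rows of a class sit next to each other, as in the example, but it can pull a teacher from a different class when classes interleave. The sketch below uses a hypothetical reordered frame (df2) to show the difference, and swaps in transform('first') as a per-group alternative.

```python
import pandas as pd

# Hypothetical data: classes interleave, and the longest English row
# comes after a Science row.
df2 = pd.DataFrame({
    'Class': ['English', 'Science', 'English', 'Science'],
    'Length': ['Short', 'Long', 'Long', 'Long'],
    'Head Teacher': ['Mr. Plum', 'Mrs. Blue', 'Mrs. Green', 'Mr. Red'],
})

m = {'Short': 0, 'Medium': 1, 'Long': 2}
cond = (df2.groupby('Class')['Length']
           .transform(lambda x: x.index == x.map(m).idxmax())
           .astype(bool))

# Keep the teacher only on the chosen row of each class.
kept = df2['Head Teacher'].where(cond)

# Original chain: grouped ffill, then an ungrouped bfill.
chained = kept.groupby(df2['Class']).ffill().bfill()

# Alternative: broadcast each group's first non-NaN value to all its rows.
per_group = kept.groupby(df2['Class']).transform('first')

print(pd.DataFrame({'Class': df2['Class'], 'chained': chained, 'per_group': per_group}))
```

In this reordered frame the chained version gives row 0 (English) the Science teacher, whereas transform('first') stays within each class.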
Answer 2
Score: 1
Use an ordered Categorical for the Length column, so that a mask can be built with DataFrame.sort_values and DataFrame.duplicated; Series.sort_index then restores the original row order. Series.mask sets the non-matching values to NaN, and GroupBy.transform('first') fills each group with its first non-NaN value:
df['Length'] = pd.Categorical(df['Length'], 
                              categories=['Long','Medium','Short'],
                              ordered=True)
mask = df.sort_values('Length').duplicated(['Class']).sort_index()
df['Head Teacher'] = df['Head Teacher'].mask(mask).groupby(df['Class']).transform('first')
df['Premium Course'] = df['Premium Course'].mask(mask)
print(df)
     Class  Length Head Teacher Premium Course
0    Maths  Medium   Mr. Bloggs            Yes
1  English   Short   Mrs. Green            NaN
2  English    Long   Mrs. Green            Yes
3  English  Medium   Mrs. Green            NaN
4  Science    Long    Mrs. Blue            Yes
5  Science    Long    Mrs. Blue            NaN
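A small point about ties: the approach above relies on rows of equal Length staying in their original order after sorting, so that duplicated keeps the first one. The default sort algorithm of sort_values is not documented as stable, so a cautious variant, sketched here as an assumption rather than as part of the original answer, passes kind='stable' explicitly.

```python
import pandas as pd

df = pd.DataFrame({
    'Class': ['Maths', 'English', 'English', 'English', 'Science', 'Science'],
    'Length': ['Medium', 'Short', 'Long', 'Medium', 'Long', 'Long'],
    'Head Teacher': ['Mr. Bloggs', 'Mr. Plum', 'Mrs. Green', 'Mr. Top', 'Mrs. Blue', 'Mr. Red'],
    'Premium Course': ['Yes'] * 6,
})

# Categories are listed longest first, so sorting puts the longest rows on top.
df['Length'] = pd.Categorical(df['Length'],
                              categories=['Long', 'Medium', 'Short'],
                              ordered=True)

# kind='stable' keeps rows with equal Length in their original order,
# so duplicated() marks everything except the first longest row per Class.
mask = (df.sort_values('Length', kind='stable')
          .duplicated(['Class'])
          .sort_index())

df['Head Teacher'] = df['Head Teacher'].mask(mask).groupby(df['Class']).transform('first')
df['Premium Course'] = df['Premium Course'].mask(mask)
print(df)
```

With the example data this produces the same frame as shown above; the explicit sort kind only matters for the guarantee on tied rows such as the two Science entries.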