Pandas: Change subset of rows that contain duplicate values for a particular column based on values across all duplicates
Question
I'm new to Pandas and trying to understand how to modify a subset of rows that have duplicate values for a particular column, with the decision of which rows to change being made based on a conditional check across those duplicates.
Say I have a (contrived) dataframe like so:
   Class    Length  Head Teacher  Premium Course
0  Maths    Medium  Mr. Bloggs    Yes
1  English  Short   Mr. Plum      Yes
2  English  Long    Mrs. Green    Yes
3  English  Medium  Mr. Top       Yes
4  Science  Long    Mrs. Blue     Yes
5  Science  Long    Mr. Red       Yes
6  ...
Wherever there are duplicate classes, I want to replace the Head Teacher across all the duplicates with the Head Teacher from the longest class, and remove the Premium Course value from every duplicate that is not the longest class. If the duplicate classes are all the same length, then simply take the teacher from the first duplicate and keep the Premium Course value only on that first duplicate, i.e.:
   Class    Length  Head Teacher  Premium Course
0  Maths    Medium  Mr. Bloggs    Yes
1  English  Short   Mrs. Green
2  English  Long    Mrs. Green    Yes
3  English  Medium  Mrs. Green
4  Science  Long    Mrs. Blue     Yes
5  Science  Long    Mrs. Blue
6  ...
In plain Python I would typically use loops and conditional statements and build a new list in memory, but I'm trying to determine the best approach in pandas.
I've been looking at the duplicated and groupby functions but have been unable to land on a solution. Any advice would be appreciated. I'm trying to make the shift into thinking in a "vectorized" way.
Answer 1
Score: 1
Example Code
import pandas as pd
data1 = {'Class': ['Maths', 'English', 'English', 'English', 'Science', 'Science'],
         'Length': ['Medium', 'Short', 'Long', 'Medium', 'Long', 'Long'],
         'Head Teacher': ['Mr. Bloggs', 'Mr. Plum', 'Mrs. Green', 'Mr. Top', 'Mrs. Blue', 'Mr. Red'],
         'Premium Course': ['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes']}
df = pd.DataFrame(data1)
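Step 1
Build the condition: for each class, flag the first row with the greatest Length.
# Rank the lengths so they can be compared numerically.
m = {'Short': 0, 'Medium': 1, 'Long': 2}
# idxmax returns the first index of the maximum, so ties go to the first duplicate.
cond = df.groupby('Class')['Length'].transform(lambda x: x.index == x.map(m).idxmax())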
cond
0 True
1 False
2 True
3 False
4 True
5 False
Name: Length, dtype: bool
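The same condition can also be built without a lambda. The following is only a sketch under the assumption that the index is unique and that every Length value appears in the mapping; the names rank and winners are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Class': ['Maths', 'English', 'English', 'English', 'Science', 'Science'],
    'Length': ['Medium', 'Short', 'Long', 'Medium', 'Long', 'Long'],
})
m = {'Short': 0, 'Medium': 1, 'Long': 2}

# Numeric rank of each row's length.
rank = df['Length'].map(m)

# For each class, the index label of the first row holding the highest rank.
winners = rank.groupby(df['Class']).idxmax()

# True on the winning row of each class, False everywhere else.
cond = pd.Series(df.index.isin(winners), index=df.index)
print(cond.tolist())  # [True, False, True, False, True, False]
```

Because idxmax reports the first occurrence of the maximum, the two equally long Science rows resolve to row 4, which matches the tie-breaking rule in the question.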
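Step 2
Edit the columns: keep each value only on the flagged row, then copy that row's Head Teacher to the rest of its class.
# transform('first') takes the first non-NaN value per class, i.e. the flagged row's teacher.
# It is used instead of .ffill().bfill() because an ungrouped bfill can pull a value from another class when a class's rows are not adjacent.
df['Head Teacher'] = df['Head Teacher'].where(cond).groupby(df['Class']).transform('first')
df['Premium Course'] = df['Premium Course'].where(cond)
df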
   Class    Length  Head Teacher  Premium Course
0  Maths    Medium  Mr. Bloggs    Yes
1  English  Short   Mrs. Green    NaN
2  English  Long    Mrs. Green    Yes
3  English  Medium  Mrs. Green    NaN
4  Science  Long    Mrs. Blue     Yes
5  Science  Long    Mrs. Blue     NaN
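The desired output in the question shows blank cells where this result has NaN. As a rough sketch, both steps could be wrapped in a helper that fills the removed Premium Course values with an empty string. The function name keep_longest and the constant LENGTH_RANK are hypothetical, and the empty-string fill is an assumption about what "remove" should mean. The sketch uses transform('first') to spread the teacher, so the rows of a class do not need to be adjacent; it assumes a unique index.

```python
import pandas as pd

# Assumed ordering of the Length labels (hypothetical constant name).
LENGTH_RANK = {'Short': 0, 'Medium': 1, 'Long': 2}

def keep_longest(df):
    """Return a copy where each class keeps the teacher of its longest row."""
    out = df.copy()
    rank = out['Length'].map(LENGTH_RANK)
    # First row with the highest rank in each class.
    winners = rank.groupby(out['Class']).idxmax()
    cond = pd.Series(out.index.isin(winners), index=out.index)
    # Blank out non-winning teachers, then spread the winner within each class.
    out['Head Teacher'] = (out['Head Teacher'].where(cond)
                           .groupby(out['Class']).transform('first'))
    # Assumption: an empty string stands in for a removed value.
    out['Premium Course'] = out['Premium Course'].where(cond, '')
    return out

df = pd.DataFrame({
    'Class': ['Maths', 'English', 'English', 'English', 'Science', 'Science'],
    'Length': ['Medium', 'Short', 'Long', 'Medium', 'Long', 'Long'],
    'Head Teacher': ['Mr. Bloggs', 'Mr. Plum', 'Mrs. Green', 'Mr. Top', 'Mrs. Blue', 'Mr. Red'],
    'Premium Course': ['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes'],
})
result = keep_longest(df)
print(result)
```

Working on a copy leaves the caller's DataFrame untouched, which makes the helper easier to test against the original data.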
Answer 2
Score: 1
Use an ordered Categorical for the Length column so that a mask can be built with DataFrame.sort_values and DataFrame.duplicated; DataFrame.sort_index then restores the original row order. Series.mask sets the non-matching values to NaN, and GroupBy.transform with 'first' takes the first non-NaN value in each class:
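# Listing 'Long' first makes the longest class sort to the top.
df['Length'] = pd.Categorical(df['Length'],
                              categories=['Long', 'Medium', 'Short'],
                              ordered=True)
# A stable sort keeps equal lengths in their original order, so ties go to the first duplicate.
mask = df.sort_values('Length', kind='stable').duplicated(['Class']).sort_index()
df['Head Teacher'] = df['Head Teacher'].mask(mask).groupby(df['Class']).transform('first')
df['Premium Course'] = df['Premium Course'].mask(mask)
print(df)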
   Class    Length  Head Teacher  Premium Course
0  Maths    Medium  Mr. Bloggs    Yes
1  English  Short   Mrs. Green    NaN
2  English  Long    Mrs. Green    Yes
3  English  Medium  Mrs. Green    NaN
4  Science  Long    Mrs. Blue     Yes
5  Science  Long    Mrs. Blue     NaN
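To see how the mask comes about, it may help to look at the intermediate sorted frame. This is an illustrative sketch on the same sample data; the variable name ordered is not part of the original answer, and kind='stable' is an assumption made here so that equally long rows keep their original relative order.

```python
import pandas as pd

df = pd.DataFrame({
    'Class': ['Maths', 'English', 'English', 'English', 'Science', 'Science'],
    'Length': ['Medium', 'Short', 'Long', 'Medium', 'Long', 'Long'],
})
df['Length'] = pd.Categorical(df['Length'],
                              categories=['Long', 'Medium', 'Short'],
                              ordered=True)

# Sorting follows the category order, so 'Long' rows come first.
ordered = df.sort_values('Length', kind='stable')
print(ordered.index.tolist())  # [2, 4, 5, 0, 3, 1]

# In that order, every row after the first of its class counts as a duplicate.
mask = ordered.duplicated(['Class']).sort_index()
print(mask.tolist())  # [False, True, False, True, False, True]
```

The mask is True exactly on the rows whose values should be replaced or removed, which is why the answer passes it to Series.mask rather than Series.where.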