英文:
Fill columns based on groups and conditions in another column
问题
```plaintext
考虑以下的Pandas `DataFrame`
我试图根据`Make`更新`Code`列。如果`Code`列的值是`None`,则必须根据相同`Make`的其他`Code`列值正确填充它。换句话说,如果任何`Make`在`Code`列中定义了一个值,那么该值应该用于填充`Code`列中的`None`值,而且如果在相同的`Make`中将`BG`或`_BG`附加到任何`Code`值,那么所有`Code`值都应该附加上相应的`BG`或`_BG`,其中`BG`优先于`_BG`。
由于`BMW`已经存在的`BMW`代码值中没有`BG`或`_BG`,所以在替换`None`时,它不会附加`BG`或`_BG`。对于`Ford`,对于其中一个值,`_BG`存在于一个值中,对于另一个值,`FRD`存在,因此`Ford`的所有`Code`值应该是`FRD_BG`。
对于Mercedes,存在带有`BG`附加到Code的`Code`值,因此如果任何`Code`值具有在`BG`前加`_`,则应从`Code`值中删除`_`。
类似地,对于Jeep,我期望输出对于所有`Jeep`的`Make`是`JeepBG`,因为如果`BG`和`_BG`同时出现,`BG`优先于`_BG`。
每个制造商在组中肯定会指定一个代码值。
我的解决方案如下:
输出中对于所有的Jeep Code,`_BG`被填充,但应该是`JeepBG`。
期望的输出是:
```plaintext
英文:
Consider the below Pandas DataFrame
df = pd.DataFrame({'Make': ['Tesla','Tesla','Tesla','Toyota','Ford','Ford','Ford','BMW','BMW','BMW','Mercedes','Mercedes','Mercedes','Jeep','Jeep','Jeep'],
'Type': ['Model X','Model X','Model X','Corolla','Bronco','Bronco','Mustang','3 Series','3 Series','7 Series','C-Class','C-Class','S-Class','Wrangler','Compass','Patriot'],
'Year': [2015, 2015, 2015, 2017, 2018, 2018, 2020, 2015, 2015, 2017, 2018, 2018, 2020,2020,2021,2020],
'Price': [85000, 90000, 95000, 20000, 35000, 35000, 45000, 40000, 40000, 65000, 50000, 50000, 75000,60000,45000,40000],
'Color': ['White','White','White','Red','Blue','Blue','Yellow','Silver','Silver','Black','White','White','Black','Grey','Brown','Green'],
'Code' : ['TSLABG','TSLA',None,'TYTA','FRD','_BG',None,None,'BMW','BMW','MercedesBG','Mercedes_BG','MercedesBG',None,'_BG','JeepBG']
})
df
Make Type Year Price Color Code
0 Tesla Model X 2015 85000 White TSLABG
1 Tesla Model X 2015 90000 White TSLA
2 Tesla Model X 2015 95000 White None
3 Toyota Corolla 2017 20000 Red TYTA
4 Ford Bronco 2018 35000 Blue FRD
5 Ford Bronco 2018 35000 Blue _BG
6 Ford Mustang 2020 45000 Yellow None
7 BMW 3 Series 2015 40000 Silver None
8 BMW 3 Series 2015 40000 Silver BMW
9 BMW 7 Series 2017 65000 Black BMW
10 Mercedes C-Class 2018 50000 White MercedesBG
11 Mercedes C-Class 2018 50000 White Mercedes_BG
12 Mercedes S-Class 2020 75000 Black MercedesBG
13 Jeep Wrangler 2020 60000 Grey None
14 Jeep Compass 2021 45000 Brown _BG
15 Jeep Patriot 2020 40000 Green JeepBG
I am trying to update the Code
Column based on the Make
. If Code
column is having None
, it must be correctly filled based on the other values of Code
column for the same Make
. In other words if any Make
is having Code
defined in the Code
column that value should be used to fill None
value in the Code
column, also if BG
or _BG
is appended to any code value of the same Make
, all the Code
values should be appended with the BG
or _BG
respectively for the same Make
with BG
taking precedence over _BG
Since BMW
is having no BG
or _BG
for already existing BMW
code values, when replacing the None
it doesn't get appended with BG
or _BG
. For Ford _BG
is present for one of the values and FRD
is present for another value so all Code
values of Ford
should be FRD_BG
.
For Mercedes there are Code
values with BG
appended to the Code, so if any Code
value is having _
prepended to the BG
, the _
should be removed from the Code
Value
Similarly for Jeep
I am expecting the output to be JeepBG
for all the Jeep
Make
since if 'BG' and '_BG' comes up BG
takes priority over _BG
Each make would definitely have a code value specified in one of the entries of the group.
The solution I have tried
code = (df['Code'].str.split('(BG|_BG)', expand=True).add_prefix('part').replace('-', None).groupby(df['Make']).transform('first').fillna('').agg(''.join, axis=1))
df['Code'] = code
df
The output I got has _BG
being populated for all the Jeep
Code
and it should've been JeepBG
.
Make Type Year Price Color Code
0 Tesla Model X 2015 85000 White TSLABG
1 Tesla Model X 2015 90000 White TSLABG
2 Tesla Model X 2015 95000 White TSLABG
3 Toyota Corolla 2017 20000 Red TYTA
4 Ford Bronco 2018 35000 Blue FRD_BG
5 Ford Bronco 2018 35000 Blue FRD_BG
6 Ford Mustang 2020 45000 Yellow FRD_BG
7 BMW 3 Series 2015 40000 Silver BMW
8 BMW 3 Series 2015 40000 Silver BMW
9 BMW 7 Series 2017 65000 Black BMW
10 Mercedes C-Class 2018 50000 White MercedesBG
11 Mercedes C-Class 2018 50000 White MercedesBG
12 Mercedes S-Class 2020 75000 Black MercedesBG
13 Jeep Wrangler 2020 60000 Grey _BG
14 Jeep Compass 2021 45000 Brown _BG
15 Jeep Patriot 2020 40000 Green _BG
The expected output is:
Make Type Year Price Color Code
0 Tesla Model X 2015 85000 White TSLABG
1 Tesla Model X 2015 90000 White TSLABG
2 Tesla Model X 2015 95000 White TSLABG
3 Toyota Corolla 2017 20000 Red TYTA
4 Ford Bronco 2018 35000 Blue FRD_BG
5 Ford Bronco 2018 35000 Blue FRD_BG
6 Ford Mustang 2020 45000 Yellow FRD_BG
7 BMW 3 Series 2015 40000 Silver BMW
8 BMW 3 Series 2015 40000 Silver BMW
9 BMW 7 Series 2017 65000 Black BMW
10 Mercedes C-Class 2018 50000 White MercedesBG
11 Mercedes C-Class 2018 50000 White MercedesBG
12 Mercedes S-Class 2020 75000 Black MercedesBG
13 Jeep Wrangler 2020 60000 Grey JeepBG
14 Jeep Compass 2021 45000 Brown JeepBG
15 Jeep Patriot 2020 40000 Green JeepBG
答案1
得分: 1
Let's do this in steps, since there are many requirements:
First, let's define columns is_bg
and isbg
, which will aid us throughout the process:
grouper = df.groupby('Make')['Code']
df['is_bg'] = grouper.transform(lambda s: s.str.endswith('_BG').astype(bool))
df['isbg'] = grouper.transform(lambda s: s.str.endswith('BG')
& ~s.str.endswith('_BG').astype(bool))
Now, let's remove _BG
and BG
suffixes so that we can find the unique identifier for each Code
:
df.loc[df.isbg, 'Code'] = df.loc[df.isbg, 'Code'].str[:-2]
df.loc[df.is_bg, 'Code'] = df.loc[df.is_bg, 'Code'].str[:-3]
Now, we create a mapper and assign:
mapper = df.groupby('Make').apply(lambda s:
# Retrieve the unique code
s.loc展开收缩.str.len() > 1,'Code'].iloc[0] +
# Append BG if `isbg`
('BG' if s.isbg.any() else
# Otherwise, append _BG if `is_bg`
('_BG' if s.is_bg.any() else '')))
df['New_Code'] = df['Make'].map(mapper)
Notice that there's an assumption here that for each Make
, we'll only have one entry in Code
that, after having _BG
and BG
removed, will NOT be either whitespace or None
. If this is not true, then the problem is ambiguous, and you'll need to explore how to find the unique Code identifier.
英文:
Let's do this in steps, since there are many requirements
First, let's define columns is_bg
and isbg
, which will aid us throughout the process:
grouper = df.groupby('Make')['Code']
df['is_bg'] = grouper.transform(lambda s: s.str.endswith('_BG').astype(bool))
df['isbg'] = grouper.transform(lambda s: s.str.endswith('BG')
& ~s.str.endswith('_BG').astype(bool))
Now, let's remove _BG
and BG
suffixes so that we can find the unique identifier for each Code
:
df.loc[df.isbg, 'Code'] = df.loc[df.isbg, 'Code'].str[:-2]
df.loc[df.is_bg, 'Code'] = df.loc[df.is_bg, 'Code'].str[:-3]
Now, we create a mapper and assign:
mapper = df.groupby('Make').apply(lambda s:
# Retrieve the unique code
s.loc展开收缩.str.len() > 1,'Code'].iloc[0] +
# Append BG if `isbg`
('BG' if s.isbg.any() else
# Otherwise, append _BG if `is_bg`
('_BG' if s.is_bg.any() else '')))
df['New_Code'] = df['Make'].map(mapper)
Notice that there's an assumption here that for each Make
, we'll only have one entry in Code
that, after having _BG
and BG
removed, will NOT be either whitespace or None
. If this is not true, then the problem is ambiguous and you'll need to explore how to find the unique Code identifier.
Make Type Year Price Code isbg is_bg New_Code
0 Tesla Model X 2015 85000 TSLA True False TSLABG
1 Tesla Model X 2015 90000 TSLA False False TSLABG
2 Tesla Model X 2015 95000 None False False TSLABG
3 Toyota Corolla 2017 20000 TYTA False False TYTA
4 Ford Bronco 2018 35000 FRD False False FRD_BG
5 Ford Bronco 2018 35000 False True FRD_BG
6 Ford Mustang 2020 45000 None False False FRD_BG
7 BMW 3 Series 2015 40000 None False False BMW
8 BMW 3 Series 2015 40000 BMW False False BMW
9 BMW 7 Series 2017 65000 BMW False False BMW
10 Mercedes C-Class 2018 50000 Mercedes True False MercedesBG
11 Mercedes C-Class 2018 50000 Mercedes False True MercedesBG
12 Mercedes S-Class 2020 75000 Mercedes True False MercedesBG
13 Jeep Wrangler 2020 60000 None False False JeepBG
14 Jeep Compass 2021 45000 False True JeepBG
15 Jeep Patriot 2020 40000 Jeep True False JeepBG
答案2
得分: 1
remove _BG and BG / fill value except BG
s = df['Code'].str.replace(r'[_]?BG', '', regex=True).replace({'': None}).groupby(df['Make']).transform('first')
s
0 TSLA
1 TSLA
2 TSLA
3 TYTA
4 FRD
5 FRD
6 FRD
7 BMW
8 BMW
9 BMW
10 Mercedes
11 Mercedes
12 Mercedes
13 Jeep
14 Jeep
15 Jeep
extract BG and _BG and so on..
df1 = df['Code'].str.extract(r'[^_]+(BG)|(_BG)').groupby(df['Make']).ffill().groupby(df['Make']).bfill()
df1
0 1
0 BG NaN
1 BG NaN
2 BG NaN
3 NaN NaN
4 NaN _BG
5 NaN _BG
6 NaN _BG
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 BG _BG
11 BG _BG
12 BG _BG
13 BG _BG
14 BG _BG
15 BG _BG
fillna df1 and string concat
df.assign(Code=s.str.cat(df1[0].fillna(df1[1]).fillna('')))
out
Make Type Year Price Color Code
0 Tesla Model X 2015 85000 White TSLABG
1 Tesla Model X 2015 90000 White TSLABG
2 Tesla Model X 2015 95000 White TSLABG
3 Toyota Corolla 2017 20000 Red TYTA
4 Ford Bronco 2018 35000 Blue FRD_BG
5 Ford Bronco 2018 35000 Blue FRD_BG
6 Ford Mustang 2020 45000 Yellow FRD_BG
7 BMW 3 Series 2015 40000 Silver BMW
8 BMW 3 Series 2015 40000 Silver BMW
9 BMW 7 Series 2017 65000 Black BMW
10 Mercedes C-Class 2018 50000 White MercedesBG
11 Mercedes C-Class 2018 50000 White MercedesBG
12 Mercedes S-Class 2020 75000 Black MercedesBG
13 Jeep Wrangler 2020 60000 Grey JeepBG
14 Jeep Compass 2021 45000 Brown JeepBG
15 Jeep Patriot 2020 40000 Green JeepBG
英文:
remove _BG and BG / fill value except BG
s = df['Code'].str.replace(r'[_]?BG', '', regex=True).replace({'': None}).groupby(df['Make']).transform('first')
s
0 TSLA
1 TSLA
2 TSLA
3 TYTA
4 FRD
5 FRD
6 FRD
7 BMW
8 BMW
9 BMW
10 Mercedes
11 Mercedes
12 Mercedes
13 Jeep
14 Jeep
15 Jeep
extract BG and _BG and so on..
df1 = df['Code'].str.extract(r'[^_]+(BG)|(_BG)').groupby(df['Make']).ffill().groupby(df['Make']).bfill()
df1
0 1
0 BG NaN
1 BG NaN
2 BG NaN
3 NaN NaN
4 NaN _BG
5 NaN _BG
6 NaN _BG
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 BG _BG
11 BG _BG
12 BG _BG
13 BG _BG
14 BG _BG
15 BG _BG
fillna df1 and string concat
df.assign(Code=s.str.cat(df1[0].fillna(df1[1]).fillna('')))
out
Make Type Year Price Color Code
0 Tesla Model X 2015 85000 White TSLABG
1 Tesla Model X 2015 90000 White TSLABG
2 Tesla Model X 2015 95000 White TSLABG
3 Toyota Corolla 2017 20000 Red TYTA
4 Ford Bronco 2018 35000 Blue FRD_BG
5 Ford Bronco 2018 35000 Blue FRD_BG
6 Ford Mustang 2020 45000 Yellow FRD_BG
7 BMW 3 Series 2015 40000 Silver BMW
8 BMW 3 Series 2015 40000 Silver BMW
9 BMW 7 Series 2017 65000 Black BMW
10 Mercedes C-Class 2018 50000 White MercedesBG
11 Mercedes C-Class 2018 50000 White MercedesBG
12 Mercedes S-Class 2020 75000 Black MercedesBG
13 Jeep Wrangler 2020 60000 Grey JeepBG
14 Jeep Compass 2021 45000 Brown JeepBG
15 Jeep Patriot 2020 40000 Green JeepBG
答案3
得分: 0
Building off of Panda Kim's solution:
lhs = df['Code'].str.replace(r'[_]?BG', '', regex=True).replace({'': None}).groupby(df['Make']).transform('first')
rhs = df['Code'].str.extract(r'[^_]+(BG)|(_BG)', expand=False).groupby(df['Make']).ffill().groupby(df['Make']).bfill()
rhs = rhs[0].fillna((rhs[1]).fillna(''))
df['Code'] = lhs + rhs
英文:
Building off of Panda Kim's solution
lhs = df['Code'].str.replace(r'[_]?BG', '', regex=True).replace({'': None}).groupby(df['Make']).transform('first')
rhs = df['Code'].str.extract(r'[^_]+(BG)|(_BG)',expand =False).groupby(df['Make']).ffill().groupby(df['Make']).bfill()
rhs = rhs[0].fillna((rhs[1]).fillna(''))
df['Code'] = lhs + rhs
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论