根据另一列中的分组和条件填充列。

huangapple go评论72阅读模式
英文:

Fill columns based on groups and conditions in another column

问题

```plaintext
考虑以下的Pandas `DataFrame`

我试图根据`Make`更新`Code`列。如果`Code`列的值是`None`,则必须根据相同`Make`的其他`Code`列值正确填充它。换句话说,如果任何`Make`在`Code`列中定义了一个值,那么该值应该用于填充`Code`列中的`None`值,而且如果在相同的`Make`中将`BG`或`_BG`附加到任何`Code`值,那么所有`Code`值都应该附加上相应的`BG`或`_BG`,其中`BG`优先于`_BG`。

由于`BMW`已经存在的`BMW`代码值中没有`BG`或`_BG`,所以在替换`None`时,它不会附加`BG`或`_BG`。对于`Ford`,对于其中一个值,`_BG`存在于一个值中,对于另一个值,`FRD`存在,因此`Ford`的所有`Code`值应该是`FRD_BG`。

对于Mercedes,存在带有`BG`附加到Code的`Code`值,因此如果任何`Code`值具有在`BG`前加`_`,则应从`Code`值中删除`_`。

类似地,对于Jeep,我期望输出对于所有`Jeep`的`Make`是`JeepBG`,因为如果`BG`和`_BG`同时出现,`BG`优先于`_BG`。

每个制造商在组中肯定会指定一个代码值。

我的解决方案如下:

输出中对于所有的Jeep Code,`_BG`被填充,但应该是`JeepBG`。

期望的输出是:
```plaintext
英文:

Consider the below Pandas DataFrame

df = pd.DataFrame({'Make': ['Tesla','Tesla','Tesla','Toyota','Ford','Ford','Ford','BMW','BMW','BMW','Mercedes','Mercedes','Mercedes','Jeep','Jeep','Jeep'],
                   'Type': ['Model X','Model X','Model X','Corolla','Bronco','Bronco','Mustang','3 Series','3 Series','7 Series','C-Class','C-Class','S-Class','Wrangler','Compass','Patriot'],
                   'Year': [2015, 2015, 2015, 2017, 2018, 2018, 2020, 2015, 2015, 2017, 2018, 2018, 2020,2020,2021,2020],
                   'Price': [85000, 90000, 95000, 20000, 35000, 35000, 45000, 40000, 40000, 65000, 50000, 50000, 75000,60000,45000,40000],
                   'Color': ['White','White','White','Red','Blue','Blue','Yellow','Silver','Silver','Black','White','White','Black','Grey','Brown','Green'],
                   'Code'  : ['TSLABG','TSLA',None,'TYTA','FRD','_BG',None,None,'BMW','BMW','MercedesBG','Mercedes_BG','MercedesBG',None,'_BG','JeepBG']
                  })
df
     Make	Type	   Year	Price   Color	Code
0	Tesla   Model X	   2015	85000	White	TSLABG
1	Tesla   Model X    2015	90000	White	TSLA
2	Tesla	Model X	   2015	95000	White	None
3   Toyota	Corolla	   2017	20000	Red	    TYTA
4	 Ford	Bronco	   2018	35000	Blue	FRD 
5	 Ford	Bronco 	   2018	35000	Blue    _BG
6	 Ford	Mustang	   2020	45000	Yellow	None
7	  BMW   3 Series   2015	40000	Silver	None
8	  BMW   3 Series   2015	40000	Silver	BMW
9	  BMW	7 Series   2017	65000	Black	BMW
10 Mercedes	C-Class	   2018	50000	White	MercedesBG
11 Mercedes	C-Class	   2018	50000	White	Mercedes_BG
12 Mercedes	S-Class	   2020	75000	Black	MercedesBG
13   Jeep   Wrangler   2020 60000   Grey    None
14   Jeep   Compass    2021 45000   Brown   _BG
15   Jeep   Patriot    2020 40000   Green   JeepBG

I am trying to update the Code Column based on the Make. If Code column is having None, it must be correctly filled based on the other values of Code column for the same Make. In other words if any Make is having Code defined in the Code column that value should be used to fill None value in the Code column, also if BG or _BG is appended to any code value of the same Make, all the Code values should be appended with the BG or _BG respectively for the same Make with BG taking precedence over _BG

Since BMW is having no BG or _BG for already existing BMW code values, when replacing the None it doesn't get appended with BG or _BG. For Ford _BG is present for one of the values and FRD is present for another value so all Code values of Ford should be FRD_BG.

For Mercedes there are Code values with BG appended to the Code, so if any Code value is having _ prepended to the BG, the _ should be removed from the Code Value

Similarly for Jeep I am expecting the output to be JeepBG for all the Jeep Make since if 'BG' and '_BG' comes up BG takes priority over _BG

Each make would definitely have a code value specified in one of the entries of the group.

The solution I have tried

code = (df['Code'].str.split('(BG|_BG)', expand=True).add_prefix('part').replace('-', None).groupby(df['Make']).transform('first').fillna('').agg(''.join, axis=1))
df['Code'] = code
df

The output I got has _BG being populated for all the Jeep Code and it should've been JeepBG.

     Make	Type	   Year	Price   Color	Code
0	Tesla   Model X	   2015	85000	White	TSLABG
1	Tesla   Model X    2015	90000	White	TSLABG
2	Tesla	Model X	   2015	95000	White	TSLABG
3   Toyota	Corolla	   2017	20000	Red	    TYTA
4	 Ford	Bronco	   2018	35000	Blue    FRD_BG
5	 Ford	Bronco 	   2018	35000	Blue	FRD_BG
6	 Ford	Mustang	   2020	45000	Yellow	FRD_BG
7	  BMW   3 Series   2015	40000	Silver	BMW
8	  BMW   3 Series   2015	40000	Silver	BMW
9	  BMW	7 Series   2017	65000	Black	BMW
10 Mercedes	C-Class	   2018	50000	White	MercedesBG
11 Mercedes	C-Class	   2018	50000	White	MercedesBG
12 Mercedes	S-Class	   2020	75000	Black	MercedesBG
13   Jeep   Wrangler   2020 60000   Grey    _BG
14   Jeep   Compass    2021 45000   Brown   _BG
15   Jeep   Patriot    2020 40000   Green   _BG

The expected output is:

     Make	Type	   Year	Price   Color	Code
0	Tesla   Model X	   2015	85000	White	TSLABG
1	Tesla   Model X    2015	90000	White	TSLABG
2	Tesla	Model X	   2015	95000	White	TSLABG
3   Toyota	Corolla	   2017	20000	Red	    TYTA
4	 Ford	Bronco	   2018	35000	Blue    FRD_BG
5	 Ford	Bronco 	   2018	35000	Blue	FRD_BG
6	 Ford	Mustang	   2020	45000	Yellow	FRD_BG
7	  BMW   3 Series   2015	40000	Silver	BMW
8	  BMW   3 Series   2015	40000	Silver	BMW
9	  BMW	7 Series   2017	65000	Black	BMW
10 Mercedes	C-Class	   2018	50000	White	MercedesBG
11 Mercedes	C-Class	   2018	50000	White	MercedesBG
12 Mercedes	S-Class	   2020	75000	Black	MercedesBG
13   Jeep   Wrangler   2020 60000   Grey    JeepBG
14   Jeep   Compass    2021 45000   Brown   JeepBG
15   Jeep   Patriot    2020 40000   Green   JeepBG

答案1

得分: 1

Let's do this in steps, since there are many requirements:

First, let's define columns is_bg and isbg, which will aid us throughout the process:

grouper = df.groupby('Make')['Code']
df['is_bg'] = grouper.transform(lambda s: s.str.endswith('_BG').astype(bool))
df['isbg'] = grouper.transform(lambda s: s.str.endswith('BG') 
                               & ~s.str.endswith('_BG').astype(bool))

Now, let's remove _BG and BG suffixes so that we can find the unique identifier for each Code:

df.loc[df.isbg, 'Code'] = df.loc[df.isbg, 'Code'].str[:-2]
df.loc[df.is_bg, 'Code'] = df.loc[df.is_bg, 'Code'].str[:-3]

Now, we create a mapper and assign:

mapper = df.groupby('Make').apply(lambda s:
                               # Retrieve the unique code
                               s.loc
展开收缩
.str.len() > 1,'Code'].iloc[0] +
# Append BG if `isbg` ('BG' if s.isbg.any() else # Otherwise, append _BG if `is_bg` ('_BG' if s.is_bg.any() else ''))) df['New_Code'] = df['Make'].map(mapper)

Notice that there's an assumption here that for each Make, we'll only have one entry in Code that, after having _BG and BG removed, will NOT be either whitespace or None. If this is not true, then the problem is ambiguous, and you'll need to explore how to find the unique Code identifier.

英文:

Let's do this in steps, since there are many requirements

First, let's define columns is_bg and isbg, which will aid us throughout the process:

grouper = df.groupby('Make')['Code']
df['is_bg'] = grouper.transform(lambda s: s.str.endswith('_BG').astype(bool))
df['isbg']  = grouper.transform(lambda s: s.str.endswith('BG') 
                                   & ~s.str.endswith('_BG').astype(bool))

Now, let's remove _BG and BG suffixes so that we can find the unique identifier for each Code:

df.loc[df.isbg,  'Code'] = df.loc[df.isbg,  'Code'].str[:-2]
df.loc[df.is_bg, 'Code'] = df.loc[df.is_bg, 'Code'].str[:-3]

Now, we create a mapper and assign:

mapper = df.groupby('Make').apply(lambda s:
                           # Retrieve the unique code
                           s.loc
展开收缩
.str.len() > 1,'Code'].iloc[0] + # Append BG if `isbg` ('BG' if s.isbg.any() else # Otherwise, append _BG if `is_bg` ('_BG' if s.is_bg.any() else ''))) df['New_Code'] = df['Make'].map(mapper)

Notice that there's an assumption here that for each Make, we'll only have one entry in Code that, after having _BG and BG removed, will NOT be either whitespace or None. If this is not true, then the problem is ambiguous and you'll need to explore how to find the unique Code identifier.


        Make      Type  Year  Price      Code   isbg  is_bg    New_Code
0      Tesla   Model X  2015  85000      TSLA   True  False      TSLABG
1      Tesla   Model X  2015  90000      TSLA  False  False      TSLABG
2      Tesla   Model X  2015  95000      None  False  False      TSLABG
3     Toyota   Corolla  2017  20000      TYTA  False  False        TYTA
4       Ford    Bronco  2018  35000       FRD  False  False      FRD_BG
5       Ford    Bronco  2018  35000            False   True      FRD_BG
6       Ford   Mustang  2020  45000      None  False  False      FRD_BG
7        BMW  3 Series  2015  40000      None  False  False         BMW
8        BMW  3 Series  2015  40000       BMW  False  False         BMW
9        BMW  7 Series  2017  65000       BMW  False  False         BMW
10  Mercedes   C-Class  2018  50000  Mercedes   True  False  MercedesBG
11  Mercedes   C-Class  2018  50000  Mercedes  False   True  MercedesBG
12  Mercedes   S-Class  2020  75000  Mercedes   True  False  MercedesBG
13      Jeep  Wrangler  2020  60000      None  False  False      JeepBG
14      Jeep   Compass  2021  45000            False   True      JeepBG
15      Jeep   Patriot  2020  40000      Jeep   True  False      JeepBG

答案2

得分: 1

remove _BG and BG / fill value except BG

s = df['Code'].str.replace(r'[_]?BG', '', regex=True).replace({'': None}).groupby(df['Make']).transform('first')

s

0         TSLA
1         TSLA
2         TSLA
3         TYTA
4          FRD
5          FRD
6          FRD
7          BMW
8          BMW
9          BMW
10    Mercedes
11    Mercedes
12    Mercedes
13        Jeep
14        Jeep
15        Jeep

extract BG and _BG and so on..

df1 = df['Code'].str.extract(r'[^_]+(BG)|(_BG)').groupby(df['Make']).ffill().groupby(df['Make']).bfill()

df1

    	0	1
0	BG	NaN
1	BG	NaN
2	BG	NaN
3	NaN	NaN
4	NaN	_BG
5	NaN	_BG
6	NaN	_BG
7	NaN	NaN
8	NaN	NaN
9	NaN	NaN
10	BG	_BG
11	BG	_BG
12	BG	_BG
13	BG	_BG
14	BG	_BG
15	BG	_BG

fillna df1 and string concat

df.assign(Code=s.str.cat(df1[0].fillna(df1[1]).fillna('')))

out

    	Make	    Type	    Year	Price	Color	Code
0	Tesla	    Model X	    2015	85000	White	TSLABG
1	Tesla	    Model X	    2015	90000	White	TSLABG
2	Tesla	    Model X	    2015	95000	White	TSLABG
3	Toyota	    Corolla	    2017	20000	Red	    TYTA
4	Ford	    Bronco	    2018	35000	Blue	FRD_BG
5	Ford	    Bronco	    2018	35000	Blue	FRD_BG
6	Ford	    Mustang	    2020	45000	Yellow	FRD_BG
7	BMW	        3 Series	2015	40000	Silver	BMW
8	BMW	        3 Series	2015	40000	Silver	BMW
9	BMW	        7 Series	2017	65000	Black	BMW
10	Mercedes	C-Class	    2018	50000	White	MercedesBG
11	Mercedes	C-Class	    2018	50000	White	MercedesBG
12	Mercedes	S-Class	    2020	75000	Black	MercedesBG
13	Jeep	    Wrangler	2020	60000	Grey	JeepBG
14	Jeep	    Compass	    2021	45000	Brown	JeepBG
15	Jeep	    Patriot	    2020	40000	Green	JeepBG
英文:

remove _BG and BG / fill value except BG

s = df['Code'].str.replace(r'[_]?BG', '', regex=True).replace({'': None}).groupby(df['Make']).transform('first')

s

0         TSLA
1         TSLA
2         TSLA
3         TYTA
4          FRD
5          FRD
6          FRD
7          BMW
8          BMW
9          BMW
10    Mercedes
11    Mercedes
12    Mercedes
13        Jeep
14        Jeep
15        Jeep

extract BG and _BG and so on..

df1 = df['Code'].str.extract(r'[^_]+(BG)|(_BG)').groupby(df['Make']).ffill().groupby(df['Make']).bfill()

df1

	0	1
0	BG	NaN
1	BG	NaN
2	BG	NaN
3	NaN	NaN
4	NaN	_BG
5	NaN	_BG
6	NaN	_BG
7	NaN	NaN
8	NaN	NaN
9	NaN	NaN
10	BG	_BG
11	BG	_BG
12	BG	_BG
13	BG	_BG
14	BG	_BG
15	BG	_BG

fillna df1 and string concat

df.assign(Code=s.str.cat(df1[0].fillna(df1[1]).fillna('')))

out

	Make	    Type	    Year	Price	Color	Code
0	Tesla	    Model X	    2015	85000	White	TSLABG
1	Tesla	    Model X	    2015	90000	White	TSLABG
2	Tesla	    Model X	    2015	95000	White	TSLABG
3	Toyota	    Corolla	    2017	20000	Red	    TYTA
4	Ford	    Bronco	    2018	35000	Blue	FRD_BG
5	Ford	    Bronco	    2018	35000	Blue	FRD_BG
6	Ford	    Mustang	    2020	45000	Yellow	FRD_BG
7	BMW	        3 Series	2015	40000	Silver	BMW
8	BMW	        3 Series	2015	40000	Silver	BMW
9	BMW	        7 Series	2017	65000	Black	BMW
10	Mercedes	C-Class	    2018	50000	White	MercedesBG
11	Mercedes	C-Class	    2018	50000	White	MercedesBG
12	Mercedes	S-Class	    2020	75000	Black	MercedesBG
13	Jeep	    Wrangler	2020	60000	Grey	JeepBG
14	Jeep	    Compass	    2021	45000	Brown	JeepBG
15	Jeep	    Patriot	    2020	40000	Green	JeepBG

答案3

得分: 0

Building off of Panda Kim's solution:

lhs = df['Code'].str.replace(r'[_]?BG', '', regex=True).replace({'': None}).groupby(df['Make']).transform('first')

rhs = df['Code'].str.extract(r'[^_]+(BG)|(_BG)', expand=False).groupby(df['Make']).ffill().groupby(df['Make']).bfill()

rhs = rhs[0].fillna((rhs[1]).fillna(''))

df['Code'] = lhs + rhs
英文:

Building off of Panda Kim's solution

lhs = df['Code'].str.replace(r'[_]?BG', '', regex=True).replace({'': None}).groupby(df['Make']).transform('first')

rhs = df['Code'].str.extract(r'[^_]+(BG)|(_BG)',expand =False).groupby(df['Make']).ffill().groupby(df['Make']).bfill()

rhs = rhs[0].fillna((rhs[1]).fillna(''))

df['Code'] = lhs + rhs

huangapple
  • 本文由 发表于 2023年5月10日 23:32:28
  • 转载请务必保留本文链接:https://go.coder-hub.com/76220259.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定