根据另一列中的分组和条件填充列。

huangapple go评论106阅读模式
英文:

Fill columns based on groups and conditions in another column

问题

  1. ```plaintext
  2. 考虑以下的Pandas `DataFrame`
  3. 我试图根据`Make`更新`Code`列。如果`Code`列的值是`None`,则必须根据相同`Make`的其他`Code`列值正确填充它。换句话说,如果任何`Make`在`Code`列中定义了一个值,那么该值应该用于填充`Code`列中的`None`值,而且如果在相同的`Make`中将`BG`或`_BG`附加到任何`Code`值,那么所有`Code`值都应该附加上相应的`BG`或`_BG`,其中`BG`优先于`_BG`。
  4. 由于`BMW`已经存在的`BMW`代码值中没有`BG`或`_BG`,所以在替换`None`时,它不会附加`BG`或`_BG`。对于`Ford`,对于其中一个值,`_BG`存在于一个值中,对于另一个值,`FRD`存在,因此`Ford`的所有`Code`值应该是`FRD_BG`。
  5. 对于Mercedes,存在带有`BG`附加到Code的`Code`值,因此如果任何`Code`值具有在`BG`前加`_`,则应从`Code`值中删除`_`。
  6. 类似地,对于Jeep,我期望输出对于所有`Jeep`的`Make`是`JeepBG`,因为如果`BG`和`_BG`同时出现,`BG`优先于`_BG`。
  7. 每个制造商在组中肯定会指定一个代码值。
  8. 我的解决方案如下:
  9. 输出中对于所有的Jeep Code,`_BG`被填充,但应该是`JeepBG`。
  10. 期望的输出是:
  11. ```plaintext
英文:

Consider the below Pandas DataFrame

  1. df = pd.DataFrame({'Make': ['Tesla','Tesla','Tesla','Toyota','Ford','Ford','Ford','BMW','BMW','BMW','Mercedes','Mercedes','Mercedes','Jeep','Jeep','Jeep'],
  2. 'Type': ['Model X','Model X','Model X','Corolla','Bronco','Bronco','Mustang','3 Series','3 Series','7 Series','C-Class','C-Class','S-Class','Wrangler','Compass','Patriot'],
  3. 'Year': [2015, 2015, 2015, 2017, 2018, 2018, 2020, 2015, 2015, 2017, 2018, 2018, 2020,2020,2021,2020],
  4. 'Price': [85000, 90000, 95000, 20000, 35000, 35000, 45000, 40000, 40000, 65000, 50000, 50000, 75000,60000,45000,40000],
  5. 'Color': ['White','White','White','Red','Blue','Blue','Yellow','Silver','Silver','Black','White','White','Black','Grey','Brown','Green'],
  6. 'Code' : ['TSLABG','TSLA',None,'TYTA','FRD','_BG',None,None,'BMW','BMW','MercedesBG','Mercedes_BG','MercedesBG',None,'_BG','JeepBG']
  7. })
  1. df
  2. Make Type Year Price Color Code
  3. 0 Tesla Model X 2015 85000 White TSLABG
  4. 1 Tesla Model X 2015 90000 White TSLA
  5. 2 Tesla Model X 2015 95000 White None
  6. 3 Toyota Corolla 2017 20000 Red TYTA
  7. 4 Ford Bronco 2018 35000 Blue FRD
  8. 5 Ford Bronco 2018 35000 Blue _BG
  9. 6 Ford Mustang 2020 45000 Yellow None
  10. 7 BMW 3 Series 2015 40000 Silver None
  11. 8 BMW 3 Series 2015 40000 Silver BMW
  12. 9 BMW 7 Series 2017 65000 Black BMW
  13. 10 Mercedes C-Class 2018 50000 White MercedesBG
  14. 11 Mercedes C-Class 2018 50000 White Mercedes_BG
  15. 12 Mercedes S-Class 2020 75000 Black MercedesBG
  16. 13 Jeep Wrangler 2020 60000 Grey None
  17. 14 Jeep Compass 2021 45000 Brown _BG
  18. 15 Jeep Patriot 2020 40000 Green JeepBG

I am trying to update the Code Column based on the Make. If Code column is having None, it must be correctly filled based on the other values of Code column for the same Make. In other words if any Make is having Code defined in the Code column that value should be used to fill None value in the Code column, also if BG or _BG is appended to any code value of the same Make, all the Code values should be appended with the BG or _BG respectively for the same Make with BG taking precedence over _BG

Since BMW is having no BG or _BG for already existing BMW code values, when replacing the None it doesn't get appended with BG or _BG. For Ford _BG is present for one of the values and FRD is present for another value so all Code values of Ford should be FRD_BG.

For Mercedes there are Code values with BG appended to the Code, so if any Code value is having _ prepended to the BG, the _ should be removed from the Code Value

Similarly for Jeep I am expecting the output to be JeepBG for all the Jeep Make since if 'BG' and '_BG' comes up BG takes priority over _BG

Each make would definitely have a code value specified in one of the entries of the group.

The solution I have tried

  1. code = (df['Code'].str.split('(BG|_BG)', expand=True).add_prefix('part').replace('-', None).groupby(df['Make']).transform('first').fillna('').agg(''.join, axis=1))
  2. df['Code'] = code
  3. df

The output I got has _BG being populated for all the Jeep Code and it should've been JeepBG.

  1. Make Type Year Price Color Code
  2. 0 Tesla Model X 2015 85000 White TSLABG
  3. 1 Tesla Model X 2015 90000 White TSLABG
  4. 2 Tesla Model X 2015 95000 White TSLABG
  5. 3 Toyota Corolla 2017 20000 Red TYTA
  6. 4 Ford Bronco 2018 35000 Blue FRD_BG
  7. 5 Ford Bronco 2018 35000 Blue FRD_BG
  8. 6 Ford Mustang 2020 45000 Yellow FRD_BG
  9. 7 BMW 3 Series 2015 40000 Silver BMW
  10. 8 BMW 3 Series 2015 40000 Silver BMW
  11. 9 BMW 7 Series 2017 65000 Black BMW
  12. 10 Mercedes C-Class 2018 50000 White MercedesBG
  13. 11 Mercedes C-Class 2018 50000 White MercedesBG
  14. 12 Mercedes S-Class 2020 75000 Black MercedesBG
  15. 13 Jeep Wrangler 2020 60000 Grey _BG
  16. 14 Jeep Compass 2021 45000 Brown _BG
  17. 15 Jeep Patriot 2020 40000 Green _BG

The expected output is:

  1. Make Type Year Price Color Code
  2. 0 Tesla Model X 2015 85000 White TSLABG
  3. 1 Tesla Model X 2015 90000 White TSLABG
  4. 2 Tesla Model X 2015 95000 White TSLABG
  5. 3 Toyota Corolla 2017 20000 Red TYTA
  6. 4 Ford Bronco 2018 35000 Blue FRD_BG
  7. 5 Ford Bronco 2018 35000 Blue FRD_BG
  8. 6 Ford Mustang 2020 45000 Yellow FRD_BG
  9. 7 BMW 3 Series 2015 40000 Silver BMW
  10. 8 BMW 3 Series 2015 40000 Silver BMW
  11. 9 BMW 7 Series 2017 65000 Black BMW
  12. 10 Mercedes C-Class 2018 50000 White MercedesBG
  13. 11 Mercedes C-Class 2018 50000 White MercedesBG
  14. 12 Mercedes S-Class 2020 75000 Black MercedesBG
  15. 13 Jeep Wrangler 2020 60000 Grey JeepBG
  16. 14 Jeep Compass 2021 45000 Brown JeepBG
  17. 15 Jeep Patriot 2020 40000 Green JeepBG

答案1

得分: 1

Let's do this in steps, since there are many requirements:

First, let's define columns is_bg and isbg, which will aid us throughout the process:

  1. grouper = df.groupby('Make')['Code']
  2. df['is_bg'] = grouper.transform(lambda s: s.str.endswith('_BG').astype(bool))
  3. df['isbg'] = grouper.transform(lambda s: s.str.endswith('BG')
  4. & ~s.str.endswith('_BG').astype(bool))

Now, let's remove _BG and BG suffixes so that we can find the unique identifier for each Code:

  1. df.loc[df.isbg, 'Code'] = df.loc[df.isbg, 'Code'].str[:-2]
  2. df.loc[df.is_bg, 'Code'] = df.loc[df.is_bg, 'Code'].str[:-3]

Now, we create a mapper and assign:

  1. mapper = df.groupby('Make').apply(lambda s:
  2. # Retrieve the unique code
  3. s.loc
    展开收缩
    .str.len() > 1,'Code'].iloc[0] +
  4. # Append BG if `isbg`
  5. ('BG' if s.isbg.any() else
  6. # Otherwise, append _BG if `is_bg`
  7. ('_BG' if s.is_bg.any() else '')))
  8. df['New_Code'] = df['Make'].map(mapper)

Notice that there's an assumption here that for each Make, we'll only have one entry in Code that, after having _BG and BG removed, will NOT be either whitespace or None. If this is not true, then the problem is ambiguous, and you'll need to explore how to find the unique Code identifier.

英文:

Let's do this in steps, since there are many requirements

First, let's define columns is_bg and isbg, which will aid us throughout the process:

  1. grouper = df.groupby('Make')['Code']
  2. df['is_bg'] = grouper.transform(lambda s: s.str.endswith('_BG').astype(bool))
  3. df['isbg'] = grouper.transform(lambda s: s.str.endswith('BG')
  4. & ~s.str.endswith('_BG').astype(bool))

Now, let's remove _BG and BG suffixes so that we can find the unique identifier for each Code:

  1. df.loc[df.isbg, 'Code'] = df.loc[df.isbg, 'Code'].str[:-2]
  2. df.loc[df.is_bg, 'Code'] = df.loc[df.is_bg, 'Code'].str[:-3]

Now, we create a mapper and assign:

  1. mapper = df.groupby('Make').apply(lambda s:
  2. # Retrieve the unique code
  3. s.loc
    展开收缩
    .str.len() > 1,'Code'].iloc[0] +
  4. # Append BG if `isbg`
  5. ('BG' if s.isbg.any() else
  6. # Otherwise, append _BG if `is_bg`
  7. ('_BG' if s.is_bg.any() else '')))
  8. df['New_Code'] = df['Make'].map(mapper)

Notice that there's an assumption here that for each Make, we'll only have one entry in Code that, after having _BG and BG removed, will NOT be either whitespace or None. If this is not true, then the problem is ambiguous and you'll need to explore how to find the unique Code identifier.


  1. Make Type Year Price Code isbg is_bg New_Code
  2. 0 Tesla Model X 2015 85000 TSLA True False TSLABG
  3. 1 Tesla Model X 2015 90000 TSLA False False TSLABG
  4. 2 Tesla Model X 2015 95000 None False False TSLABG
  5. 3 Toyota Corolla 2017 20000 TYTA False False TYTA
  6. 4 Ford Bronco 2018 35000 FRD False False FRD_BG
  7. 5 Ford Bronco 2018 35000 False True FRD_BG
  8. 6 Ford Mustang 2020 45000 None False False FRD_BG
  9. 7 BMW 3 Series 2015 40000 None False False BMW
  10. 8 BMW 3 Series 2015 40000 BMW False False BMW
  11. 9 BMW 7 Series 2017 65000 BMW False False BMW
  12. 10 Mercedes C-Class 2018 50000 Mercedes True False MercedesBG
  13. 11 Mercedes C-Class 2018 50000 Mercedes False True MercedesBG
  14. 12 Mercedes S-Class 2020 75000 Mercedes True False MercedesBG
  15. 13 Jeep Wrangler 2020 60000 None False False JeepBG
  16. 14 Jeep Compass 2021 45000 False True JeepBG
  17. 15 Jeep Patriot 2020 40000 Jeep True False JeepBG

答案2

得分: 1

remove _BG and BG / fill value except BG

  1. s = df['Code'].str.replace(r'[_]?BG', '', regex=True).replace({'': None}).groupby(df['Make']).transform('first')

s

  1. 0 TSLA
  2. 1 TSLA
  3. 2 TSLA
  4. 3 TYTA
  5. 4 FRD
  6. 5 FRD
  7. 6 FRD
  8. 7 BMW
  9. 8 BMW
  10. 9 BMW
  11. 10 Mercedes
  12. 11 Mercedes
  13. 12 Mercedes
  14. 13 Jeep
  15. 14 Jeep
  16. 15 Jeep

extract BG and _BG and so on..

  1. df1 = df['Code'].str.extract(r'[^_]+(BG)|(_BG)').groupby(df['Make']).ffill().groupby(df['Make']).bfill()

df1

  1. 0 1
  2. 0 BG NaN
  3. 1 BG NaN
  4. 2 BG NaN
  5. 3 NaN NaN
  6. 4 NaN _BG
  7. 5 NaN _BG
  8. 6 NaN _BG
  9. 7 NaN NaN
  10. 8 NaN NaN
  11. 9 NaN NaN
  12. 10 BG _BG
  13. 11 BG _BG
  14. 12 BG _BG
  15. 13 BG _BG
  16. 14 BG _BG
  17. 15 BG _BG

fillna df1 and string concat

  1. df.assign(Code=s.str.cat(df1[0].fillna(df1[1]).fillna('')))

out

  1. Make Type Year Price Color Code
  2. 0 Tesla Model X 2015 85000 White TSLABG
  3. 1 Tesla Model X 2015 90000 White TSLABG
  4. 2 Tesla Model X 2015 95000 White TSLABG
  5. 3 Toyota Corolla 2017 20000 Red TYTA
  6. 4 Ford Bronco 2018 35000 Blue FRD_BG
  7. 5 Ford Bronco 2018 35000 Blue FRD_BG
  8. 6 Ford Mustang 2020 45000 Yellow FRD_BG
  9. 7 BMW 3 Series 2015 40000 Silver BMW
  10. 8 BMW 3 Series 2015 40000 Silver BMW
  11. 9 BMW 7 Series 2017 65000 Black BMW
  12. 10 Mercedes C-Class 2018 50000 White MercedesBG
  13. 11 Mercedes C-Class 2018 50000 White MercedesBG
  14. 12 Mercedes S-Class 2020 75000 Black MercedesBG
  15. 13 Jeep Wrangler 2020 60000 Grey JeepBG
  16. 14 Jeep Compass 2021 45000 Brown JeepBG
  17. 15 Jeep Patriot 2020 40000 Green JeepBG
英文:

remove _BG and BG / fill value except BG

  1. s = df['Code'].str.replace(r'[_]?BG', '', regex=True).replace({'': None}).groupby(df['Make']).transform('first')

s

  1. 0 TSLA
  2. 1 TSLA
  3. 2 TSLA
  4. 3 TYTA
  5. 4 FRD
  6. 5 FRD
  7. 6 FRD
  8. 7 BMW
  9. 8 BMW
  10. 9 BMW
  11. 10 Mercedes
  12. 11 Mercedes
  13. 12 Mercedes
  14. 13 Jeep
  15. 14 Jeep
  16. 15 Jeep

extract BG and _BG and so on..

  1. df1 = df['Code'].str.extract(r'[^_]+(BG)|(_BG)').groupby(df['Make']).ffill().groupby(df['Make']).bfill()

df1

  1. 0 1
  2. 0 BG NaN
  3. 1 BG NaN
  4. 2 BG NaN
  5. 3 NaN NaN
  6. 4 NaN _BG
  7. 5 NaN _BG
  8. 6 NaN _BG
  9. 7 NaN NaN
  10. 8 NaN NaN
  11. 9 NaN NaN
  12. 10 BG _BG
  13. 11 BG _BG
  14. 12 BG _BG
  15. 13 BG _BG
  16. 14 BG _BG
  17. 15 BG _BG

fillna df1 and string concat

  1. df.assign(Code=s.str.cat(df1[0].fillna(df1[1]).fillna('')))

out

  1. Make Type Year Price Color Code
  2. 0 Tesla Model X 2015 85000 White TSLABG
  3. 1 Tesla Model X 2015 90000 White TSLABG
  4. 2 Tesla Model X 2015 95000 White TSLABG
  5. 3 Toyota Corolla 2017 20000 Red TYTA
  6. 4 Ford Bronco 2018 35000 Blue FRD_BG
  7. 5 Ford Bronco 2018 35000 Blue FRD_BG
  8. 6 Ford Mustang 2020 45000 Yellow FRD_BG
  9. 7 BMW 3 Series 2015 40000 Silver BMW
  10. 8 BMW 3 Series 2015 40000 Silver BMW
  11. 9 BMW 7 Series 2017 65000 Black BMW
  12. 10 Mercedes C-Class 2018 50000 White MercedesBG
  13. 11 Mercedes C-Class 2018 50000 White MercedesBG
  14. 12 Mercedes S-Class 2020 75000 Black MercedesBG
  15. 13 Jeep Wrangler 2020 60000 Grey JeepBG
  16. 14 Jeep Compass 2021 45000 Brown JeepBG
  17. 15 Jeep Patriot 2020 40000 Green JeepBG

答案3

得分: 0

Building off of Panda Kim's solution:

  1. lhs = df['Code'].str.replace(r'[_]?BG', '', regex=True).replace({'': None}).groupby(df['Make']).transform('first')
  2. rhs = df['Code'].str.extract(r'[^_]+(BG)|(_BG)', expand=False).groupby(df['Make']).ffill().groupby(df['Make']).bfill()
  3. rhs = rhs[0].fillna((rhs[1]).fillna(''))
  4. df['Code'] = lhs + rhs
英文:

Building off of Panda Kim's solution

  1. lhs = df['Code'].str.replace(r'[_]?BG', '', regex=True).replace({'': None}).groupby(df['Make']).transform('first')
  2. rhs = df['Code'].str.extract(r'[^_]+(BG)|(_BG)',expand =False).groupby(df['Make']).ffill().groupby(df['Make']).bfill()
  3. rhs = rhs[0].fillna((rhs[1]).fillna(''))
  4. df['Code'] = lhs + rhs

huangapple
  • 本文由 发表于 2023年5月10日 23:32:28
  • 转载请务必保留本文链接:https://go.coder-hub.com/76220259.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定