解析数据帧中由已知分隔符分隔的列,然后重新整理它。

huangapple go评论65阅读模式
英文:

Parse a column in data frame separated by known delimiter the reshape it

问题

我有一个包含以下列的汽车数据框。

Car_codecar_modelupdate_Datesensor_codes,每辆车在sensor_codes中列出多个传感器,以斜杠 / 分隔,如下所示:

df = pd.DataFrame([['x','iii-2019-10-16','18/04/2019','115/556/879/115'],
                   ['x','iii-2019-10-16','21/07/2019','87/998/115'],
                   ['x','iii-2019-10-16','','115/556/879'],
                   ['x','zzz-2020-10-25','12/04/2022',''],
                   ['y','qqq-2018-05-28','10/12/2017','789/554/745'], 
                   ['y','qqq-2018-05-28','15/02/2018','789/554/75'],
                   ['y','ooo-2019-11-22','30/05/2019','55'],
                   ['y','rrr-16-12-2020','16/12/2020',''],
                   ['z','ppt-2019-12-03','07/02/2018','889/654/750'],
                   ['z','ttt-2019-12-03','28/05/2019','119/55/75'],
                   ['z','ttt-2019-12-03','09/09/2019'],
                   ['z','ttt-2019-12-03','30/09/2019']
                  ],
                  columns=['Car_code','car_model','update_Date','sensor_codes'])
df

我需要创建一个新的数据框,只包含两列:Car_code 和传感器代码(其中包含唯一的传感器)。所以对于每个 car_code,将有多行,每一行都有一个传感器,如下所示:

英文:

I have a data frame for the cars with these columns.

Car_code, car_model, update_Date, sensor_codes each car has multiple sensors listed in sensor_codes and separated by / as below:

df = pd.DataFrame([['x','iii-2019-10-16','18/04/2019','115/556/879/115'],
                   ['x','iii-2019-10-16','21/07/2019','87/998/115'],
                   ['x','iii-2019-10-16','','115/556/879'],
                   ['x','zzz-2020-10-25','12/04/2022',''],
                   ['y','qqq-2018-05-28','10/12/2017','789/554/745'], 
                   ['y','qqq-2018-05-28','15/02/2018','789/554/75'],
                   ['y','ooo-2019-11-22','30/05/2019','55'],
                   ['y','rrr-16-12-2020','16/12/2020',''],
                   ['z','ppt-2019-12-03','07/02/2018','889/654/750'],
                   ['z','ttt-2019-12-03','28/05/2019','119/55/75'],
                   ['z','ttt-2019-12-03','09/09/2019'],
                   ['z','ttt-2019-12-03','30/09/2019']
                  
                  ],
                  columns=['Car_code','car_model','update_Date','sensor_codes'])
df

解析数据帧中由已知分隔符分隔的列,然后重新整理它。

I need to create a new data frame that has just two columns: Car_code, and sensor code (which will contain the unique sensor) so for each car_code there will be multiple rows each one has one sensor like below:

解析数据帧中由已知分隔符分隔的列,然后重新整理它。

答案1

得分: 0

以下是代码的中文翻译部分:

df_result = (
    df[['Car_code']]
    .assign(sensor_codes=df['sensor_codes'].str.split('/'))
    .explode('sensor_codes')
    .loc[lambda df:
         df['sensor_codes'].notna() & df['sensor_codes'].str.strip().ne('')
    ].drop_duplicates(keep='first')
    .assign(sensor_codes=lambda df: df['sensor_codes'].astype('int'))
    .reset_index(drop=True)
)

代码的翻译完成。

英文:

You could try:

df_result = (
    df[['Car_code']]
    .assign(sensor_codes=df['sensor_codes'].str.split('/'))
    .explode('sensor_codes')
    .loc[lambda df:
         df['sensor_codes'].notna() & df['sensor_codes'].str.strip().ne('')
    ].drop_duplicates(keep='first')
    .assign(sensor_codes=lambda df: df['sensor_codes'].astype('int'))
    .reset_index(drop=True)
)
  • Pick the column Car_code as a dataframe (therefore the double brackets) and add sensor_codes as new column, but use .str.split('/') to split the items along '/' into lists.
  • Then .explode the sensor_codes column to flatten it, which keeps the respective items in Car_code.
  • Afterwards use .loc to filter out the rows that have either a NaN (here None) or an empty string in sensor_codes (the .str.strip is just in case there's only whitespace in an item).
  • Then drop duplicate rows except for the first ones.
  • Finally cast sensor_codes into integers and reset the index (both optional, might well be that you don't need it).

Result for the sample:

   Car_code sensor_codes
0         x          115
1         x          556
2         x          879
3         x           87
4         x          998
5         y          789
6         y          554
7         y          745
8         y           75
9         y           55
10        z          889
11        z          654
12        z          750
13        z          119
14        z           55
15        z           75

huangapple
  • 本文由 发表于 2023年6月8日 21:03:21
  • 转载请务必保留本文链接:https://go.coder-hub.com/76432142.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定