英文:
Parse a column in data frame separated by known delimiter the reshape it
问题
我有一个包含以下列的汽车数据框。
Car_code
、car_model
、update_Date
、sensor_codes
,每辆车在sensor_codes
中列出多个传感器,以斜杠 /
分隔,如下所示:
df = pd.DataFrame([['x','iii-2019-10-16','18/04/2019','115/556/879/115'],
['x','iii-2019-10-16','21/07/2019','87/998/115'],
['x','iii-2019-10-16','','115/556/879'],
['x','zzz-2020-10-25','12/04/2022',''],
['y','qqq-2018-05-28','10/12/2017','789/554/745'],
['y','qqq-2018-05-28','15/02/2018','789/554/75'],
['y','ooo-2019-11-22','30/05/2019','55'],
['y','rrr-16-12-2020','16/12/2020',''],
['z','ppt-2019-12-03','07/02/2018','889/654/750'],
['z','ttt-2019-12-03','28/05/2019','119/55/75'],
['z','ttt-2019-12-03','09/09/2019'],
['z','ttt-2019-12-03','30/09/2019']
],
columns=['Car_code','car_model','update_Date','sensor_codes'])
df
我需要创建一个新的数据框,只包含两列:Car_code
和传感器代码(其中包含唯一的传感器)。所以对于每个 car_code
,将有多行,每一行都有一个传感器,如下所示:
英文:
I have a data frame for the cars with these columns.
Car_code
, car_model
, update_Date
, sensor_codes
each car has multiple sensors listed in sensor_codes
and separated by /
as below:
df = pd.DataFrame([['x','iii-2019-10-16','18/04/2019','115/556/879/115'],
['x','iii-2019-10-16','21/07/2019','87/998/115'],
['x','iii-2019-10-16','','115/556/879'],
['x','zzz-2020-10-25','12/04/2022',''],
['y','qqq-2018-05-28','10/12/2017','789/554/745'],
['y','qqq-2018-05-28','15/02/2018','789/554/75'],
['y','ooo-2019-11-22','30/05/2019','55'],
['y','rrr-16-12-2020','16/12/2020',''],
['z','ppt-2019-12-03','07/02/2018','889/654/750'],
['z','ttt-2019-12-03','28/05/2019','119/55/75'],
['z','ttt-2019-12-03','09/09/2019'],
['z','ttt-2019-12-03','30/09/2019']
],
columns=['Car_code','car_model','update_Date','sensor_codes'])
df
I need to create a new data frame that has just two columns: Car_code
, and sensor code (which will contain the unique sensor) so for each car_code
there will be multiple rows each one has one sensor like below:
答案1
得分: 0
以下是代码的中文翻译部分:
df_result = (
df[['Car_code']]
.assign(sensor_codes=df['sensor_codes'].str.split('/'))
.explode('sensor_codes')
.loc[lambda df:
df['sensor_codes'].notna() & df['sensor_codes'].str.strip().ne('')
].drop_duplicates(keep='first')
.assign(sensor_codes=lambda df: df['sensor_codes'].astype('int'))
.reset_index(drop=True)
)
代码的翻译完成。
英文:
You could try:
df_result = (
df[['Car_code']]
.assign(sensor_codes=df['sensor_codes'].str.split('/'))
.explode('sensor_codes')
.loc[lambda df:
df['sensor_codes'].notna() & df['sensor_codes'].str.strip().ne('')
].drop_duplicates(keep='first')
.assign(sensor_codes=lambda df: df['sensor_codes'].astype('int'))
.reset_index(drop=True)
)
- Pick the column
Car_code
as a dataframe (therefore the double brackets) and addsensor_codes
as new column, but use.str.split('/')
to split the items along'/'
into lists. - Then
.explode
thesensor_codes
column to flatten it, which keeps the respective items inCar_code
. - Afterwards use
.loc
to filter out the rows that have either aNaN
(hereNone
) or an empty string insensor_codes
(the.str.strip
is just in case there's only whitespace in an item). - Then drop duplicate rows except for the first ones.
- Finally cast
sensor_codes
into integers and reset the index (both optional, might well be that you don't need it).
Result for the sample:
Car_code sensor_codes
0 x 115
1 x 556
2 x 879
3 x 87
4 x 998
5 y 789
6 y 554
7 y 745
8 y 75
9 y 55
10 z 889
11 z 654
12 z 750
13 z 119
14 z 55
15 z 75
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论