英文:
Better way to duplicate rows based on two columns, merging those columns into a single column
问题
我有以下的Pandas数据框(DF)...
即。
,resultset_id,resultsetrevision_id,injection_id,injection_acqmethod_id,injection_damethod_id
0,8c502f71-9965-43c9-b3be-e7988a2fc89e,023c8953-565e-4953-991a-a842e0444e67,5cff24fc-f1b8-43b1-98a5-39fc41c27a33,f85b0a52-52a8-4e8d-93c3-54be11c7f8c3,
1,8c502f71-9965-43c9-b3be-e7988a2fc89e,023c8953-565e-4953-991a-a842e0444e67,6c005f00-8654-4ebc-8e42-c92bd4a5fa64,53b34ff9-fec2-472d-a4d0-61e6029d586a,cd4cbbd9-5f23-4146-a499-9c90e3c73383
为了以后合并,最好有四行(而不是两行),并且injection_acqmethod_id
和injection_damethod_id
都放在method_id
列中,如下所示...
即。
,resultset_id,resultsetrevision_id,injection_id,injection_acqmethod_id,injection_damethod_id,method_id
0,8c502f71-9965-43c9-b3be-e7988a2fc89e,023c8953-565e-4953-991a-a842e0444e67,5cff24fc-f1b8-43b1-98a5-39fc41c27a33,f85b0a52-52a8-4e8d-93c3-54be11c7f8c3,,f85b0a52-52a8-4e8d-93c3-54be11c7f8c3
0,8c502f71-9965-43c9-b3be-e7988a2fc89e,023c8953-565e-4953-991a-a842e0444e67,5cff24fc-f1b8-43b1-98a5-39fc41c27a33,f85b0a52-52a8-4e8d-93c3-54be11c7f8c3,,None
1,8c502f71-9965-43c9-b3be-e7988a2fc89e,023c8953-565e-4953-991a-a842e0444e67,6c005f00-8654-4ebc-8e42-c92bd4a5fa64,53b34ff9-fec2-472d-a4d0-61e6029d586a,cd4cbbd9-5f23-4146-a499-9c90e3c73383,53b34ff9-fec2-472d-a4d0-61e6029d586a
1,8c502f71-9965-43c9-b3be-e7988a2fc89e,023c8953-565e-4953-991a-a842e0444e67,6c005f00-8654-4ebc-8e42-c92bd4a5fa64,53b34ff9-fec2-472d-a4d0-61e6029d586a,cd4cbbd9-5f23-4146-a499-9c90e3c73383,cd4cbbd9-5f23-4146-a499-9c90e3c73383
我正在使用以下代码...
_injections["method_id"] = (_injections.injection_acqmethod_id.astype(str) + "," + _injections.injection_damethod_id.astype(str)).str.split(",")
_injections = _injections.explode("method_id")
将列合并成列表,然后再展开似乎是不必要的工作。有没有更Pythonic/更快/更简洁的方法来做到这一点?
英文:
I have the following Padas DF...
I.e.
,resultset_id,resultsetrevision_id,injection_id,injection_acqmethod_id,injection_damethod_id
0,8c502f71-9965-43c9-b3be-e7988a2fc89e,023c8953-565e-4953-991a-a842e0444e67,5cff24fc-f1b8-43b1-98a5-39fc41c27a33,f85b0a52-52a8-4e8d-93c3-54be11c7f8c3,
1,8c502f71-9965-43c9-b3be-e7988a2fc89e,023c8953-565e-4953-991a-a842e0444e67,6c005f00-8654-4ebc-8e42-c92bd4a5fa64,53b34ff9-fec2-472d-a4d0-61e6029d586a,cd4cbbd9-5f23-4146-a499-9c90e3c73383
In order for a later merge, it would be better if I had four rows (instead of two) and injection_acqmethod_id
and injection_damethod_id
were both simply in column method_id
as follows...
I.e.
,resultset_id,resultsetrevision_id,injection_id,injection_acqmethod_id,injection_damethod_id,method_id
0,8c502f71-9965-43c9-b3be-e7988a2fc89e,023c8953-565e-4953-991a-a842e0444e67,5cff24fc-f1b8-43b1-98a5-39fc41c27a33,f85b0a52-52a8-4e8d-93c3-54be11c7f8c3,,f85b0a52-52a8-4e8d-93c3-54be11c7f8c3
0,8c502f71-9965-43c9-b3be-e7988a2fc89e,023c8953-565e-4953-991a-a842e0444e67,5cff24fc-f1b8-43b1-98a5-39fc41c27a33,f85b0a52-52a8-4e8d-93c3-54be11c7f8c3,,None
1,8c502f71-9965-43c9-b3be-e7988a2fc89e,023c8953-565e-4953-991a-a842e0444e67,6c005f00-8654-4ebc-8e42-c92bd4a5fa64,53b34ff9-fec2-472d-a4d0-61e6029d586a,cd4cbbd9-5f23-4146-a499-9c90e3c73383,53b34ff9-fec2-472d-a4d0-61e6029d586a
1,8c502f71-9965-43c9-b3be-e7988a2fc89e,023c8953-565e-4953-991a-a842e0444e67,6c005f00-8654-4ebc-8e42-c92bd4a5fa64,53b34ff9-fec2-472d-a4d0-61e6029d586a,cd4cbbd9-5f23-4146-a499-9c90e3c73383,cd4cbbd9-5f23-4146-a499-9c90e3c73383
I'm using the following code...
_injections["method_id"] = (_injections.injection_acqmethod_id.astype(str) + "," + _injections.injection_damethod_id.astype(str)).str.split(",")
_injections = _injections.explode("method_id")
Merging the columns into a list and then exploding seems like unnecessary work. Is tere a more pythonic/faster/more-concise way to do this?
答案1
得分: 1
One idea with reshape:
cols = ['injection_acqmethod_id', 'injection_damethod_id']
out = (df.assign(**{f'_{x}_': df[x] for x in cols})
.set_index(list(df.columns))
.stack()
.droplevel(-1)
.reset_index(name='method_id'))
print(out)
Another idea with numpy ravel:
cols = ['injection_acqmethod_id', 'injection_damethod_id']
out = df.loc[df.index.repeat(len(cols))].assign(method_id=np.ravel(df[cols].to_numpy()))
Or with concat
:
cols = ['injection_acqmethod_id', 'injection_damethod_id']
out = (pd.concat([df.assign(method_id=df[x]) for x in cols])
.sort_index(kind='stable', ignore_index=True))
英文:
One idea with reshape:
cols = ['injection_acqmethod_id','injection_damethod_id']
out = (df.assign(**{f'_{x}_': df[x] for x in cols})
.set_index(list(df.columns))
.stack()
.droplevel(-1)
.reset_index(name='method_id'))
print (out)
Another idea with numpy ravel:
cols = ['injection_acqmethod_id','injection_damethod_id']
out = df.loc[df.index.repeat(len(cols))].assign(method_id = np.ravel(df[cols].to_numpy()))
Or with concat
:
cols = ['injection_acqmethod_id','injection_damethod_id']
out = (pd.concat([df.assign(method_id=df[x]) for x in cols])
.sort_index(kind='stable', ignore_index=True))
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论