Vectorizing or speeding up a for loop in Pandas for data transformation

Question
I have a dataframe in the following format:
import pandas as pd

df = pd.DataFrame({'Parent_username': ['Bob1', 'Ron23', 'Lisa00', 'Joe_'],
                   'Parent_age': [38, None, 40, 26],
                   'Child1_name': ['Mike', 'John', 'Curt', 'Kelly'],
                   'Child1_age': [2, None, 1, 2],
                   'Child2_name': ['Pat', 'Dennis', None, None],
                   'Child2_age': [4, None, None, None]})
Parent_username Parent_age Child1_name Child1_age Child2_name Child2_age
0 Bob1 38.0 Mike 2.0 Pat 4.0
1 Ron23 NaN John NaN Dennis NaN
2 Lisa00 40.0 Curt 1.0 None NaN
3 Joe_ 26.0 Kelly 2.0 None NaN
As you can see above, each row corresponds to a parent (unique ID), and each parent can have multiple children. There can be many children, but I have two listed here; each child can have many attributes, but I only have two (name, age) in this example. The child attribute columns follow the same naming convention.
I'd like to transform it into this:

df2 = pd.DataFrame({'Child_name': ['Mike', 'Pat', 'John', 'Dennis', 'Curt', 'Kelly'],
                    'Child_number': [1, 2, 1, 2, 1, 1],
                    'Child_age': [2, 4, None, None, 1, 2],
                    'Parent_username': ['Bob1', 'Bob1', 'Ron23', 'Ron23', 'Lisa00', 'Joe_'],
                    'Parent_age': [38, 38, None, None, 40, 26]})
Child_name Child_number Child_age Parent_username Parent_age
0 Mike 1 2.0 Bob1 38.0
1 Pat 2 4.0 Bob1 38.0
2 John 1 NaN Ron23 NaN
3 Dennis 2 NaN Ron23 NaN
4 Curt 1 1.0 Lisa00 40.0
5 Kelly 1 2.0 Joe_ 26.0
Each row corresponds to a child, and Child_number indicates whether it's the first child, second child, etc.
In order to speed things up, I pre-allocated space for df2 by making an empty dataframe of the right size, rather than growing it by concatenation. I first looped through df, counting how many children each parent has, to get the number of rows needed for df2.
Then, I built dictionaries of indexes to map each child/parent to its row(s) in df2. I figured that since dictionary lookups are fast, this would be better than searching df2 for the right row each time with where(). Again, a for loop was used for this.
Those steps don't actually take long. But copying the data from df to df2 with a for loop does:
for index in df.index:
    for col in df.columns:
        # copy df.loc[index, col] into the corresponding position in df2 using DataFrame.loc
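To make this concrete, here is a minimal runnable sketch of that copy loop. The data, the column names, and the identity row mapping are simplified stand-ins for my real dataframes and index dictionaries:

```python
import pandas as pd

# Toy stand-ins for the real data and index dictionaries (names are illustrative)
df = pd.DataFrame({'name': ['Mike', 'Pat'], 'age': [2, 4]})
df2 = pd.DataFrame(index=df.index, columns=df.columns)  # pre-allocated target
row_map = {i: i for i in df.index}  # stands in for my child/parent -> row dictionaries

# The slow part: one .loc read and one .loc write per cell
for index in df.index:
    for col in df.columns:
        df2.loc[row_map[index], col] = df.loc[index, col]
```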
I'm really hoping there is a faster way to do this. I don't understand vectorization very well and I'm not sure if it works well for string columns.
Please advise.
Thanks
Answer 1
Score: 2
Your code is slow because you are working on one element at a time. You can speed it up by working on one column at a time. The code below finds all the child name columns, finds the indexes where each has a value (i.e. is not null), and operates on all of those fields at once.
I also added a way to list all the attributes ahead of time so that you don't have to rename them manually one by one.
cnames = [i for i in df.columns if i.startswith('Child') and i.endswith('name')]
cattrs = ['_name', '_age']
newnames = ['Child' + i for i in cattrs]
dflist = []
for childcol in cnames:
    cid = childcol.split('_')[0]       # e.g. 'Child1'
    cnum = int(cid[-1])                # the child number
    attrs = [cid + i for i in cattrs]  # get all the attributes
    attrs.extend(['Parent_username', 'Parent_age'])
    cdf = df.loc[df[childcol].dropna().index, attrs]  # only rows where this child exists
    cdf['Child_number'] = cnum
    # zip stops at the shorter list, so only the two child columns are renamed
    cdf = cdf.rename(columns=dict(zip(attrs, newnames)))
    dflist.append(cdf)
newdf = pd.concat(dflist)
newdf = newdf.reset_index(drop=True)
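As a loop-free alternative sketch: assuming Parent_username uniquely identifies each row (as it does in the sample), pandas' built-in pd.wide_to_long can do the same reshape once the child number is moved to the end of each column name, since wide_to_long expects the numeric suffix last:

```python
import pandas as pd

df = pd.DataFrame({'Parent_username': ['Bob1', 'Ron23', 'Lisa00', 'Joe_'],
                   'Parent_age': [38, None, 40, 26],
                   'Child1_name': ['Mike', 'John', 'Curt', 'Kelly'],
                   'Child1_age': [2, None, 1, 2],
                   'Child2_name': ['Pat', 'Dennis', None, None],
                   'Child2_age': [4, None, None, None]})

def move_number(col):
    """Rename Child<N>_<attr> to Child_<attr><N>, e.g. Child1_name -> Child_name1."""
    if col.startswith('Child'):
        prefix, attr = col.split('_')        # e.g. 'Child1', 'name'
        return 'Child_' + attr + prefix[5:]  # -> 'Child_name1'
    return col

wide = df.rename(columns=move_number)
long = pd.wide_to_long(wide, stubnames=['Child_name', 'Child_age'],
                       i='Parent_username', j='Child_number')
long = (long.reset_index()
            .dropna(subset=['Child_name'])   # drop slots for children that don't exist
            .reset_index(drop=True))
```

The stubnames list would need extending if there are more child attributes than name and age.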