在Pandas中对数据转换进行矢量化或加速循环

huangapple go评论98阅读模式
英文:

Vectorizing or speed up for loop in Pandas for data transformation

问题

我有一个以以下格式的数据帧:

  1. df = pd.DataFrame({'Parent_username': ['Bob1', 'Ron23', 'Lisa00', 'Joe_'],
  2. 'Parent_age': [38, None, 40, 26],
  3. 'Child1_name': ['Mike', 'John', 'Curt', 'Kelly'],
  4. 'Child1_age': [2, None, 1, 2],
  5. 'Child2_name': ['Pat', 'Dennis', None, None],
  6. 'Child2_age': [4, None, None, None]})

如上所示,每一行对应一个父母(唯一ID),每个父母可以有多个子女。子女可以有很多属性,但在这个示例中,我只有两个属性(姓名、年龄)。子女属性列遵循相同的约定。

我想将其转换为这样:

  1. df2 = pd.DataFrame({'Child_name': ['Mike', 'Pat', 'John', 'Dennis', 'Curt', 'Kelly'],
  2. 'Child_number': [1, 2, 1, 2, 1, 1],
  3. 'Child_age': [2, 4, None, None, 1, 2],
  4. 'Parent_username': ['Bob1', 'Bob1', 'Ron23', 'Ron23', 'Lisa00', 'Joe_'],
  5. 'Parent_age': [38, 38, None, None, 40, 26]})

每一行对应一个子女,Child_number表示是否是第一个子女或第二个子女等。

为了加快速度,我通过创建一个正确大小的空数据帧来预先分配df2的空间,而不是使用连接操作。我首先遍历df1,计算每个父母有多少子女,以获得df2所需的行数。

然后,我构建了索引的字典,将每个子女/父母映射到df2中的行/行。我认为由于字典查找速度快,这比每次使用where()来查找df2中的行要好。同样,使用for循环来完成这个任务。

这些步骤实际上不需要很长时间。但是使用for循环将数据从df复制到df2实际上需要很长时间:

  1. for index in df.index:
  2. for col in df.columns:
  3. // 将df.loc[index, col]复制到df2中相应位置,使用dataframe.loc

我真的希望有一种更快的方法来做这个。我不太理解向量化,也不确定它是否适用于字符串列。

请建议。谢谢!

英文:

I have a dataframe in the following format:

  1. df = pd.DataFrame({'Parent_username': ['Bob1', 'Ron23', 'Lisa00', 'Joe_'],
  2. 'Parent_age': [38, None, 40, 26],
  3. 'Child1_name': ['Mike', 'John', 'Curt', 'Kelly'],
  4. 'Child1_age': [2, None, 1, 2],
  5. 'Child2_name': ['Pat', 'Dennis', None, None],
  6. 'Child2_age': [4, None, None, None]})
  7. Parent_username Parent_age Child1_name Child1_age Child2_name Child2_age
  8. 0 Bob1 38.0 Mike 2.0 Pat 4.0
  9. 1 Ron23 NaN John NaN Dennis NaN
  10. 2 Lisa00 40.0 Curt 1.0 None NaN
  11. 3 Joe_ 26.0 Kelly 2.0 None NaN

As you can see above, each row corresponds to a parent (unique ID), and each parents can have multiple children. There can be many children but I have 2 listed, and each child can have many attributes, but I only have 2 (name, age) in this example. The child attribute columns follow the same convention.

I'd like to transform it to such:

  1. df2 = pd.DataFrame({'Child_name': ['Mike', 'Pat', 'John', 'Dennis', 'Curt', 'Kelly'],
  2. 'Child_number': [1, 2, 1, 2, 1, 1],
  3. 'Child_age': [2, 4, None, None, 1, 2],
  4. 'Parent_username': ['Bob1', 'Bob1', 'Ron23', 'Ron23', 'Lisa00', 'Joe_'],
  5. 'Parent_age': [38, 38, None, None, 40, 26]})
  6. Child_name Child_number Child_age Parent_username Parent_age
  7. 0 Mike 1 2.0 Bob1 38.0
  8. 1 Pat 2 4.0 Bob1 38.0
  9. 2 John 1 NaN Ron23 NaN
  10. 3 Dennis 2 NaN Ron23 NaN
  11. 4 Curt 1 1.0 Lisa00 40.0
  12. 5 Kelly 1 2.0 Joe_ 26.0

Each row corresponds to a child, and the Child_number indicates if it's the first child or second child, etc.

In order to speed things up, I pre-allocated space for df2 by making an empty dataframe of the right size, rather than doing concatenation. I first looped through df1 by counting how many children each parent has, to get the number of rows needed for df2.

Then, I built dictionaries of indexes to map each child/parent to its row/rows in df2. I figure since dictionary lookup is fast, this is better than trying to find the row in df2 each time using where(). Again, a for loop was used for this.

Those actually does not take long. But the actual copying of the data from df to df2 takes a long time using for loop:

  1. for index in df.index:
  2. for col in df.columns:
  3. // copy df.loc[index, col] into the corresponding position in df2 using dataframe.loc

I'm really hoping there is a faster way to do this. I don't understand vectorization very well and I'm not sure if it works well for string columns.

Please advise.
Thanks

答案1

得分: 2

你的代码运行缓慢,因为你逐个元素处理它们。你可以通过逐列处理来加快速度。下面的代码找到所有子名称列,找到它们有值(即非空)的索引,然后一次性处理所有这些字段。

我还添加了一种方法,可以提前列出所有属性,这样你就不必手动逐个重命名它们了。

  1. cnames = [i for i in df.columns if i.startswith('Child') and i.endswith('name')]
  2. cattrs = ['_name', '_age']
  3. newnames = ['Child' + i for i in cattrs]
  4. dflist = []
  5. for childcol in cnames:
  6. cid = childcol.split('_')[0]
  7. cnum = int(cid[-1])
  8. attrs = [cid + i for i in cattrs] # 获取所有属性
  9. attrs.extend(['Parent_username', 'Parent_age'])
  10. cdf = df.loc[df[childcol].dropna().index, attrs]
  11. cdf['Child_number'] = cnum
  12. cdf = cdf.rename(columns=dict(zip(attrs, newnames)))
  13. dflist.append(cdf)
  14. newdf = pd.concat(dflist)
  15. newdf = newdf.reset_index(drop=True)
英文:

Your code is slow because you are working on each element one at a time. You can speed it up by working on a column at a time. The code below finds all child name columns, finds the indexes where they have a value (i.e. not null) and operates on all those fields at once.

I also added ways to list all the attributes ahead of time so that you don't have to manually rename them individually.

  1. cnames = [i for i in df.columns if i.startswith('Child') and i.endswith('name')]
  2. cattrs = ['_name', '_age']
  3. newnames = ['Child' + i for i in cattrs]
  4. dflist = []
  5. for childcol in cnames:
  6. cid = childcol.split('_')[0]
  7. cnum = int(cid[-1])
  8. attrs = [cid + i for i in cattrs] # get all the attributes
  9. attrs.extend(['Parent_username', 'Parent_age'])
  10. cdf = df.loc[df[childcol].dropna().index, attrs]
  11. cdf['Child_number'] = cnum
  12. cdf = cdf.rename(columns=dict(zip(attrs, newnames)))
  13. dflist.append(cdf)
  14. newdf = pd.concat(dflist)
  15. newdf = newdf.reset_index(drop=True)

huangapple
  • 本文由 发表于 2023年7月14日 00:41:15
  • 转载请务必保留本文链接:https://go.coder-hub.com/76681632.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定