在Pandas中对数据转换进行矢量化或加速循环

huangapple go评论71阅读模式
英文:

Vectorizing or speed up for loop in Pandas for data transformation

问题

我有一个以以下格式的数据帧:

df = pd.DataFrame({'Parent_username': ['Bob1', 'Ron23', 'Lisa00', 'Joe_'],
                   'Parent_age': [38, None, 40, 26],
                   'Child1_name': ['Mike', 'John', 'Curt', 'Kelly'],
                   'Child1_age': [2, None, 1, 2],
                   'Child2_name': ['Pat', 'Dennis', None, None],
                   'Child2_age': [4, None, None, None]})

如上所示,每一行对应一个父母(唯一ID),每个父母可以有多个子女。子女可以有很多属性,但在这个示例中,我只有两个属性(姓名、年龄)。子女属性列遵循相同的约定。

我想将其转换为这样:

df2 = pd.DataFrame({'Child_name': ['Mike', 'Pat', 'John', 'Dennis', 'Curt', 'Kelly'],
                    'Child_number': [1, 2, 1, 2, 1, 1],
                    'Child_age': [2, 4, None, None, 1, 2],
                    'Parent_username': ['Bob1', 'Bob1', 'Ron23', 'Ron23', 'Lisa00', 'Joe_'],
                    'Parent_age': [38, 38, None, None, 40, 26]})

每一行对应一个子女,Child_number表示是否是第一个子女或第二个子女等。

为了加快速度,我通过创建一个正确大小的空数据帧来预先分配df2的空间,而不是使用连接操作。我首先遍历df1,计算每个父母有多少子女,以获得df2所需的行数。

然后,我构建了索引的字典,将每个子女/父母映射到df2中的行/行。我认为由于字典查找速度快,这比每次使用where()来查找df2中的行要好。同样,使用for循环来完成这个任务。

这些步骤实际上不需要很长时间。但是使用for循环将数据从df复制到df2实际上需要很长时间:

for index in df.index:
    for col in df.columns:
        // 将df.loc[index, col]复制到df2中相应位置,使用dataframe.loc

我真的希望有一种更快的方法来做这个。我不太理解向量化,也不确定它是否适用于字符串列。

请建议。谢谢!

英文:

I have a dataframe in the following format:

df = pd.DataFrame({'Parent_username': ['Bob1', 'Ron23', 'Lisa00', 'Joe_'],
                   'Parent_age': [38, None, 40, 26],
                   'Child1_name': ['Mike', 'John', 'Curt', 'Kelly'],
                   'Child1_age': [2, None, 1, 2],
                   'Child2_name': ['Pat', 'Dennis', None, None],
                   'Child2_age': [4, None, None, None]}) 

  Parent_username  Parent_age Child1_name  Child1_age Child2_name  Child2_age
0            Bob1        38.0        Mike         2.0         Pat         4.0
1           Ron23         NaN        John         NaN      Dennis         NaN
2          Lisa00        40.0        Curt         1.0        None         NaN
3            Joe_        26.0       Kelly         2.0        None         NaN

As you can see above, each row corresponds to a parent (unique ID), and each parents can have multiple children. There can be many children but I have 2 listed, and each child can have many attributes, but I only have 2 (name, age) in this example. The child attribute columns follow the same convention.

I'd like to transform it to such:

df2 = pd.DataFrame({'Child_name': ['Mike', 'Pat', 'John', 'Dennis', 'Curt', 'Kelly'],
                    'Child_number': [1, 2, 1, 2, 1, 1],
                    'Child_age': [2, 4, None, None, 1, 2],
                    'Parent_username': ['Bob1', 'Bob1', 'Ron23', 'Ron23', 'Lisa00', 'Joe_'],
                    'Parent_age': [38, 38, None, None, 40, 26]})

  Child_name  Child_number  Child_age Parent_username  Parent_age
0       Mike             1        2.0            Bob1        38.0
1        Pat             2        4.0            Bob1        38.0
2       John             1        NaN           Ron23         NaN
3     Dennis             2        NaN           Ron23         NaN
4       Curt             1        1.0          Lisa00        40.0
5      Kelly             1        2.0            Joe_        26.0

Each row corresponds to a child, and the Child_number indicates if it's the first child or second child, etc.

In order to speed things up, I pre-allocated space for df2 by making an empty dataframe of the right size, rather than doing concatenation. I first looped through df1 by counting how many children each parent has, to get the number of rows needed for df2.

Then, I built dictionaries of indexes to map each child/parent to its row/rows in df2. I figure since dictionary lookup is fast, this is better than trying to find the row in df2 each time using where(). Again, a for loop was used for this.

Those actually does not take long. But the actual copying of the data from df to df2 takes a long time using for loop:

for index in df.index:
    for col in df.columns:
        // copy df.loc[index, col] into the corresponding position in df2 using dataframe.loc

I'm really hoping there is a faster way to do this. I don't understand vectorization very well and I'm not sure if it works well for string columns.

Please advise.
Thanks

答案1

得分: 2

你的代码运行缓慢,因为你逐个元素处理它们。你可以通过逐列处理来加快速度。下面的代码找到所有子名称列,找到它们有值(即非空)的索引,然后一次性处理所有这些字段。

我还添加了一种方法,可以提前列出所有属性,这样你就不必手动逐个重命名它们了。

cnames = [i for i in df.columns if i.startswith('Child') and i.endswith('name')]
cattrs = ['_name', '_age']
newnames = ['Child' + i for i in cattrs]
dflist = []

for childcol in cnames:
    cid = childcol.split('_')[0]
    cnum = int(cid[-1])
    attrs = [cid + i for i in cattrs]  # 获取所有属性
    attrs.extend(['Parent_username', 'Parent_age'])

    cdf = df.loc[df[childcol].dropna().index, attrs]
    cdf['Child_number'] = cnum

    cdf = cdf.rename(columns=dict(zip(attrs, newnames)))
    dflist.append(cdf)

newdf = pd.concat(dflist)
newdf = newdf.reset_index(drop=True)
英文:

Your code is slow because you are working on each element one at a time. You can speed it up by working on a column at a time. The code below finds all child name columns, finds the indexes where they have a value (i.e. not null) and operates on all those fields at once.

I also added ways to list all the attributes ahead of time so that you don't have to manually rename them individually.

cnames =  [i for i in df.columns if i.startswith('Child') and i.endswith('name')]
cattrs = ['_name', '_age']
newnames = ['Child' + i for i in cattrs]
dflist = []


for childcol in cnames:
    cid = childcol.split('_')[0]
    cnum = int(cid[-1])
    attrs  = [cid + i for i in cattrs] # get all the attributes
    attrs.extend(['Parent_username', 'Parent_age'])
    
    cdf = df.loc[df[childcol].dropna().index, attrs]
    cdf['Child_number'] = cnum
    
    cdf = cdf.rename(columns=dict(zip(attrs, newnames)))
    dflist.append(cdf)
    
newdf = pd.concat(dflist)
newdf = newdf.reset_index(drop=True)

huangapple
  • 本文由 发表于 2023年7月14日 00:41:15
  • 转载请务必保留本文链接:https://go.coder-hub.com/76681632.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定