Vectorizing or speeding up a for loop in Pandas for data transformation

Question
I have a dataframe in the following format:
import pandas as pd

df = pd.DataFrame({'Parent_username': ['Bob1', 'Ron23', 'Lisa00', 'Joe_'],
                   'Parent_age': [38, None, 40, 26],
                   'Child1_name': ['Mike', 'John', 'Curt', 'Kelly'],
                   'Child1_age': [2, None, 1, 2],
                   'Child2_name': ['Pat', 'Dennis', None, None],
                   'Child2_age': [4, None, None, None]})
Parent_username Parent_age Child1_name Child1_age Child2_name Child2_age
0 Bob1 38.0 Mike 2.0 Pat 4.0
1 Ron23 NaN John NaN Dennis NaN
2 Lisa00 40.0 Curt 1.0 None NaN
3 Joe_ 26.0 Kelly 2.0 None NaN
As you can see above, each row corresponds to a parent (unique ID), and each parent can have multiple children. There can be many children, but I have two listed here; each child can have many attributes, but I only have two (name, age) in this example. The child attribute columns follow the same naming convention.
I'd like to transform it into this:

df2 = pd.DataFrame({'Child_name': ['Mike', 'Pat', 'John', 'Dennis', 'Curt', 'Kelly'],
                    'Child_number': [1, 2, 1, 2, 1, 1],
                    'Child_age': [2, 4, None, None, 1, 2],
                    'Parent_username': ['Bob1', 'Bob1', 'Ron23', 'Ron23', 'Lisa00', 'Joe_'],
                    'Parent_age': [38, 38, None, None, 40, 26]})
Child_name Child_number Child_age Parent_username Parent_age
0 Mike 1 2.0 Bob1 38.0
1 Pat 2 4.0 Bob1 38.0
2 John 1 NaN Ron23 NaN
3 Dennis 2 NaN Ron23 NaN
4 Curt 1 1.0 Lisa00 40.0
5 Kelly 1 2.0 Joe_ 26.0
Each row corresponds to a child, and Child_number indicates whether it's the first child, second child, etc.
In order to speed things up, I pre-allocated space for df2 by making an empty dataframe of the right size, rather than growing it by concatenation. I first looped through df, counting how many children each parent has, to get the number of rows needed for df2.
Then, I built dictionaries of indexes to map each child/parent to its row(s) in df2. I figured that since dictionary lookups are fast, this would be better than searching df2 for the right row each time with where(). Again, a for loop was used for this.
Those steps don't actually take long. But copying the data from df to df2 with a for loop does:
for index in df.index:
    for col in df.columns:
        # copy df.loc[index, col] into the corresponding position in df2 using DataFrame.loc
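To make this concrete, here is a minimal runnable sketch of that copy loop. The data, the column names, and the identity row mapping are simplified stand-ins for my real dataframes and index dictionaries:

```python
import pandas as pd

# Toy stand-ins for the real data and index dictionaries (names are illustrative)
df = pd.DataFrame({'name': ['Mike', 'Pat'], 'age': [2, 4]})
df2 = pd.DataFrame(index=df.index, columns=df.columns)  # pre-allocated target
row_map = {i: i for i in df.index}  # stands in for my child/parent -> row dictionaries

# The slow part: one .loc read and one .loc write per cell
for index in df.index:
    for col in df.columns:
        df2.loc[row_map[index], col] = df.loc[index, col]
```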
I'm really hoping there is a faster way to do this. I don't understand vectorization very well and I'm not sure if it works well for string columns.
Please advise.
Thanks
Answer 1
Score: 2
Your code is slow because you are working on one element at a time. You can speed it up by working on one column at a time. The code below finds all the child name columns, finds the indexes where each has a value (i.e. is not null), and operates on all of those fields at once.
I also added a way to list all the attributes ahead of time so that you don't have to rename them manually one by one.
cnames = [i for i in df.columns if i.startswith('Child') and i.endswith('name')]
cattrs = ['_name', '_age']
newnames = ['Child' + i for i in cattrs]
dflist = []
for childcol in cnames:
    cid = childcol.split('_')[0]       # e.g. 'Child1'
    cnum = int(cid[-1])                # the child number
    attrs = [cid + i for i in cattrs]  # get all the attributes
    attrs.extend(['Parent_username', 'Parent_age'])
    cdf = df.loc[df[childcol].dropna().index, attrs]  # only rows where this child exists
    cdf['Child_number'] = cnum
    # zip stops at the shorter list, so only the two child columns are renamed
    cdf = cdf.rename(columns=dict(zip(attrs, newnames)))
    dflist.append(cdf)
newdf = pd.concat(dflist)
newdf = newdf.reset_index(drop=True)
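As a loop-free alternative sketch: assuming Parent_username uniquely identifies each row (as it does in the sample), pandas' built-in pd.wide_to_long can do the same reshape once the child number is moved to the end of each column name, since wide_to_long expects the numeric suffix last:

```python
import pandas as pd

df = pd.DataFrame({'Parent_username': ['Bob1', 'Ron23', 'Lisa00', 'Joe_'],
                   'Parent_age': [38, None, 40, 26],
                   'Child1_name': ['Mike', 'John', 'Curt', 'Kelly'],
                   'Child1_age': [2, None, 1, 2],
                   'Child2_name': ['Pat', 'Dennis', None, None],
                   'Child2_age': [4, None, None, None]})

def move_number(col):
    """Rename Child<N>_<attr> to Child_<attr><N>, e.g. Child1_name -> Child_name1."""
    if col.startswith('Child'):
        prefix, attr = col.split('_')        # e.g. 'Child1', 'name'
        return 'Child_' + attr + prefix[5:]  # -> 'Child_name1'
    return col

wide = df.rename(columns=move_number)
long = pd.wide_to_long(wide, stubnames=['Child_name', 'Child_age'],
                       i='Parent_username', j='Child_number')
long = (long.reset_index()
            .dropna(subset=['Child_name'])   # drop slots for children that don't exist
            .reset_index(drop=True))
```

The stubnames list would need extending if there are more child attributes than name and age.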