June 5, 2023 20:55:18

How can performance be improved when iterating over a NumPy array?

Question

I'm analyzing large amounts of point cloud data created by a laser scanner. In the third step I remove points based on their z-value, but my function is really slow.

Import
The data is imported from a .csv file using pandas. The imported dataframe 'df' contains the data for X, Y, Z. Example:
df has a shape of [300, 1001]. Then X is the first third of df: X = df.iloc[:100, 1:], Y = df.iloc[100:200, 1:], and so on.
The first column (index) is irrelevant. One row in X, Y, Z corresponds to the data of a single scan.
Convert to NumPy
The dataframe 'df' contains many empty fields ''. Therefore I change the data structure to a NumPy array 'A' of shape (N, 3) in which every row represents a single point. All points containing empty values are deleted.
Remove points based on the max. height of a scan
I'm only interested in the points that lie at most hMax below the maximum z-value of each scan. I use my function 'in_max_height' to create a mask of all points within the allowed range.

Here's my code:

import numpy as np

def in_max_height(A, hMax):

    # get unique x values (one x value per scan)
    unique_x = np.unique(A[:, 0])

    # create an all-False mask with one entry per point in A
    mask = np.zeros_like(A[:, 2], dtype=bool)

    # iterate over unique x and find the max. z-value
    for x in unique_x:
        zMax = np.max(A[A[:, 0] == x, 2])
        mask[A[:, 0] == x] = ~(A[A[:, 0] == x, 2] < zMax - hMax)

    return mask

A = A[in_max_height(A, hMax=1)]  # apply max. layer height

Analyze
Create various plots...

I tried to remove the low points after step 1 but I couldn't figure out how to ignore the index column of the dataframe.

Right now with an average point cloud consisting of about 375,000 points my function takes about 11 s to finish. I would like to learn how to fundamentally tackle these big data problems.
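
The Import and Convert to NumPy steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `frame_to_points` is hypothetical, the three blocks X, Y, Z are assumed to be equally sized and stacked row-wise, and empty fields are assumed to be empty strings that `pd.to_numeric` can coerce to NaN.

```python
import numpy as np
import pandas as pd

def frame_to_points(df):
    """Turn a frame of stacked X, Y, Z blocks into an (N, 3) point array.

    Hypothetical helper: assumes three equally sized row blocks and an
    irrelevant first column, as described in the question.
    """
    n_scans = len(df) // 3
    values = df.iloc[:, 1:]  # skip the irrelevant first column
    # Coerce empty strings (or other non-numeric fields) to NaN
    values = values.apply(pd.to_numeric, errors="coerce").to_numpy(dtype=float)
    # Flatten each block so that entry k of x, y and z belongs to the same point
    x, y, z = (values[i * n_scans:(i + 1) * n_scans].ravel() for i in range(3))
    points = np.column_stack((x, y, z))
    # Drop every point that has at least one missing coordinate
    return points[~np.isnan(points).any(axis=1)]

# Toy frame: two scans with two points each, one point has empty fields
df = pd.DataFrame([
    [0, 1.0, 1.0],   # X of scan 0
    [1, 2.0, ""],    # X of scan 1
    [2, 0.0, 0.5],   # Y of scan 0
    [3, 0.0, 0.5],   # Y of scan 1
    [4, 3.0, 2.5],   # Z of scan 0
    [5, 4.0, ""],    # Z of scan 1
])
A = frame_to_points(df)
print(A.shape)
```

With this layout, the first column of `A` identifies the scan a point belongs to, which is what `in_max_height` relies on when it groups points by unique x values.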

Answer 1

Score: 1

I admit that my code is not optimal, but it runs faster than 11 s on my laptop:

```python
import random
import time

import numpy as np

def get_random_point():
    i = 1950
    # Integer division: random.randint requires integer bounds
    return (random.randint(0, i), random.randint(0, i), random.randint(0, i // 10))

# Construct a test array with 375,000 points and about 1950 unique x values
test_array = np.array([get_random_point() for _ in range(375000)], dtype=np.int64)
print(test_array.shape)
# (375000, 3)

start = time.time()
# Sort by the first column, then by the last column, in decreasing order
tsorted = test_array[np.lexsort((test_array[:, 2], test_array[:, 0]))][::-1]

res = []
u = tsorted[0][0]
z_max = tsorted[0][2]
hmax = 1
for x in tsorted:
    if x[0] != u or not res:
        # First row of a new x group: it holds the group's maximum z
        u = x[0]
        z_max = x[2]
        res.append(x)
    elif x[2] + hmax >= z_max:
        res.append(x)
res = np.array(res)
print(time.time() - start)
# in seconds
# 0.47696924209594727
```


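
The Python-level loop over rows could probably be avoided altogether. The following is a minimal sketch, not part of the original answer, that computes the per-scan maximum with `np.unique(..., return_inverse=True)` and `np.maximum.at`. The function name `in_max_height_vectorized` is hypothetical, and the sketch assumes `A` is non-empty and contains no NaN values (they were removed in the conversion step).

```python
import numpy as np

def in_max_height_vectorized(A, h_max):
    """Boolean mask of points with z >= (max z of their scan) - h_max."""
    # inverse[i] is the position of A[i, 0] among the sorted unique x values
    _, inverse = np.unique(A[:, 0], return_inverse=True)
    inverse = inverse.ravel()
    # Unbuffered in-place maximum: z_max[g] ends up as the largest z of group g
    z_max = np.full(inverse.max() + 1, -np.inf)
    np.maximum.at(z_max, inverse, A[:, 2])
    # Broadcast each group's maximum back to its points and compare
    return A[:, 2] >= z_max[inverse] - h_max

# Two scans (x = 0 and x = 1) with a maximum layer height of 1
A = np.array([
    [0.0, 0.0, 5.0],
    [0.0, 1.0, 3.5],
    [0.0, 2.0, 4.2],
    [1.0, 0.0, 2.0],
    [1.0, 1.0, 1.5],
])
mask = in_max_height_vectorized(A, h_max=1.0)
print(mask)  # [ True False  True  True  True]
```

The mask keeps the input order, so it can be applied as `A[mask]` just like the result of the original `in_max_height`. No timing is claimed here; measuring it on the real data would be the next step.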



# Answer 2
**Score**: 0

I managed to get even a bit faster by fixing the problem at the dataframe stage.

%timeit: 413 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

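```python
import numpy as np

def in_max_height(df, hMax):
    """
    Replace z-values smaller than zMax - hMax with NaN.

    Parameters
    ----------
    df : DataFrame
        data after df2float
    hMax : float
        max. allowed height of layer

    Returns
    -------
    df : DataFrame
        DataFrame with NaN for z-values < zMax - hMax
    """

    # Load z-data (the last third of the rows); integer arithmetic avoids float rounding
    dfz = df[2 * len(df.index) // 3:]

    # Find the maximum value in each row, ignoring the index column
    max_values = dfz.iloc[:, 1:].max(axis=1)

    # Set z < max_values - hMax to NaN (modifies df in place)
    df[dfz < max_values.values[:, None] - hMax] = np.nan

    return df
```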

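
To make the dataframe-stage idea concrete, here is a small self-contained sketch on a toy frame. It is a variant rather than the answer's exact code: the name `mask_low_z` is hypothetical, it works on a copy instead of modifying `df` in place, and it compares only the data columns of the Z block, leaving the first column untouched. It assumes the data columns are already numeric.

```python
import numpy as np
import pandas as pd

def mask_low_z(df, h_max):
    """Return a copy of df with Z values below (row max - h_max) set to NaN."""
    out = df.copy()
    z_start = 2 * len(df) // 3  # Z occupies the last third of the rows
    z = out.iloc[z_start:, 1:].to_numpy(dtype=float)
    # Per-scan maximum, ignoring fields that are already NaN
    row_max = np.nanmax(z, axis=1, keepdims=True)
    z[z < row_max - h_max] = np.nan
    out.iloc[z_start:, 1:] = z
    return out

# Two scans with three points each: rows 0-1 are X, rows 2-3 are Y, rows 4-5 are Z
df = pd.DataFrame({
    "idx": range(6),
    "p0": [0.0, 1.0, 0.0, 0.0, 5.0, 2.0],
    "p1": [0.0, 1.0, 1.0, 1.0, 3.5, 1.5],
    "p2": [0.0, 1.0, 2.0, 2.0, 4.2, np.nan],
})
out = mask_low_z(df, h_max=1.0)
print(out.iloc[4:, 1:])
```

Only the z-values are blanked here. The matching x and y entries stay in place, so the points are actually dropped later, when all points with a missing coordinate are removed during the conversion to the (N, 3) array.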
