How can performance be improved when iterating over a NumPy array?
# Question
I'm analyzing large amounts of point cloud data created by a laser scanner. In the third step I remove points based on their z-value, but my function is really slow.

1. Import
   The data is imported from a .csv file using pandas. The imported dataframe 'df' contains the data for X, Y, Z. Example: df has shape [300, 1001]. Then X is the first third of df: X = df.iloc[:100, 1:], Y = df.iloc[100:200, 1:], and so on. The first column (index) is irrelevant. One row in X, Y, Z corresponds to the data of a single scan.
2. Convert to NumPy
   The dataframe 'df' contains many empty fields ''. Therefore I change the data structure to a NumPy array 'A' of shape (N, 3) in which every row represents a single point. All points containing empty values are deleted.
3. Remove points based on the max. height of a scan
   I'm only interested in the points slightly below the maximum of each scan. I use my function 'in_max_height' to create a mask of all points within the allowed range.

Here's my code:
```python
def in_max_height(A, hMax):
    # get unique x values
    unique_x = np.unique(A[:, 0])
    # create an all-False mask with one entry per row (point) of A
    mask = np.zeros_like(A[:, 2], dtype=bool)
    # iterate over unique x and find the max. z-value
    for x in unique_x:
        zMax = np.max(A[A[:, 0] == x, 2])
        mask[A[:, 0] == x] = ~(A[A[:, 0] == x, 2] < zMax - hMax)
    return mask

A = A[in_max_height(A, hMax=1)]  # apply max. layer height
```
4. Analyze
   Create various plots...

I tried to remove the low points right after step 1, but I couldn't figure out how to ignore the index column of the dataframe.
Right now, with an average point cloud of about 375,000 points, my function takes about 11 s to finish. I would like to learn how to fundamentally tackle these big data problems.
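
For reference, steps 1 and 2 can be sketched roughly as follows. This is only an illustration under assumptions: the three blocks are taken to be already extracted as float arrays of equal shape, with empty fields represented as NaN, and the helper name `blocks_to_points` is hypothetical (it is not part of the original code).

```python
import numpy as np


def blocks_to_points(X, Y, Z):
    """Flatten three (n_scans, n_samples) blocks into an (N, 3) point array.

    Assumes X, Y and Z are float arrays of equal shape in which empty
    fields are stored as NaN.
    """
    # one row per point: (x, y, z)
    points = np.column_stack((X.ravel(), Y.ravel(), Z.ravel()))
    # drop every point that has at least one missing coordinate
    keep = ~np.isnan(points).any(axis=1)
    return points[keep]
```

Flattening row by row keeps the points of one scan contiguous in 'A', which is convenient for the per-scan filtering in step 3.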
# Answer 1
**Score**: 1
I admit that my code is not optimal, but it runs faster than 11 s on my laptop:
```python
import random
import time

import numpy as np


def get_random_point():
    i = 1950
    # integer division: random.randint requires integer bounds
    return (random.randint(0, i), random.randint(0, i), random.randint(0, i // 10))


# Construct a test array with 375,000 points and about 1950 unique x values
test_array = np.array([get_random_point() for _ in range(375000)], dtype=np.int64)
print(test_array.shape)
# (375000, 3)

start = time.time()
# Sort by first column, then by last column, and reverse to get decreasing order
tsorted = test_array[np.lexsort((test_array[:, 2], test_array[:, 0]))][::-1]
res = []
u = tsorted[0][0]
z_max = tsorted[0][2]
hmax = 1
for x in tsorted:
    if x[0] != u or not res:
        # first point of a new x group: it carries the largest z of that group
        u = x[0]
        z_max = x[2]
        res.append(x)
    else:
        if x[2] + hmax >= z_max:
            res.append(x)
res = np.array(res)
print(time.time() - start)
# in secs
# 0.47696924209594727
```
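
The Python-level loop can probably be avoided altogether. The following is a minimal sketch, not a benchmarked drop-in replacement: it assumes 'A' is the (N, 3) array from the question with no NaN values left, and the function name `in_max_height_vectorized` is made up for illustration. It uses `np.unique(..., return_inverse=True)` to label each point with its x group and `np.maximum.at` to accumulate the per-group maximum of z.

```python
import numpy as np


def in_max_height_vectorized(A, hMax):
    """Boolean mask of points with z >= (max z of their x group) - hMax."""
    # inverse[i] is the position of A[i, 0] among the sorted unique x values
    unique_x, inverse = np.unique(A[:, 0], return_inverse=True)
    # per-group maximum of z, accumulated without a Python loop
    z_max = np.full(unique_x.shape, -np.inf)
    np.maximum.at(z_max, inverse, A[:, 2])
    # compare every point against the maximum of its own group
    return A[:, 2] >= z_max[inverse] - hMax
```

It would be applied in the same way as the original function, e.g. `A = A[in_max_height_vectorized(A, hMax=1)]`. The design choice is to replace one full-array comparison per unique x value with a single grouping pass over the array.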
# Answer 2
**Score**: 0
I managed to make it even a bit faster by fixing the problem at the dataframe stage.
%timeit: 413 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```python
import numpy as np


def in_max_height(df, hMax):
    """
    Replace z-values smaller than zMax - hMax with NaN.

    Parameters
    ----------
    df : DataFrame
        data after df2float
    hMax : float
        max. allowed height of layer

    Returns
    -------
    df : DataFrame
        DataFrame with NaN for z-values < zMax - hMax
    """
    # load z-data (last third of the rows)
    dfz = df[int(2/3*len(df.index)):]
    # find the maximum value in each row
    max_values = dfz.iloc[:, 1:].max(axis=1)
    # set z < max_values - hMax to NaN
    df[dfz < max_values.values[:, None] - hMax] = np.nan
    return df
```
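
The speed-up here comes from comparing each row against its own maximum in one broadcast operation. A rough plain-NumPy illustration of that idea follows; it assumes the z block is already available as a 2-D float array with NaN for empty fields and one scan per row, and the name `mask_below_row_max` is hypothetical.

```python
import numpy as np


def mask_below_row_max(Z, hMax):
    """Return a copy of Z with values below (row max - hMax) set to NaN."""
    # row-wise maximum ignoring NaN; keepdims=True gives shape (n_scans, 1)
    z_max = np.nanmax(Z, axis=1, keepdims=True)
    # broadcasting compares every value with the maximum of its own row;
    # NaN entries compare as False and therefore stay NaN
    return np.where(Z < z_max - hMax, np.nan, Z)
```

Unlike the DataFrame version above, this sketch returns a new array instead of modifying its input.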