Iterating fast over h5 file and perform some calculations

Question

  1. I need a very fast solution that takes at most 5 seconds on the 9000 data points I provide in the link, because the real data has millions of rows.
  2. Link to the h5 file: https://drive.google.com/file/d/16aI3plRFa3M6nSIiT1XioUIgsPYl1Wg8/view?usp=sharing

The task is as follows: the h5 file contains coordinate data for different body parts of different mice. Read in the h5 file (ideally as a numpy.array rather than with pandas, as I did below) and then calculate each mouse's centroid from the tail1, tail2 and tail3 body parts.

My suspicion is that the .loc indexing is what causes the problem, and that iterating over a DataFrame row by row is generally sub-optimal.

What I have done is standard .loc indexing:

import numpy as np
import pandas as pd

filename = "look at the h5 file in the link"  # the h5 file linked above
new_centroid_trackings = np.array([[0, 0, 0, 0, 0, 0, 0, 0]])  # initial row; one row is concatenated after every iteration
model_name = "DLC_resnet50_4mice_new_video_no_wheelFeb17shuffle1_220000"  # not relevant for the task
tracking_coords = pd.read_hdf(filename)  # read in the data
for frame in range(tracking_coords.shape[0]):
    centroid_mouse1_x = (tracking_coords.loc[frame, model_name]["mouse1"]["tail1"]["x"] + tracking_coords.loc[frame, model_name]["mouse1"]["tail2"]["x"] + tracking_coords.loc[frame, model_name]["mouse1"]["tail3"]["x"]) / 3
    centroid_mouse1_y = (tracking_coords.loc[frame, model_name]["mouse1"]["tail1"]["y"] + tracking_coords.loc[frame, model_name]["mouse1"]["tail2"]["y"] + tracking_coords.loc[frame, model_name]["mouse1"]["tail3"]["y"]) / 3
    centroid_mouse2_x = (tracking_coords.loc[frame, model_name]["mouse2"]["tail1"]["x"] + tracking_coords.loc[frame, model_name]["mouse2"]["tail2"]["x"] + tracking_coords.loc[frame, model_name]["mouse2"]["tail3"]["x"]) / 3
    centroid_mouse2_y = (tracking_coords.loc[frame, model_name]["mouse2"]["tail1"]["y"] + tracking_coords.loc[frame, model_name]["mouse2"]["tail2"]["y"] + tracking_coords.loc[frame, model_name]["mouse2"]["tail3"]["y"]) / 3
    centroid_mouse3_x = (tracking_coords.loc[frame, model_name]["mouse3"]["tail1"]["x"] + tracking_coords.loc[frame, model_name]["mouse3"]["tail2"]["x"] + tracking_coords.loc[frame, model_name]["mouse3"]["tail3"]["x"]) / 3
    centroid_mouse3_y = (tracking_coords.loc[frame, model_name]["mouse3"]["tail1"]["y"] + tracking_coords.loc[frame, model_name]["mouse3"]["tail2"]["y"] + tracking_coords.loc[frame, model_name]["mouse3"]["tail3"]["y"]) / 3
    # mouse4 uses tail2 here (the original post had "tail4"), matching the stated task of averaging tail1, tail2 and tail3
    centroid_mouse4_x = (tracking_coords.loc[frame, model_name]["mouse4"]["tail1"]["x"] + tracking_coords.loc[frame, model_name]["mouse4"]["tail2"]["x"] + tracking_coords.loc[frame, model_name]["mouse4"]["tail3"]["x"]) / 3
    centroid_mouse4_y = (tracking_coords.loc[frame, model_name]["mouse4"]["tail1"]["y"] + tracking_coords.loc[frame, model_name]["mouse4"]["tail2"]["y"] + tracking_coords.loc[frame, model_name]["mouse4"]["tail3"]["y"]) / 3
    # now concatenate the centroids onto the previous ones
    new_centroid_trackings = np.concatenate((new_centroid_trackings, np.array([[centroid_mouse1_x, centroid_mouse1_y, centroid_mouse2_x, centroid_mouse2_y, centroid_mouse3_x, centroid_mouse3_y, centroid_mouse4_x, centroid_mouse4_y]])), axis=0)

This takes around 90 seconds for all the rows.

What I need: a NumPy (or other) solution that takes at most 5 seconds for all the rows.

Answer 1

Score: 1

You have two main issues that slow down the calculation inside the loop:

  • Multi-level columns: every lookup has to go through each column level in turn, which takes longer. I solved this by flattening the multi-level columns of the dataframe into single-level columns:

    tracking_coords.columns = ['_'.join(w) for w in tracking_coords.columns.values]
    
  • Array concatenation: the last line of the loop concatenates arrays, which is expensive because every call copies the whole array built so far. If you know the shape of the final array beforehand, it is highly recommended to preallocate it (new_centroid_trackings). I solved it with:

    new_centroid_trackings = np.zeros((len(tracking_coords), 8))
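
The two changes can be tried on a small synthetic frame before running them on the real file. The sketch below is only an illustration: the scorer name "scorer", the single mouse, the two body parts and the 5 frames are made-up stand-ins, not the contents of the linked h5 file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 5 frames, one mouse, two body parts.
columns = pd.MultiIndex.from_product(
    [["scorer"], ["mouse1"], ["tail1", "tail2"], ["x", "y"]]
)
df = pd.DataFrame(np.arange(20, dtype=float).reshape(5, 4), columns=columns)

# Change 1: flatten the four column levels into one string per column.
df.columns = ["_".join(levels) for levels in df.columns.values]
print(list(df.columns))

# Change 2: allocate the result once instead of concatenating in every iteration.
midpoints = np.zeros((len(df), 2))
for frame in range(len(df)):
    x = (df.loc[frame, "scorer_mouse1_tail1_x"] + df.loc[frame, "scorer_mouse1_tail2_x"]) / 2
    y = (df.loc[frame, "scorer_mouse1_tail1_y"] + df.loc[frame, "scorer_mouse1_tail2_y"]) / 2
    midpoints[frame, :] = [x, y]  # write into the preallocated row

print(midpoints)
```

Writing into a preallocated row costs the same in every iteration, whereas np.concatenate has to copy all previously collected rows each time, so its total cost grows roughly quadratically with the number of frames.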
    

With these two changes, the entire loop finishes in less than 5 seconds. I changed the column names inside the loop slightly to match the new flattened names.

The entire code:

import pandas as pd
import numpy as np

filename = "file.h5"  # the h5 file linked above
model_name = "DLC_resnet50_4mice_new_video_no_wheelFeb17shuffle1_220000"  # not relevant for the task
tracking_coords = pd.read_hdf(filename)  # read in the data
# flatten the multi-level columns into single strings joined by "_"
tracking_coords.columns = ['_'.join(w) for w in tracking_coords.columns.values]
# preallocate one row per frame: x and y centroid for each of the 4 mice
new_centroid_trackings = np.zeros((len(tracking_coords), 8))
for frame in range(tracking_coords.shape[0]):
    centroid_mouse1_x = (tracking_coords.loc[frame, model_name + "_mouse1_tail1_x"] + tracking_coords.loc[frame, model_name + "_mouse1_tail2_x"] + tracking_coords.loc[frame, model_name + "_mouse1_tail3_x"]) / 3
    centroid_mouse1_y = (tracking_coords.loc[frame, model_name + "_mouse1_tail1_y"] + tracking_coords.loc[frame, model_name + "_mouse1_tail2_y"] + tracking_coords.loc[frame, model_name + "_mouse1_tail3_y"]) / 3
    centroid_mouse2_x = (tracking_coords.loc[frame, model_name + "_mouse2_tail1_x"] + tracking_coords.loc[frame, model_name + "_mouse2_tail2_x"] + tracking_coords.loc[frame, model_name + "_mouse2_tail3_x"]) / 3
    centroid_mouse2_y = (tracking_coords.loc[frame, model_name + "_mouse2_tail1_y"] + tracking_coords.loc[frame, model_name + "_mouse2_tail2_y"] + tracking_coords.loc[frame, model_name + "_mouse2_tail3_y"]) / 3
    centroid_mouse3_x = (tracking_coords.loc[frame, model_name + "_mouse3_tail1_x"] + tracking_coords.loc[frame, model_name + "_mouse3_tail2_x"] + tracking_coords.loc[frame, model_name + "_mouse3_tail3_x"]) / 3
    centroid_mouse3_y = (tracking_coords.loc[frame, model_name + "_mouse3_tail1_y"] + tracking_coords.loc[frame, model_name + "_mouse3_tail2_y"] + tracking_coords.loc[frame, model_name + "_mouse3_tail3_y"]) / 3
    centroid_mouse4_x = (tracking_coords.loc[frame, model_name + "_mouse4_tail1_x"] + tracking_coords.loc[frame, model_name + "_mouse4_tail2_x"] + tracking_coords.loc[frame, model_name + "_mouse4_tail3_x"]) / 3
    centroid_mouse4_y = (tracking_coords.loc[frame, model_name + "_mouse4_tail1_y"] + tracking_coords.loc[frame, model_name + "_mouse4_tail2_y"] + tracking_coords.loc[frame, model_name + "_mouse4_tail3_y"]) / 3
    # write this frame's centroids into the preallocated row
    new_centroid_trackings[frame, :] = [centroid_mouse1_x, centroid_mouse1_y, centroid_mouse2_x, centroid_mouse2_y, centroid_mouse3_x, centroid_mouse3_y, centroid_mouse4_x, centroid_mouse4_y]
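
Since the question asks for a NumPy solution for millions of rows, the per-frame loop can probably be dropped altogether. The following is a minimal sketch of that idea and is not part of the original answer. It assumes the columns have the four levels implied by the question's indexing (model name, mouse, body part, coordinate) and that the columns have not been flattened. The helper name tail_centroids, the scorer name "scorer" and the random data are hypothetical; with the real file, tracking_coords would come from pd.read_hdf(filename) instead.

```python
import numpy as np
import pandas as pd


def tail_centroids(tracking_coords, mice, parts=("tail1", "tail2", "tail3")):
    """Return an (n_frames, 2 * len(mice)) array: x and y centroid per mouse."""
    # Assumption: the first column level holds a single model (scorer) name.
    scorer = tracking_coords.columns.get_level_values(0)[0]
    out = np.empty((len(tracking_coords), 2 * len(mice)))
    for i, mouse in enumerate(mice):
        for j, coord in enumerate(("x", "y")):
            cols = [(scorer, mouse, part, coord) for part in parts]
            # Average the selected columns over all frames at once.
            out[:, 2 * i + j] = tracking_coords[cols].to_numpy().mean(axis=1)
    return out


# Synthetic stand-in for the h5 contents (hypothetical values and scorer name).
rng = np.random.default_rng(0)
mice = ["mouse1", "mouse2", "mouse3", "mouse4"]
columns = pd.MultiIndex.from_product(
    [["scorer"], mice, ["tail1", "tail2", "tail3"], ["x", "y", "likelihood"]]
)
tracking_coords = pd.DataFrame(rng.random((9000, len(columns))), columns=columns)

centroids = tail_centroids(tracking_coords, mice)
print(centroids.shape)  # (9000, 8)
```

The only Python-level loop left runs over 4 mice and 2 coordinates, so the amount of interpreted work no longer depends on the number of frames. The column order of the result matches new_centroid_trackings in the code above (mouse1 x, mouse1 y, mouse2 x, and so on).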

huangapple
  • Posted on February 24, 2023, 17:24:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75554753.html