Pre-allocating dynamic shaped tensor memory for ONNX runtime inference?


I am currently trying out onnxruntime-gpu and I wish to perform pre-processing of images on the GPU using NVIDIA DALI. Everything works correctly and I am able to pre-process my images, but the problem is that I wish to keep all of the data on the device to avoid data transfer bottlenecks between the device and host.

The onnxruntime library allows IO bindings to bind inputs and outputs to the device. The problem is that this is incredibly static, which makes it challenging to pre-allocate memory for output tensors of varying shapes. For example, I am using a RetinaNet model that produces differently sized predictions, which I can't seem to handle.

For pre-processing, I use the following code:

import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline


class ImagePipeline(Pipeline):
    def __init__(self, file_list, batch_size, num_threads, device_id):
        super(ImagePipeline, self).__init__(batch_size, num_threads, device_id)
        self.input = ops.readers.File(file_root="", file_list=file_list)
        # "mixed" decodes on the GPU, so the images never leave the device
        self.decode = ops.decoders.Image(device="mixed", output_type=types.RGB)
        self.resize = ops.Resize(device="gpu", resize_x=800, resize_y=800)

        self.normalize = ops.CropMirrorNormalize(
            device="gpu",
            dtype=types.FLOAT,
            output_layout=types.NCHW,
            crop=(800, 800),
            mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
            std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
        )

    def define_graph(self):
        inputs, labels = self.input()
        images = self.decode(inputs)
        images = self.resize(images)
        images = self.normalize(images)
        return images, labels
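
As a usage note, a pipeline like this could be built and run, and its GPU batch handed straight to PyTorch, roughly as sketched below (this assumes DALI's PyTorch plugin; the batch size and file list are placeholders):

import torch
from nvidia.dali.plugin.pytorch import feed_ndarray

pipe = ImagePipeline(file_list="file_list.txt", batch_size=8,
                     num_threads=2, device_id=0)
pipe.build()

images, labels = pipe.run()  # TensorListGPU outputs, still on the device

# Copy the DALI batch into a pre-allocated torch CUDA tensor
batch = torch.empty((8, 3, 800, 800), dtype=torch.float32, device="cuda:0")
feed_ndarray(images.as_tensor(), batch)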

The pipeline correctly creates batches of (BATCH_SIZE, 3, 800, 800) images. For running inference with these batches, I use the function below.
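
First, a sketch of the globals the function assumes (the names match the snippet; the model filename is a placeholder):

import numpy as np
import torch
import onnxruntime as ort

DEVICE_NAME = "cuda"
DEVICE_INDEX = 0
DEVICE = torch.device(f"{DEVICE_NAME}:{DEVICE_INDEX}")

# "retinanet.onnx" stands in for the exported torchvision RetinaNet
session = ort.InferenceSession(
    "retinanet.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

With those in place, the inference function: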

def run_with_torch_tensors_on_device(x: torch.Tensor, CURR_SIZE: int, torch_type: torch.dtype = torch.float) -> torch.Tensor:
    binding = session.io_binding()
    x_tensor = x.contiguous()
    z_tensor = torch.zeros(CURR_SIZE, 4, dtype=torch_type, device=DEVICE).contiguous()

    binding.bind_input(
        name=session.get_inputs()[0].name,
        device_type=DEVICE_NAME,
        device_id=DEVICE_INDEX,
        element_type=np.float32,
        shape=tuple(x_tensor.shape),
        buffer_ptr=x_tensor.data_ptr())
    
    # NOTE: the output is bound with the *input's* shape because the real
    # output shape is unknown before the run; it also disagrees with the
    # (CURR_SIZE, 4) z_tensor buffer, which is exactly the problem below
    binding.bind_output(
        name=session.get_outputs()[0].name,
        device_type=DEVICE_NAME,
        device_id=DEVICE_INDEX,
        element_type=np.int64,
        shape=tuple(x_tensor.shape),
        buffer_ptr=z_tensor.data_ptr())

    session.run_with_iobinding(binding)

    return z_tensor.squeeze(0)

This is where the problem occurs: I cannot create correctly shaped z_tensors. I use the pre-trained RetinaNet from https://pytorch.org/vision/main/models/generated/torchvision.models.detection.retinanet_resnet50_fpn_v2.html#torchvision.models.detection.retinanet_resnet50_fpn_v2.

I have found a workaround, which is the following:

def run_with_data_on_device(x):
    x_ortvalue = ort.OrtValue.ortvalue_from_numpy(x)
    io_binding = session.io_binding()
    io_binding.bind_input(
        name=session.get_inputs()[0].name, device_type=x_ortvalue.device_name(),
        device_id=0, element_type=x.dtype, shape=x_ortvalue.shape(),
        buffer_ptr=x_ortvalue.data_ptr())
    io_binding.bind_output(
        name=session.get_outputs()[-1].name, device_type=DEVICE_NAME,
        device_id=DEVICE_INDEX, element_type=x.dtype, shape=x_ortvalue.shape())
    session.run_with_iobinding(io_binding)

    z = io_binding.get_outputs()

    return z[0]

But this naturally causes an unnecessary round-trip to the host. Am I overlooking something obvious? Why can I not initialize z_tensor with shape (None, None) and get a dynamically shaped output tensor?
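
As an aside, the dynamic output dimensions can be inspected straight from the session metadata; a quick check along these lines (the names and symbolic dims printed are whatever the exported model reports):

for out in session.get_outputs():
    # dynamic axes show up as None or as symbolic names (strings)
    print(out.name, out.shape, out.type)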

UPDATED CODE:

def run_with_torch_tensors_on_device(x: torch.Tensor, CURR_SIZE: int, torch_type: torch.dtype = torch.float) -> torch.Tensor:
    binding = session.io_binding()
    x_tensor = x.contiguous()
    z_tensor = torch.zeros((CURR_SIZE, 91), dtype=torch_type, device=DEVICE).contiguous()

    binding.bind_input(
        name=session.get_inputs()[0].name,
        device_type=DEVICE_NAME,
        device_id=DEVICE_INDEX,
        element_type=np.float32,
        buffer_ptr=x_tensor.data_ptr(),
        shape=x_tensor.shape)

    binding.bind_output(session.get_outputs()[-1].name, "cuda")

    session.run_with_iobinding(binding)

    ort_output = binding.get_outputs()  # OrtValue handles, not torch tensors
    return ort_output[0]

However, this returns `<onnxruntime.capi.onnxruntime_inference_collection.OrtValue object at 0x7f237bf1ebc0>` instead of a tensor I can use.



# Answer 1
**Score**: 2

The API of IO Binding:
https://onnxruntime.ai/docs/api/python/api_summary.html#iobinding

You can actually bind an output by name only, since the other parameters are optional. If you do, the memory is allocated by onnxruntime itself, which handles the case of a dynamic output shape.

`get_outputs()` returns OrtValues that live on the device, and `copy_outputs_to_cpu()` copies the data to the CPU.

There are also many examples on that page. See the first example in the "Data on device" section:
https://onnxruntime.ai/docs/api/python/api_summary.html#data-on-device
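To make that concrete, here is a minimal sketch of the pattern, reusing session, DEVICE_NAME, and DEVICE_INDEX from the question (the function name and the shape printout are illustrative):

def run_with_dynamic_output(x: torch.Tensor):
    binding = session.io_binding()
    x_tensor = x.contiguous()

    # The input buffer is ours, so it is bound with a full description
    binding.bind_input(
        name=session.get_inputs()[0].name,
        device_type=DEVICE_NAME,
        device_id=DEVICE_INDEX,
        element_type=np.float32,
        shape=tuple(x_tensor.shape),
        buffer_ptr=x_tensor.data_ptr())

    # Outputs are bound by name and device only: onnxruntime allocates
    # buffers of whatever shapes the model actually produces
    for out in session.get_outputs():
        binding.bind_output(out.name, DEVICE_NAME, DEVICE_INDEX)

    session.run_with_iobinding(binding)

    ort_outputs = binding.get_outputs()  # OrtValues, still on the device
    print(ort_outputs[0].shape())        # the dynamic shape, now resolved

    return binding.copy_outputs_to_cpu()  # numpy arrays, copied to the host

No output shape has to be guessed up front, and the data only crosses to the host when copy_outputs_to_cpu() is called; keep working with the OrtValues instead if the data should stay on the device.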


