Pre-allocating dynamic shaped tensor memory for ONNX Runtime inference?
# Question
I am currently trying out onnxruntime-gpu, and I wish to perform pre-processing of images on the GPU using NVIDIA DALI. Everything works correctly and I am able to pre-process my images, but I wish to keep all of the data on the device to avoid data-transfer bottlenecks between device and host.

The onnxruntime library provides IO bindings to bind inputs and outputs to the device. The problem is that this is very static, which makes it hard to pre-allocate memory for output tensors of varying shapes. For example, I am using a RetinaNet model that produces differently sized predictions, which I can't seem to handle.
For pre-processing, I use the following code:
from nvidia.dali import ops, types
from nvidia.dali.pipeline import Pipeline


class ImagePipeline(Pipeline):
    def __init__(self, file_list, batch_size, num_threads, device_id):
        super(ImagePipeline, self).__init__(batch_size, num_threads, device_id)
        self.input = ops.readers.File(file_root="", file_list=file_list)
        # "mixed" decoding places the decoded images in GPU memory
        self.decode = ops.decoders.Image(device="mixed", output_type=types.RGB)
        self.resize = ops.Resize(device="gpu", resize_x=800, resize_y=800)
        self.normalize = ops.CropMirrorNormalize(
            device="gpu",
            dtype=types.FLOAT,
            output_layout=types.NCHW,
            crop=(800, 800),
            mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
            std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
        )

    def define_graph(self):
        inputs, labels = self.input()
        images = self.decode(inputs)
        images = self.resize(images)
        images = self.normalize(images)
        return images, labels
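For reference, a minimal sketch of how a pipeline like this can be built, run, and copied into a torch CUDA tensor via DALI's PyTorch plugin. The file name, batch size, and buffer handling here are illustrative assumptions, not part of the original code:

```python
# Illustrative only: build the pipeline above and move one batch into a torch CUDA tensor.
import torch
from nvidia.dali.plugin.pytorch import feed_ndarray

BATCH_SIZE = 4  # placeholder value

pipe = ImagePipeline(file_list="file_list.txt", batch_size=BATCH_SIZE,
                     num_threads=2, device_id=0)
pipe.build()

images, labels = pipe.run()        # images is a TensorListGPU; the image data stays on the GPU
dali_batch = images.as_tensor()    # contiguous (BATCH_SIZE, 3, 800, 800) tensor on the GPU

x = torch.empty(tuple(dali_batch.shape()), dtype=torch.float32, device="cuda")
feed_ndarray(dali_batch, x)        # GPU-to-GPU copy into the torch buffer
```

The resulting `x` can then be handed to the inference function shown next without ever touching host memory.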
The pipeline above correctly creates batches of 800×800 images (shape (BATCH_SIZE, 3, 800, 800) in NCHW layout). For running inference with these batches, I use the following snippet:
def run_with_torch_tensors_on_device(x: torch.Tensor, CURR_SIZE: int, torch_type: torch.dtype = torch.float) -> torch.Tensor:
    binding = session.io_binding()
    x_tensor = x.contiguous()
    z_tensor = torch.zeros(CURR_SIZE, 4, dtype=torch_type, device=DEVICE).contiguous()

    binding.bind_input(
        name=session.get_inputs()[0].name,
        device_type=DEVICE_NAME,
        device_id=DEVICE_INDEX,
        element_type=np.float32,
        shape=tuple(x_tensor.shape),
        buffer_ptr=x_tensor.data_ptr())

    binding.bind_output(
        name=session.get_outputs()[0].name,
        device_type=DEVICE_NAME,
        device_id=DEVICE_INDEX,
        element_type=np.int64,
        shape=tuple(x_tensor.shape),
        buffer_ptr=z_tensor.data_ptr())

    session.run_with_iobinding(binding)
    return z_tensor.squeeze(0)
This is where the problem occurs. I cannot create correctly shaped z_tensors. I use the pre-trained RetinaNet from https://pytorch.org/vision/main/models/generated/torchvision.models.detection.retinanet_resnet50_fpn_v2.html#torchvision.models.detection.retinanet_resnet50_fpn_v2.
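Inspecting the graph outputs shows why a fixed-size buffer cannot be pre-allocated: the detection axes are symbolic. The snippet below is a minimal illustration; the printed shapes are only an example, not actual output from my model:

```python
# Print the output names and (possibly dynamic) shapes declared in the ONNX graph.
# Symbolic dimensions show up as None or strings, so their size is unknown until run time.
for out in session.get_outputs():
    print(out.name, out.shape, out.type)
# e.g. boxes [None, 4], scores [None], labels [None]   (illustrative output)
```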
I have found a workaround, which is the following:
def run_with_data_on_device(x):
    x_ortvalue = ort.OrtValue.ortvalue_from_numpy(x)
    io_binding = session.io_binding()
    io_binding.bind_input(name=session.get_inputs()[0].name, device_type=x_ortvalue.device_name(), device_id=0, element_type=x.dtype, shape=x_ortvalue.shape(), buffer_ptr=x_ortvalue.data_ptr())
    io_binding.bind_output(name=session.get_outputs()[-1].name, device_type=DEVICE_NAME, device_id=DEVICE_INDEX, element_type=x.dtype, shape=x_ortvalue.shape())
    session.run_with_iobinding(io_binding)
    z = io_binding.get_outputs()
    return z[0]
But this naturally causes the problem of doing a round-trip to the host, which is unnecessary. Am I overlooking something obvious? Why can I not initialize the z_tensor as (None, None) and have a dynamically shaped output tensor?
UPDATED CODE:
def run_with_torch_tensors_on_device(x: torch.Tensor, CURR_SIZE: int, torch_type: torch.dtype = torch.float) -> torch.Tensor:
    binding = session.io_binding()
    x_tensor = x.contiguous()
    z_tensor = torch.zeros((CURR_SIZE, 91), dtype=torch_type, device=DEVICE).contiguous()

    binding.bind_input(
        name=session.get_inputs()[0].name,
        device_type=DEVICE_NAME,
        device_id=DEVICE_INDEX,
        element_type=np.float32,
        buffer_ptr=x_tensor.data_ptr(),
        shape=x_tensor.shape)

    binding.bind_output(session.get_outputs()[-1].name, "cuda")

    session.run_with_iobinding(binding)
    ort_output = binding.get_outputs()
    return ort_output[0]
However, this returns: `<onnxruntime.capi.onnxruntime_inference_collection.OrtValue object at 0x7f237bf1ebc0>`
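For completeness, data can be read back from the returned `OrtValue` with the documented `OrtValue`/`IOBinding` methods; a minimal sketch using the `binding` and `ort_output` names from the snippet above:

```python
out_value = ort_output[0]          # OrtValue still resident on the GPU
print(out_value.shape())           # the actual, dynamically determined output shape
print(out_value.device_name())     # e.g. "cuda"
print(out_value.data_ptr())        # raw device pointer, usable for further GPU-side work

host_copies = binding.copy_outputs_to_cpu()  # numpy arrays, copied to the host only when needed
```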
# Answer 1
**Score**: 2
The IO Binding API is documented here:
https://onnxruntime.ai/docs/api/python/api_summary.html#iobinding

You can bind an output by name only, since the other parameters are optional. If you do, the output memory is allocated by onnxruntime, which handles the case of dynamic output shapes. `get_outputs()` returns the resulting OrtValues on the device, and `copy_outputs_to_cpu()` copies the data to the CPU.

There are also many examples on that page; see the first example in the "Data on device" section:
https://onnxruntime.ai/docs/api/python/api_summary.html#data-on-device
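A minimal sketch of that approach, assuming an existing `session` created with the CUDA execution provider and a contiguous CUDA input tensor `x_tensor` as in the question:

```python
import numpy as np

binding = session.io_binding()

# Bind the input buffer that already lives on the GPU.
binding.bind_input(
    name=session.get_inputs()[0].name,
    device_type="cuda",
    device_id=0,
    element_type=np.float32,
    shape=tuple(x_tensor.shape),
    buffer_ptr=x_tensor.data_ptr(),
)

# Bind each output by name only; onnxruntime allocates correctly sized
# device memory at run time, so no pre-sized z_tensor is required.
for output in session.get_outputs():
    binding.bind_output(output.name, device_type="cuda", device_id=0)

session.run_with_iobinding(binding)

gpu_outputs = binding.get_outputs()          # OrtValues still on the GPU
cpu_outputs = binding.copy_outputs_to_cpu()  # numpy copies, only if a host copy is wanted
```

Each returned OrtValue's `shape()` then reports the dynamically determined output shape, and its `data_ptr()` can be used to keep post-processing on the device.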