Pre-allocating dynamic shaped tensor memory for ONNX Runtime inference?
# Question
I am currently trying out onnxruntime-gpu, and I wish to perform pre-processing of images on the GPU using NVIDIA DALI. Everything works correctly and I am able to pre-process my images, but I wish to keep all of the data on the device to avoid data-transfer bottlenecks between device and host.

The onnxruntime library provides IO bindings to bind inputs and outputs to the device. The problem is that this is very static, which makes it hard to pre-allocate memory for output tensors of varying shapes. For example, I am using a RetinaNet model that produces differently sized predictions, which I can't seem to handle.
For pre-processing, I use the following code:
from nvidia.dali import ops, types
from nvidia.dali.pipeline import Pipeline


class ImagePipeline(Pipeline):
    def __init__(self, file_list, batch_size, num_threads, device_id):
        super(ImagePipeline, self).__init__(batch_size, num_threads, device_id)
        self.input = ops.readers.File(file_root="", file_list=file_list)
        # "mixed" decoding places the decoded images in GPU memory
        self.decode = ops.decoders.Image(device="mixed", output_type=types.RGB)
        self.resize = ops.Resize(device="gpu", resize_x=800, resize_y=800)
        self.normalize = ops.CropMirrorNormalize(
            device="gpu",
            dtype=types.FLOAT,
            output_layout=types.NCHW,
            crop=(800, 800),
            mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
            std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
        )

    def define_graph(self):
        inputs, labels = self.input()
        images = self.decode(inputs)
        images = self.resize(images)
        images = self.normalize(images)
        return images, labels
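For reference, a minimal sketch of how a pipeline like this can be built, run, and copied into a torch CUDA tensor via DALI's PyTorch plugin. The file name, batch size, and buffer handling here are illustrative assumptions, not part of the original code:

```python
# Illustrative only: build the pipeline above and move one batch into a torch CUDA tensor.
import torch
from nvidia.dali.plugin.pytorch import feed_ndarray

BATCH_SIZE = 4  # placeholder value

pipe = ImagePipeline(file_list="file_list.txt", batch_size=BATCH_SIZE,
                     num_threads=2, device_id=0)
pipe.build()

images, labels = pipe.run()        # images is a TensorListGPU; the image data stays on the GPU
dali_batch = images.as_tensor()    # contiguous (BATCH_SIZE, 3, 800, 800) tensor on the GPU

x = torch.empty(tuple(dali_batch.shape()), dtype=torch.float32, device="cuda")
feed_ndarray(dali_batch, x)        # GPU-to-GPU copy into the torch buffer
```

The resulting `x` can then be handed to the inference function shown next without ever touching host memory.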
The pipeline above correctly creates batches of 800×800 images (shape (BATCH_SIZE, 3, 800, 800) in NCHW layout). For running inference with these batches, I use the following snippet:
def run_with_torch_tensors_on_device(x: torch.Tensor, CURR_SIZE: int, torch_type: torch.dtype = torch.float) -> torch.Tensor:
    binding = session.io_binding()
    x_tensor = x.contiguous()
    z_tensor = torch.zeros(CURR_SIZE, 4, dtype=torch_type, device=DEVICE).contiguous()

    binding.bind_input(
        name=session.get_inputs()[0].name,
        device_type=DEVICE_NAME,
        device_id=DEVICE_INDEX,
        element_type=np.float32,
        shape=tuple(x_tensor.shape),
        buffer_ptr=x_tensor.data_ptr())

    binding.bind_output(
        name=session.get_outputs()[0].name,
        device_type=DEVICE_NAME,
        device_id=DEVICE_INDEX,
        element_type=np.int64,
        shape=tuple(x_tensor.shape),
        buffer_ptr=z_tensor.data_ptr())

    session.run_with_iobinding(binding)
    return z_tensor.squeeze(0)
This is where the problem occurs. I cannot create correctly shaped z_tensors. I use the pre-trained RetinaNet from https://pytorch.org/vision/main/models/generated/torchvision.models.detection.retinanet_resnet50_fpn_v2.html#torchvision.models.detection.retinanet_resnet50_fpn_v2.
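Inspecting the graph outputs shows why a fixed-size buffer cannot be pre-allocated: the detection axes are symbolic. The snippet below is a minimal illustration; the printed shapes are only an example, not actual output from my model:

```python
# Print the output names and (possibly dynamic) shapes declared in the ONNX graph.
# Symbolic dimensions show up as None or strings, so their size is unknown until run time.
for out in session.get_outputs():
    print(out.name, out.shape, out.type)
# e.g. boxes [None, 4], scores [None], labels [None]   (illustrative output)
```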
I have found a workaround, which is the following:
def run_with_data_on_device(x):
    x_ortvalue = ort.OrtValue.ortvalue_from_numpy(x)
    io_binding = session.io_binding()
    io_binding.bind_input(name=session.get_inputs()[0].name, device_type=x_ortvalue.device_name(), device_id=0, element_type=x.dtype, shape=x_ortvalue.shape(), buffer_ptr=x_ortvalue.data_ptr())
    io_binding.bind_output(name=session.get_outputs()[-1].name, device_type=DEVICE_NAME, device_id=DEVICE_INDEX, element_type=x.dtype, shape=x_ortvalue.shape())
    session.run_with_iobinding(io_binding)
    z = io_binding.get_outputs()
    return z[0]
But this naturally causes the problem of doing a round-trip to the host, which is unnecessary. Am I overlooking something obvious? Why can I not initialize the z_tensor as (None, None) and have a dynamically shaped output tensor?
UPDATED CODE:
def run_with_torch_tensors_on_device(x: torch.Tensor, CURR_SIZE: int, torch_type: torch.dtype = torch.float) -> torch.Tensor:
    binding = session.io_binding()
    x_tensor = x.contiguous()
    z_tensor = torch.zeros((CURR_SIZE, 91), dtype=torch_type, device=DEVICE).contiguous()

    binding.bind_input(
        name=session.get_inputs()[0].name,
        device_type=DEVICE_NAME,
        device_id=DEVICE_INDEX,
        element_type=np.float32,
        buffer_ptr=x_tensor.data_ptr(),
        shape=x_tensor.shape)

    binding.bind_output(session.get_outputs()[-1].name, "cuda")

    session.run_with_iobinding(binding)
    ort_output = binding.get_outputs()
    return ort_output[0]
However, this returns: `<onnxruntime.capi.onnxruntime_inference_collection.OrtValue object at 0x7f237bf1ebc0>`
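For completeness, data can be read back from the returned `OrtValue` with the documented `OrtValue`/`IOBinding` methods; a minimal sketch using the `binding` and `ort_output` names from the snippet above:

```python
out_value = ort_output[0]          # OrtValue still resident on the GPU
print(out_value.shape())           # the actual, dynamically determined output shape
print(out_value.device_name())     # e.g. "cuda"
print(out_value.data_ptr())        # raw device pointer, usable for further GPU-side work

host_copies = binding.copy_outputs_to_cpu()  # numpy arrays, copied to the host only when needed
```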
# Answer 1
**Score**: 2
The IO Binding API is documented here:
https://onnxruntime.ai/docs/api/python/api_summary.html#iobinding

You can bind an output by name only, since the other parameters are optional. If you do, the output memory is allocated by onnxruntime, which handles the case of dynamic output shapes. `get_outputs()` returns the resulting OrtValues on the device, and `copy_outputs_to_cpu()` copies the data to the CPU.

There are also many examples on that page; see the first example in the "Data on device" section:
https://onnxruntime.ai/docs/api/python/api_summary.html#data-on-device
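A minimal sketch of that approach, assuming an existing `session` created with the CUDA execution provider and a contiguous CUDA input tensor `x_tensor` as in the question:

```python
import numpy as np

binding = session.io_binding()

# Bind the input buffer that already lives on the GPU.
binding.bind_input(
    name=session.get_inputs()[0].name,
    device_type="cuda",
    device_id=0,
    element_type=np.float32,
    shape=tuple(x_tensor.shape),
    buffer_ptr=x_tensor.data_ptr(),
)

# Bind each output by name only; onnxruntime allocates correctly sized
# device memory at run time, so no pre-sized z_tensor is required.
for output in session.get_outputs():
    binding.bind_output(output.name, device_type="cuda", device_id=0)

session.run_with_iobinding(binding)

gpu_outputs = binding.get_outputs()          # OrtValues still on the GPU
cpu_outputs = binding.copy_outputs_to_cpu()  # numpy copies, only if a host copy is wanted
```

Each returned OrtValue's `shape()` then reports the dynamically determined output shape, and its `data_ptr()` can be used to keep post-processing on the device.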