Speeding Up Deep Learning Inference Using TensorFlow, ONNX, and NVIDIA TensorRT

This post was updated on July 20, 2021, to reflect NVIDIA TensorRT 8.0 updates.

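This post shows how to deploy a trained TensorFlow deep learning model using the new TensorFlow-ONNX-TensorRT workflow. The tutorial uses NVIDIA TensorRT 8.0.0.3 and provides two code samples, one for TensorFlow v1 and one for TensorFlow v2. TensorRT is an inference accelerator.

First, a network is trained using any framework. After the network is trained, the batch size and precision are fixed (with precision as FP32, FP16, or INT8). The trained model is passed to the TensorRT optimizer, which outputs an optimized runtime, also called a plan. The .plan file is a serialized file format of the TensorRT engine. The plan file must be deserialized before the TensorRT runtime can use it to run inference.

To optimize a model built in TensorFlow, you only need to convert the model to the ONNX format, and then use the ONNX parser in TensorRT to parse the model and build the TensorRT engine. Figure 1 shows the high-level ONNX workflow.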

Diagram of multiple tools sending input to ONNX, to the TensorRT Optimizer, which outputs a serialized plan file for use in the TensorRT Runtime. *Figure 1. ONNX workflow.*

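This post discusses how to create a TensorRT engine using the ONNX workflow and how to run inference from the TensorRT engine. More specifically, we demonstrate end-to-end inference from a model in Keras or TensorFlow to ONNX, and then to a TensorRT engine, with ResNet-50, semantic segmentation, and U-Net networks. Finally, we explain how you can use this workflow on other networks.

Download the code samples and unzip them. You can run either the TensorFlow 1 or the TensorFlow 2 code sample by following the corresponding README. After downloading the files, also download labels.py from the Cityscapes dataset scripts repository and place it in the same folder as the other scripts.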

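ONNX overview

ONNX is an open format for machine learning and deep learning models. It lets you convert deep learning and machine learning models from different frameworks, such as TensorFlow, PyTorch, MATLAB, Caffe, and Keras, to a single format.

It defines a common set of operators, a common set of deep learning building blocks, and a common file format. It provides definitions of a computation graph as well as built-in operators. A list of ONNX nodes, each of which may have one or more inputs and outputs, forms an acyclic graph.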
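To make the "list of nodes forming an acyclic graph" idea concrete, here is a minimal sketch in plain Python. The dictionary layout and the names `NODES` and `topological_order` are hypothetical stand-ins chosen for illustration; this is not the onnx library API. Tensors that no node produces are treated as graph inputs or weights.

```python
from collections import deque

# Hypothetical, simplified stand-in for an ONNX graph: each node lists the
# tensor names it consumes and produces. This is NOT the onnx library API.
NODES = [
    {"op": "Conv", "inputs": ["image", "w0"], "outputs": ["c0"]},
    {"op": "Relu", "inputs": ["c0"], "outputs": ["r0"]},
    {"op": "Add", "inputs": ["r0", "image"], "outputs": ["sum"]},
]


def topological_order(nodes):
    """Return node indices in an executable order; raise ValueError on a cycle."""
    # Map every tensor name to the index of the node that produces it.
    produced_by = {out: i for i, n in enumerate(nodes) for out in n["outputs"]}
    pending = []                                  # unmet dependencies per node
    consumers = {i: [] for i in range(len(nodes))}
    for i, n in enumerate(nodes):
        deps = {produced_by[t] for t in n["inputs"] if t in produced_by}
        pending.append(len(deps))
        for d in deps:
            consumers[d].append(i)

    # Kahn's algorithm: repeatedly emit nodes whose dependencies are all met.
    ready = deque(i for i, count in enumerate(pending) if count == 0)
    order = []
    while ready:
        i = ready.popleft()
        order.append(i)
        for c in consumers[i]:
            pending[c] -= 1
            if pending[c] == 0:
                ready.append(c)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order


print(topological_order(NODES))  # [0, 1, 2]
```

Because the graph is acyclic, an order like this always exists, which is what lets a converter or optimizer walk the nodes from inputs to outputs.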

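ResNet ONNX workflow example

In this example, we show how to use the ONNX workflow on two different networks and create a TensorRT engine. The first network is ResNet-50. The workflow consists of the following steps:

Convert the TensorFlow/Keras model to a .pb file.
Convert the .pb file to the ONNX format.
Create a TensorRT engine.
Run inference from the TensorRT engine.

Requirements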

#IF TensorFlow 1
pip install tensorflow-gpu==1.15.0 keras==2.3.1
#IF TensorFlow 2
pip install tensorflow-gpu==2.4.0 keras==2.4.3

#Other requirements
pip install -U keras2onnx tf2onnx==1.8.2 pillow pycuda scikit-image

#Check installation
python -c "import keras;print(keras.__version__)"

Convert the model to .pb

The first step is to convert the model to a .pb file. The following code example converts the ResNet-50 model to a .pb file:

import tensorflow as tf
import keras
from tensorflow.keras.models import Model
import keras.backend as K
K.set_learning_phase(0)

def keras_to_pb(model, output_filename, output_node_names):

   """
   This is the function to convert the Keras model to pb.

   Args:
      model: The Keras model.
      output_filename: The output .pb file name.
      output_node_names: The output nodes of the network. If None, then
      the function gets the last layer name as the output node.
   """

   # Get the names of the input and output nodes.
   in_name = model.layers[0].get_output_at(0).name.split(':')[0]

   if output_node_names is None:
       output_node_names = [model.layers[-1].get_output_at(0).name.split(':')[0]]

   sess = keras.backend.get_session()

   # convert_variables_to_constants expects a list of output node names.
   frozen_graph_def = tf.graph_util.convert_variables_to_constants(
       sess,
       sess.graph_def,
       output_node_names)

   sess.close()
   wkdir = ''
   tf.train.write_graph(frozen_graph_def, wkdir, output_filename, as_text=False)

   return in_name, output_node_names

# load the ResNet-50 model pretrained on imagenet
model = keras.applications.resnet.ResNet50(include_top=True, weights='imagenet', input_tensor=None, input_shape=None, pooling=None, classes=1000)

# Convert the Keras ResNet-50 model to a .pb file
in_tensor_name, out_tensor_names = keras_to_pb(model, "models/resnet50.pb", None)

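Besides Keras, you can also download ResNet-50 from the following locations:

Deep Learning Examples GitHub repository: Provides the latest deep learning example networks. You can also see the ResNet-50 branch, which contains a script and recipe to train the ResNet-50 v1.5 model.
NVIDIA NGC models: Lists checkpoints for pretrained models. For example, search for ResNet-50v1.5 for TensorFlow and get the latest checkpoint from the download page.

Convert the .pb file to ONNX

The second step is to convert the .pb model to the ONNX format. To do this, first install tf2onnx.

After installing tf2onnx, there are two ways to convert the model from a .pb file to the ONNX format: the command line and the Python API. To use the command line, run the following command: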

python -m tf2onnx.convert  --input /Path/to/resnet50.pb --inputs input_1:0 --outputs probs/Softmax:0 --output resnet50.onnx
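If you convert several models, it can help to build this command from a script. The following is a minimal sketch that only assembles the same command-line flags shown above (--input, --inputs, --outputs, --output); the helper name `tf2onnx_command` is hypothetical, and the sketch does not use the tf2onnx Python API.

```python
import shlex
import sys


def tf2onnx_command(pb_path, inputs, outputs, onnx_path):
    """Build the argument list for the tf2onnx command-line converter."""
    return [
        sys.executable, "-m", "tf2onnx.convert",
        "--input", pb_path,
        "--inputs", ",".join(inputs),      # comma-separated tensor names
        "--outputs", ",".join(outputs),
        "--output", onnx_path,
    ]


cmd = tf2onnx_command(
    "/Path/to/resnet50.pb", ["input_1:0"], ["probs/Softmax:0"], "resnet50.onnx"
)
print(shlex.join(cmd))
# To run the conversion, pass the list to subprocess.run(cmd, check=True).
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.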

Create the TensorRT engine from ONNX

To create the TensorRT engine from the ONNX file, run the following code:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)
def build_engine(onnx_path, shape = [1,224,224,3]):

   """
   This is the function to create the TensorRT engine
   Args:
      onnx_path : Path to onnx_file. 
      shape : Shape of the input of the ONNX file. 
  """
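   # create_network(1) sets the EXPLICIT_BATCH flag (1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)).
   with trt.Builder(TRT_LOGGER) as builder, builder.create_network(1) as network, builder.create_builder_config() as config, trt.OnnxParser(network, TRT_LOGGER) as parser:
       config.max_workspace_size = (256 << 20)
       with open(onnx_path, 'rb') as model:
           if not parser.parse(model.read()):
               # Print parser errors rather than building from a partial network.
               for i in range(parser.num_errors):
                   print(parser.get_error(i))
               return None
       network.get_input(0).shape = shape
       engine = builder.build_engine(network, config)
       return engine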

def save_engine(engine, file_name):
   buf = engine.serialize()
   with open(file_name, 'wb') as f:
       f.write(buf)
def load_engine(trt_runtime, plan_path):
   with open(plan_path, 'rb') as f:
       engine_data = f.read()
   engine = trt_runtime.deserialize_cuda_engine(engine_data)
   return engine

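Save this code in a file named engine.py; it is used later in this post.

This code example contains the following variable:

max_workspace_size: The maximum GPU temporary memory that the ICudaEngine can use at execution time.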
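The value assigned to max_workspace_size in build_engine is written as a bit shift, which can be hard to read at a glance. The short sketch below only illustrates the arithmetic; the helper name `mib` is hypothetical and not part of TensorRT.

```python
MIB = 1 << 20  # one mebibyte: 2**20 = 1,048,576 bytes

workspace_bytes = 256 << 20       # the value used in build_engine
print(workspace_bytes)            # 268435456
print(workspace_bytes // MIB)     # 256


def mib(n):
    """Return n mebibytes expressed in bytes."""
    return n << 20


# For example, a 1 GiB workspace would be mib(1024).
print(mib(1024))                  # 1073741824
```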

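The builder creates an empty network (builder.create_network(1)), and the ONNX parser parses the ONNX file into the network (parser.parse(model.read())). After you set the input shape for the network (network.get_input(0).shape = shape), the builder creates the engine (engine = builder.build_engine(network, config)). To create the engine, run the following code example: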

import engine as eng
from onnx import ModelProto

engine_name = "resnet50.plan"
onnx_path = "/path/to/onnx/result/file/"
batch_size = 1

model = ModelProto()
with open(onnx_path, "rb") as f:
   model.ParseFromString(f.read())

# dim[0] is the batch dimension; dims 1-3 are the per-image dimensions.
d0 = model.graph.input[0].type.tensor_type.shape.dim[1].dim_value
d1 = model.graph.input[0].type.tensor_type.shape.dim[2].dim_value
d2 = model.graph.input[0].type.tensor_type.shape.dim[3].dim_value
shape = [batch_size, d0, d1, d2]
engine = eng.build_engine(onnx_path, shape=shape)
eng.save_engine(engine, engine_name)

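In this code example, you first get the input shape from the ONNX model. Next, you create the engine and then save it in a .plan file.

Run inference from the TensorRT engine

The TensorRT engine runs inference in the following workflow:

Allocate buffers for inputs and outputs in the GPU.
Copy data from the host to the allocated input buffers in the GPU.
Run inference in the GPU.
Copy results from the GPU to the host.
Reshape the results as necessary.

These steps are explained in detail in the following code example. Save this code in a file named inference.py; it is used later in this post.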

import tensorrt as trt
import pycuda.driver as cuda
import numpy as np
import pycuda.autoinit 

def allocate_buffers(engine, batch_size, data_type):

   """
   This is the function to allocate buffers for input and output in the device
   Args:
      engine : The TensorRT engine (ICudaEngine).
      batch_size : The batch size for execution time.
      data_type: The type of the data for input and output, for example trt.float32.

   Output:
      h_input_1: Input in the host.
      d_input_1: Input in the device.
      h_output: Output in the host.
      d_output: Output in the device.
      stream: CUDA stream.

   """

   # Determine dimensions and create page-locked memory buffers (which won't be swapped to disk) to hold host inputs/outputs.
   h_input_1 = cuda.pagelocked_empty(batch_size * trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(data_type))
   h_output = cuda.pagelocked_empty(batch_size * trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype(data_type))
   # Allocate device memory for inputs and outputs.
   d_input_1 = cuda.mem_alloc(h_input_1.nbytes)

   d_output = cuda.mem_alloc(h_output.nbytes)
   # Create a stream in which to copy inputs/outputs and run inference.
   stream = cuda.Stream()
   return h_input_1, d_input_1, h_output, d_output, stream 

def load_images_to_buffer(pics, pagelocked_buffer):
   preprocessed = np.asarray(pics).ravel()
   np.copyto(pagelocked_buffer, preprocessed) 

def do_inference(engine, pics_1, h_input_1, d_input_1, h_output, d_output, stream, batch_size, height, width):
   """
   This is the function to run the inference
   Args:
      engine : The TensorRT engine (ICudaEngine)
      pics_1 : Input images to the model.
      h_input_1: Input in the host
      d_input_1: Input in the device
      h_output: Output in the host
      d_output: Output in the device
      stream: CUDA stream
      batch_size : Batch size for execution time
      height: Height of the output image
      width: Width of the output image
   
   Output:
      The list of output images

   """

   load_images_to_buffer(pics_1, h_input_1)

   with engine.create_execution_context() as context:
       # Transfer input data to the GPU.
       cuda.memcpy_htod_async(d_input_1, h_input_1, stream)

       # Run inference.

       context.profiler = trt.Profiler()
       context.execute(batch_size=1, bindings=[int(d_input_1), int(d_output)])

       # Transfer predictions back from the GPU.
       cuda.memcpy_dtoh_async(h_output, d_output, stream)
       # Synchronize the stream
       stream.synchronize()
       # Return the host output.
       out = h_output.reshape((batch_size,-1, height, width))
       return out

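The first two lines of allocate_buffers determine the dimensions of the input and output and create page-locked memory buffers in the host (h_input_1, h_output). Then, device memory of the same size as the host input and output is allocated for the input and output (d_input_1, d_output). The next step is to create a CUDA stream for copying data between the allocated memory on the device and the host.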
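The buffer sizes follow directly from the binding shapes: the number of elements is the product of the dimensions, and the byte count is that product times the element size. The sketch below reproduces this arithmetic in plain Python. The helper names are hypothetical, and the (1, 1000) output shape is an assumption based on the 1,000-class ImageNet ResNet-50 used earlier.

```python
import math


def volume(shape):
    """Number of elements in a tensor of the given shape."""
    return math.prod(shape)


def buffer_nbytes(shape, itemsize):
    """Bytes needed to hold a tensor of `shape` with `itemsize`-byte elements."""
    return volume(shape) * itemsize


FP32_BYTES = 4
input_shape = (1, 224, 224, 3)   # ResNet-50 input shape used in this post
output_shape = (1, 1000)         # assumed: one probability per ImageNet class

print(buffer_nbytes(input_shape, FP32_BYTES))   # 602112
print(buffer_nbytes(output_shape, FP32_BYTES))  # 4000
```

The host and device buffers for a binding must have the same byte size, which is why allocate_buffers passes the host buffer's nbytes to cuda.mem_alloc.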

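In the do_inference function of this code example, the first step is to load the images into the host buffers using the load_images_to_buffer function. Then, the input data is transferred to the GPU (cuda.memcpy_htod_async(d_input_1, h_input_1, stream)) and inference is run using context.execute. Finally, the results are copied from the GPU to the host (cuda.memcpy_dtoh_async(h_output, d_output, stream)).

Semantic segmentation ONNX workflow example

In Fast INT8 Inference for Autonomous Vehicles with TensorRT 3, the author covered the UFF workflow for a semantic segmentation model.

This post uses a similar network to run the ONNX workflow for semantic segmentation. The network consists of a VGG16-based encoder and three upsampling layers built with deconvolution layers. The network was trained for approximately 40,000 iterations on the Cityscapes dataset.

There are multiple ways to convert a TensorFlow model to an ONNX file. One is the way explained in the ResNet-50 section. Keras also has its own Keras-to-ONNX file converter. Sometimes the TensorFlow-to-ONNX converter does not support a layer that the Keras-to-ONNX converter does. Depending on the Keras framework and the types of layers used, you may need to choose between the converters.

The following code example uses the Keras-to-ONNX converter to convert the Keras model directly to ONNX. Download the pretrained semantic segmentation file, semantic_segmentation.hdf5.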

import keras
import tensorflow as tf
from keras2onnx import convert_keras

def keras_to_onnx(model, output_filename):
   onnx = convert_keras(model, output_filename)
   with open(output_filename, "wb") as f:
       f.write(onnx.SerializeToString())

semantic_model = keras.models.load_model('/path/to/semantic_segmentation.hdf5')
keras_to_onnx(semantic_model, 'semantic_segmentation.onnx')

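Figure 2 shows the architecture of the network.

As in the previous example, use the following code example to create the semantic segmentation engine.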

import engine as eng
from onnx import ModelProto
import tensorrt as trt


engine_name = 'semantic.plan'
onnx_path = "semantic_segmentation.onnx"  # the file written by keras_to_onnx
batch_size = 1

model = ModelProto()
with open(onnx_path, "rb") as f:
  model.ParseFromString(f.read())

d0 = model.graph.input[0].type.tensor_type.shape.dim[1].dim_value
d1 = model.graph.input[0].type.tensor_type.shape.dim[2].dim_value
d2 = model.graph.input[0].type.tensor_type.shape.dim[3].dim_value
shape = [batch_size , d0, d1 ,d2]
engine = eng.build_engine(onnx_path, shape= shape)
eng.save_engine(engine, engine_name)

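To test the output of the model, use the Cityscapes dataset. To use Cityscapes, you need the following functions: sub_mean_chw and color_map.

In the following code example, sub_mean_chw subtracts the mean value from the image as a preprocessing step, and color_map maps class IDs to colors. The latter is used for visualization.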

import numpy as np
from PIL import Image
import tensorrt as trt
import labels  # from cityscapes evaluation script
import skimage.transform

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)

MEAN = (71.60167789, 82.09696889, 72.30508881)
CLASSES = 20

HEIGHT = 512
WIDTH = 1024

def sub_mean_chw(data):
   data = data.transpose((1, 2, 0))  # CHW -> HWC
   data -= np.array(MEAN)  # Broadcast subtract
   data = data.transpose((2, 0, 1))  # HWC -> CHW
   return data

def rescale_image(image, output_shape, order=1):
   image = skimage.transform.resize(image, output_shape,
               order=order, preserve_range=True, mode='reflect')
   return image

def color_map(output):
   output = output.reshape(CLASSES, HEIGHT, WIDTH)
   out_col = np.zeros(shape=(HEIGHT, WIDTH), dtype=(np.uint8, 3))
   for x in range(WIDTH):
       for y in range(HEIGHT):

           if (np.argmax(output[:, y, x] )== 19):
               out_col[y,x] = (0, 0, 0)
           else:
               out_col[y, x] = labels.id2label[labels.trainId2label[np.argmax(output[:, y, x])].id].color
   return out_col
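The per-pixel loop in color_map is easy to follow but slow for a 512 x 1024 image. As an alternative, here is a minimal vectorized sketch using NumPy argmax and a palette lookup. The function name `color_map_vectorized` and the three-class palette are hypothetical; in practice the palette would be built from the Cityscapes labels module, with a black row for class 19 as in the loop version.

```python
import numpy as np


def color_map_vectorized(output, palette, classes, height, width):
    """Map per-class scores (flat or CHW) to an RGB image with a palette lookup."""
    scores = np.asarray(output).reshape(classes, height, width)
    class_ids = scores.argmax(axis=0)   # (height, width) array of class indices
    return palette[class_ids]           # fancy indexing -> (height, width, 3)


# Hypothetical 3-class palette, one RGB row per class ID.
palette = np.array([[128, 64, 128], [244, 35, 232], [0, 0, 0]], dtype=np.uint8)

# Tiny 2 x 2 example: set the winning class for each pixel.
scores = np.zeros((3, 2, 2), dtype=np.float32)
scores[0, 0, 0] = 1.0   # pixel (0, 0) -> class 0
scores[1, 0, 1] = 1.0   # pixel (0, 1) -> class 1
scores[2, 1, :] = 1.0   # bottom row   -> class 2

rgb = color_map_vectorized(scores.ravel(), palette, 3, 2, 2)
print(rgb.shape)  # (2, 2, 3)
```

The result has the same layout as the array returned by color_map, so it can be passed to Image.fromarray in the same way.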

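The following code example is the rest of the code for the previous example. Run the previous block first, because this code needs the functions and imports that it defines. The example compares the output of the Keras model with that of the TensorRT engine semantic .plan file and then visualizes both outputs. Replace the placeholders /path/to/semantic_segmentation.hdf5 and input_file_path as necessary.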

import engine as eng
import inference as inf
import keras
import tensorrt as trt 

input_file_path = 'munster_000172_000019_leftImg8bit.png'
onnx_file = "semantic.onnx"
serialized_plan_fp32 = "semantic.plan"
HEIGHT = 512
WIDTH = 1024

image = np.asarray(Image.open(input_file_path))
img = rescale_image(image, (512, 1024),order=1)
im = np.array(img, dtype=np.float32, order='C')
im = im.transpose((2, 0, 1))
im = sub_mean_chw(im)

engine = eng.load_engine(trt_runtime, serialized_plan_fp32)
h_input, d_input, h_output, d_output, stream = inf.allocate_buffers(engine, 1, trt.float32)
out = inf.do_inference(engine, im, h_input, d_input, h_output, d_output, stream, 1, HEIGHT, WIDTH)
out = color_map(out)

colorImage_trt = Image.fromarray(out.astype(np.uint8))
colorImage_trt.save("trt_output.png")

semantic_model = keras.models.load_model('/path/to/semantic_segmentation.hdf5')
out_keras= semantic_model.predict(im.reshape(-1, 3, HEIGHT, WIDTH))

out_keras = color_map(out_keras)
colorImage_k = Image.fromarray(out_keras.astype(np.uint8))
colorImage_k.save("keras_output.png")

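Figure 3 shows the actual image and the ground truth, as well as the output of Keras and the output of the TensorRT engine. As you can see, the output of the TensorRT engine is similar to that of Keras.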
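Visual comparison can be complemented with a number. The following is a minimal sketch, assuming you keep the raw score arrays from both models before they are passed to color_map; it reports the fraction of pixels for which both predict the same class. The function name `pixel_agreement` and the toy data are hypothetical.

```python
import numpy as np


def pixel_agreement(scores_a, scores_b, classes, height, width):
    """Fraction of pixels where two score volumes predict the same class."""
    labels_a = np.asarray(scores_a).reshape(classes, height, width).argmax(axis=0)
    labels_b = np.asarray(scores_b).reshape(classes, height, width).argmax(axis=0)
    return float((labels_a == labels_b).mean())


# Toy 2-class, 2 x 2 example: the two outputs disagree on one pixel out of four.
a = np.array([[[0.9, 0.2], [0.6, 0.1]],
              [[0.1, 0.8], [0.4, 0.9]]], dtype=np.float32)
b = a.copy()
b[0, 1, 0], b[1, 1, 0] = 0.45, 0.55   # flip the predicted class at pixel (1, 0)

print(pixel_agreement(a.ravel(), b.ravel(), 2, 2, 2))  # 0.75
```

Small differences between the two outputs are expected, because TensorRT may select different kernels and fuse layers.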

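Try it on other networks

Now you can try the ONNX workflow on other networks. For good examples of segmentation networks, see Segmentation models with pretrained backbones on GitHub.

As an example, we show how to use the ONNX workflow with another network. The network in this example is U-Net from the segmentation_models library. We only load the model and do not train it. You may need to train these models on your chosen dataset.

An important point about these networks is that when you load them, their input layer size is (None, None, None, 3). To create a TensorRT engine, you need an ONNX file with a known input size. Before converting the model to ONNX, change the network by assigning an input size, and then convert it to the ONNX format.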
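The idea of fixing the unknown dimensions can be sketched without any framework. The helper below is hypothetical (it is not part of Keras or segmentation_models) and assumes an NHWC shape tuple in which None marks an unknown dimension; it fills in the height and width and leaves the batch dimension untouched.

```python
def resolve_input_shape(declared, height, width):
    """Replace unknown (None) spatial dims of an NHWC shape; keep the batch dim."""
    if len(declared) != 4:
        raise ValueError("expected a 4-D NHWC shape")
    batch, h, w, c = declared
    if c is None:
        raise ValueError("channel dimension must be known")
    # Refuse to silently override a dimension that is already fixed.
    for name, old, new in (("height", h, height), ("width", w, width)):
        if old is not None and old != new:
            raise ValueError(f"{name} is already fixed to {old}")
    return (batch, height, width, c)


print(resolve_input_shape((None, None, None, 3), 224, 224))
# (None, 224, 224, 3)
```

The resulting tuple has the same form as the batch_input_shape value assigned to the first layer of the model before conversion.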

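For example, load the U-Net network from this library (segmentation_models) and assign the input size (224, 224, 3). After creating the TensorRT engine for inference, do a conversion similar to the one for semantic segmentation. Depending on the application and dataset, you may need a different color mapping.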

# Requirement for TensorFlow 2
pip install tensorflow-gpu==2.1.0

# Other requirements
pip install -U segmentation-models

import segmentation_models as sm
import keras
from keras2onnx import convert_keras
from engine import *
onnx_path = 'unet.onnx'
engine_name = 'unet.plan'
batch_size = 1
CHANNEL = 3
HEIGHT = 224
WIDTH = 224

model = sm.Unet()
model._layers[0].batch_input_shape = (None, 224,224,3)
model = keras.models.clone_model(model)

onx = convert_keras(model, onnx_path)
with open(onnx_path, "wb") as f:
  f.write(onx.SerializeToString())

shape = [batch_size , HEIGHT, WIDTH, CHANNEL]
engine = build_engine(onnx_path, shape= shape)
save_engine(engine, engine_name)

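As mentioned earlier in this post, another way to download pretrained models is from NVIDIA NGC models, which lists checkpoints for pretrained models. For example, you can search for UNet for TensorFlow and go to the download page to get the latest checkpoint.

Conclusion

This post presented several examples of deploying deep learning applications with the TensorFlow-ONNX-TensorRT workflow. The first example was ONNX-TensorRT on ResNet-50, and the second was a VGG16-based semantic segmentation network trained on the Cityscapes dataset. At the end of the post, we showed how to apply this workflow to other networks. To learn more about optimizing training and inference performance, see NVIDIA Data Center Deep Learning Product Performance.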