Accelerating AI/Deep learning models and optimising server resources using TensorRT and Triton Inference Server
22 January 2021
Written by Hung Minh Nguyen, AI-Data Engineer
Model Serving problem
AI/Deep learning models have come to dominate many services that previously required human intervention. However, bringing AI services online raises many issues: model maintainability, efficiency, security, and reusability are common problems in production deployment. Firstly, researchers tend to use different frameworks (Pytorch, Tensorflow, MXNet) and various SOTA networks (Resnet, Transformer, RNN...) to achieve the highest possible accuracy, and this variety causes maintainability troubles for engineers during model integration. Secondly, an end-to-end AI service is most of the time a combination of several models, and when services need to be deployed multiple times to meet service requirements (such as QPS and latency), this often leads to GPU usage efficiency and model reusability issues. In this blog, we will discuss how to solve these problems using TensorRT and Triton Inference Server (Triton), and how we serve our AI/deep learning models in production.
TensorRT is NVIDIA's high-performance inference SDK built on CUDA. It optimises inference time and GPU usage for AI/Deep learning models trained in other frameworks (Pytorch, TensorFlow, MXNet). TensorRT also supports reduced-precision inference (FP16 and INT8), which can further cut latency.
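As a quick illustration of what FP16 mode trades away, here is a pure-Python round trip through IEEE 754 half precision using the standard `struct` module (this only sketches the precision effect; it is not how TensorRT performs the conversion):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack('e', struct.pack('e', x))[0]

print(to_fp16(1.0))  # 1.0 – exactly representable
print(to_fp16(0.1))  # 0.0999755859375 – small rounding error
```

For many vision and NLP models this small loss of precision has a negligible effect on accuracy, which is why FP16 is usually the first optimisation to try.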
TensorRT and Pytorch benchmark (batch size 32)
Here we compared the inference time and GPU memory usage of model inference between Pytorch and TensorRT (smaller is better). TensorRT outperformed Pytorch on both metrics. We used a DGX V100 server to run this benchmark.
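A typical path to a TensorRT engine is to export the trained model to ONNX (e.g. with `torch.onnx.export`) and then build a serialized engine with NVIDIA's `trtexec` tool. A sketch of the build step, assuming the input tensor is named `input` and using batch size 32 as in the benchmark:

```shell
# Build a serialized TensorRT engine from an ONNX export.
# --fp16 enables half-precision; --shapes fixes the input shape
# (batch 32, 3x224x224 image) for the benchmark configuration.
trtexec --onnx=model.onnx --saveEngine=model.plan --fp16 \
        --shapes=input:32x3x224x224
```

The resulting `model.plan` file is what gets placed in the Triton model repository.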
Triton Inference Server
There are a couple of available model-serving solutions, including TFServing and TorchServe (still experimental at this time). Each of them, however, only focuses on Tensorflow or Pytorch model deployment, respectively. We have chosen Triton, developed by NVIDIA, as our model-serving solution due to the features below:
* Multi-framework support: Triton can host/deploy models from different frameworks (TensorRT, ONNX, Pytorch, TensorFlow) and provides a standard model inference API, which eases the maintainability effort for engineers.
* Dynamic batching inference.
* Model inference in parallel (concurrency): different deployed model instances can run in parallel.
* Model reusability and microservice: different clients/services can share a single model.
* Model repository: model files can be stored on the cloud (AWS S3, Google Cloud Storage) or a local file system. Even while a model is running on Triton, Triton can still load a new model or a new configuration updated from the model repository. This also provides better model security and a better model upgrade mechanism.
* Server monitoring: statistics on GPU usage and requests are provided in Prometheus data format.
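As a sketch, the model repository is just a directory tree with one folder per model, numeric version subdirectories, and a `config.pbtxt` describing the model (the model name, tensor names, and shapes below are hypothetical, not our production values):

```
# Layout: model_repository/text_detection/config.pbtxt
#         model_repository/text_detection/1/model.plan   <- version 1 engine
name: "text_detection"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]
```

Pointing Triton at this directory (local or on cloud storage) is all that is needed to serve the model; adding a `2/` version directory later lets Triton pick up an upgraded engine.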
Two main features that significantly optimise GPU usage are batching inference and concurrency.
With Triton, individual requests are batched and executed together, which improves throughput remarkably at only a small latency cost. As we can see below, comparing batch size 8 to batch size 1, QPS improves four times while only a little extra latency (from less than 1ms to less than 2ms) needs to be paid. Another point to note is that QPS stops improving once the batch size reaches a threshold, while latency keeps increasing. In our example, even when we increase the batch size beyond 16, QPS stays at about 14,000 while latency keeps growing with batch size.
* `preferred_batch_size`: the batch sizes the inference server should attempt to form before dispatching data to the model.
* `max_queue_delay_microseconds`: changes the batching behaviour based on the frequency of incoming requests. If enough new requests arrive for the inference server to form a preferred batch size, the batch is immediately sent to the model for execution. If no preferred batch size can be formed within `max_queue_delay_microseconds`, the server executes the batch anyway, even though it is not a preferred size.
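In `config.pbtxt`, these two settings live under the `dynamic_batching` block (the values here are illustrative, not our production tuning):

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

Tuning these two values is the main lever for trading a little latency for throughput, as the benchmark above shows.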
For end-to-end services, concurrent model execution means that different requests can be processed in parallel, because one request does not wait for another to complete the entire pipeline first. This feature allows us to maximise GPU usage on production servers.
Concurrency in OCR service
As shown in the figure above, different incoming requests can be processed at the same time in the OCR pipeline, which contains three models (Card Detection --> Text Detection --> Text Recognition). Some requests will be at the card detection model, while others are already at the text detection or text recognition model. All three models can run simultaneously on the GPU, so the GPU is utilised very well when hosting multiple models on Triton.
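The effect can be sketched with a small simulation: three stages with made-up fixed timings, processed either one request at a time or with requests allowed to overlap. (This is only an analogy; in Triton the overlap comes from concurrent model instances on the GPU, not Python threads.)

```python
import time
from concurrent.futures import ThreadPoolExecutor

STAGE_TIME = 0.05  # hypothetical per-stage latency in seconds

# The three OCR stages, stubbed out as fixed delays.
def card_detection(req): time.sleep(STAGE_TIME); return req
def text_detection(req): time.sleep(STAGE_TIME); return req
def text_recognition(req): time.sleep(STAGE_TIME); return req

def run_pipeline(req):
    return text_recognition(text_detection(card_detection(req)))

requests = list(range(4))

# Sequential: each request waits for the previous one to finish all stages.
start = time.perf_counter()
for r in requests:
    run_pipeline(r)
sequential = time.perf_counter() - start

# Overlapped: requests run concurrently, so at any moment different
# requests can occupy different stages of the pipeline.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(requests)) as pool:
    results = list(pool.map(run_pipeline, requests))
overlapped = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, overlapped: {overlapped:.2f}s")
```

With four requests and three stages, the sequential run takes roughly 12 stage-times while the overlapped run takes far fewer, which is the same reason pipeline concurrency raises GPU utilisation.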
How much TensorRT and Triton accelerate our AI services
To demonstrate the efficiency of services using Triton as model serving over services without model serving, we ran several benchmarks comparing QPS and GPU usage, as shown below:
End-to-end services benchmark on server T4
By using TensorRT and Triton, we improved QPS almost four times for all of our services: OCR QPS improved from 7 to almost 31, and Face Comparison QPS increased from 6 to 19, while using a similar or smaller amount of GPU memory.
As for latency, the overall latency of services served with TensorRT and Triton has improved by about 20%, according to the monitoring of our cloud services.
In this blog, we have mainly covered the efficiency aspect of using TensorRT and Triton as model serving for our services. There are still a number of aspects we have not discussed, such as comparing Tensorflow performance with TensorRT (we believe Tensorflow performance should not be much different from Pytorch's), CPU and RAM utilisation when using Triton, and how to convert trained models to TensorRT and then maintain, reuse, and secure those models with Triton in production. As we are also building internal tools to achieve these goals, we hope to write more blogs about them.