By Adrien Payong and Shaoni Mukherjee
Machine learning frameworks, model tools, and deployment strategies each have distinct use cases in a machine learning (ML) pipeline, and each has its own strengths and weaknesses across model development, training, optimization, deployment, and inference.
In this article, we compare five such tools and related technologies in detail: PyTorch, TensorFlow, LiteRT (formerly known as TensorFlow Lite), TensorRT, and ONNX. We review their features, strengths, weaknesses, and their roles in a machine learning pipeline. We also discuss how these tools work together, describe common deployment patterns, and include code snippets to illustrate the concepts.
PyTorch is an open-source deep learning framework based on a dynamic computation graph (define-by-run) and a Pythonic programming interface. PyTorch is centered on the idea of flexibility (models can be written and debugged in a natural way, as they are normal Python code).
While PyTorch is inherently dynamic, its highly optimized C++ backend and tensor libraries (such as cuDNN for GPUs) mean that it can still achieve performance comparable to, or better than, static-graph frameworks. Over time, PyTorch has evolved to support research, development, and production deployment, with features such as TorchScript (a way to serialize models for optimized inference in C++ or mobile environments) and TorchServe for model serving.
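For instance, a model can be converted to TorchScript and saved as a self-contained artifact in just a few lines. The sketch below is illustrative only: the two-layer network and file name are placeholders rather than part of a specific pipeline.
import torch
import torch.nn as nn
# A small placeholder model to illustrate TorchScript export
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
model.eval()
# Trace the model with a dummy input to record its computation graph
example_input = torch.randn(1, 784)
scripted = torch.jit.trace(model, example_input)
# The serialized module can be loaded from C++ (libtorch) or mobile runtimes
scripted.save("model_traced.pt")
# It can also be reloaded and executed directly from Python
reloaded = torch.jit.load("model_traced.pt")
print(reloaded(example_input).shape)  # torch.Size([1, 10])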
Below is an overview of where PyTorch excels and where trade-offs remain.
PyTorch is primarily used in the model development and training stages of the lifecycle. In the deployment stage, you can also use PyTorch for inference, for example by running the model on a server (optionally wrapped in a web service such as Flask, or served with TorchServe), or by converting it to a lighter-weight representation for deployment.
In this code, we build a simple fully connected neural network and train it using PyTorch’s nn.Module and optimization APIs. Thanks to PyTorch’s dynamic graph, the training loop is plain Python: each iteration simply executes the model’s forward pass, computes the loss, and backpropagates. After training, we save the model’s weights. We can later either reload the model in PyTorch or export it to ONNX for use with other runtimes (we will see this later in this guide).
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
# Define a simple neural network
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = x.view(-1, 784)  # Flatten images
        x = self.relu(self.fc1(x))
        return self.fc2(x)
# Prepare the training dataset and DataLoader
transform = transforms.Compose([transforms.ToTensor()])
train_dataset = datasets.MNIST(root="./data", train=True, transform=transform, download=True)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
# Model setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleNet().to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# Training loop
for epoch in range(5):
    for batch_x, batch_y in train_loader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        optimizer.zero_grad()
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
print("Training completed.")
torch.save(model.state_dict(), "model_weights.pth")
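To illustrate the reload path mentioned above, here is a minimal sketch that restores the saved weights and runs a single batch of MNIST test images through the model in evaluation mode. It reuses the SimpleNet class, device, and transform defined earlier.
# Reload the trained weights into a fresh model instance
inference_model = SimpleNet().to(device)
inference_model.load_state_dict(torch.load("model_weights.pth", map_location=device))
inference_model.eval()
# Run inference on a batch from the MNIST test split
test_dataset = datasets.MNIST(root="./data", train=False, transform=transform, download=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
with torch.no_grad():
    images, labels = next(iter(test_loader))
    logits = inference_model(images.to(device))
    predictions = logits.argmax(dim=1)
    accuracy = (predictions.cpu() == labels).float().mean().item()
    print(f"Batch accuracy: {accuracy:.3f}")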
TensorFlow is another popular deep learning framework, originally developed at Google. Its early versions introduced the concept of static computation graphs, where a model’s computation graph is first defined and then executed. This enabled whole-model optimization and efficient deployment on diverse platforms, at the cost of some flexibility and user-friendliness. In response to community feedback (and competitive pressure from PyTorch), TensorFlow 2.x adopted eager execution (more dynamic, similar to PyTorch) by default, while still letting users opt into static-graph optimizations via the tf.function decorator and the XLA compiler.
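To make this concrete, here is a minimal sketch of wrapping a computation in tf.function, with jit_compile=True requesting XLA compilation where supported; the function itself is just an illustrative placeholder.
import tensorflow as tf
# An ordinary Python/TensorFlow function (runs eagerly by default)
def dense_step(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)
# Wrapping it in tf.function lets TensorFlow trace and optimize a static graph;
# jit_compile=True additionally requests XLA compilation where supported
compiled_step = tf.function(dense_step, jit_compile=True)
x = tf.random.normal([32, 784])
w = tf.random.normal([784, 128])
b = tf.zeros([128])
print(compiled_step(x, w, b).shape)  # (32, 128)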
TensorFlow also supports Keras as its high-level API for model definition. It provides a rich ecosystem of tools for production. You can use it to deploy models on servers (TensorFlow Serving), on mobile/embedded devices (TensorFlow Lite, now LiteRT), in JavaScript (TensorFlow.js), or on specialized hardware (Google’s TPUs).
TensorFlow is widely used in industry for large-scale training and production deployments. Its ecosystem provides an end-to-end solution, but it comes with its own trade-offs for developers.
We can develop and train a model using TensorFlow, typically through the Keras API, export the model, and then deploy it.
The exported model can be used for inference in a server environment (using TensorFlow Serving or the TensorFlow C++ API). For edge devices (e.g., mobile, IoT), you can convert the model to LiteRT (TFLite) format and run inference on the device itself. TensorFlow also integrates with TensorRT to accelerate GPU inference (TF-TRT): one reported case achieved 2.4× the inference throughput of native TensorFlow GPU execution for a ResNet-50 model on an NVIDIA T4 GPU.
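The sketch below shows what the TF-TRT path can look like, assuming you already have a SavedModel directory (named "model_saved" here, as produced later in the LiteRT example) and a machine with an NVIDIA GPU plus the TensorRT libraries installed; exact converter arguments vary across TensorFlow versions.
from tensorflow.python.compiler.tensorrt import trt_convert as trt
# Replace supported subgraphs of the SavedModel with TensorRT engines
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="model_saved",
    precision_mode=trt.TrtPrecisionMode.FP16,  # or FP32 / INT8
)
converter.convert()
# The result loads like a regular SavedModel, but TensorRT-optimized
# segments execute on the GPU
converter.save("model_saved_trt")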
In the snippet below, we build a simple feed-forward network for image classification using Keras. We compile and train the model on x_train and y_train (prepared MNIST data in this example), and then save it. The saved model file (“model_saved.keras”) can be loaded later for inference or converted for deployment. TensorFlow’s high-level API spares us from constructing the low-level graph ourselves, yet under the hood TensorFlow can still optimize the computation graph for performance in production.
import tensorflow as tf
# Load MNIST dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# Normalize the pixel values to [0, 1]
x_train = x_train / 255.0
x_test = x_test / 255.0
# Define a simple model using Keras (e.g., for MNIST classification)
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])
# Compile the model with optimizer, loss, and metrics
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)
# Train the model on training data
model.fit(x_train, y_train, epochs=5, batch_size=32)
# Evaluate on test data
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print(f"\nTest accuracy: {test_acc:.4f}")
# Save the trained model to disk (native Keras format)
model.save("model_saved.keras")
LiteRT (Lite Runtime) is essentially a lightweight inference engine that started as TensorFlow Lite. LiteRT enables pre-trained models to run on resource-constrained devices (mobile phones, tablets, IoT and edge devices, and microcontrollers).
As originally released, it was primarily designed to support TensorFlow-authored models. However, Google’s AI Edge team has since extended LiteRT to work with models authored in other frameworks: conversion tools take a model from PyTorch, JAX, or TensorFlow and convert it into the FlatBuffers-based .tflite format.
LiteRT is designed for on-device inference on mobile, embedded, and edge workloads. Here is a brief overview of where it excels—and where the trade-offs come into play:
The pipeline can be described as follows: train a model in TensorFlow/PyTorch, then convert the trained model to .tflite format using the relevant converter. Finally, deploy that .tflite file within a mobile app or on an embedded device using the LiteRT runtime.
During conversion, you can apply optimizations such as quantization or pruning. This model will then be executed as part of your software using the LiteRT interpreter (which you can use with many languages – e.g., Java/Kotlin on Android, Swift on iOS, C++ for native, or Python for rapid prototyping).
It generally delivers significantly better on-device performance than running a full framework runtime on the device. In one benchmark on a Samsung S21, a baseline image classification model ran at ~23 ms per inference using ~89 MB of memory with TensorFlow Lite, compared with ~31 ms (112 MB) for ONNX Runtime and ~38 ms (126 MB) for PyTorch Mobile. This reflects LiteRT’s focus on low-latency, low-memory execution on mobile devices.
This example shows how to convert a TensorFlow model trained in Python to LiteRT format and execute inference.
# MNIST → SavedModel → LiteRT (TFLite) → inference pipeline
import tensorflow as tf
import numpy as np
print("TF version:", tf.__version__)
# 1) Load & prep MNIST
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# Normalize to [0,1]
x_train = (x_train / 255.0).astype("float32")
x_test = (x_test / 255.0).astype("float32")
# 2) Define & train a simple Keras model
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10)  # logits
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, batch_size=128, validation_split=0.1, verbose=1)
# Quick test set eval
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
print(f"TF model test accuracy: {test_acc:.4f}")
# 3) Export a TensorFlow SavedModel (needed for TFLite conversion)
# In TF 2.15+, prefer model.export(). If not available, fallback to tf.saved_model.save
if hasattr(model, "export"):
model.export("model_saved") # TF ≥ 2.15
else:
tf.saved_model.save(model, "model_saved") # Older TF fallback
# 4) Convert to LiteRT/TFLite (FP32 with default optimizations)
converter = tf.lite.TFLiteConverter.from_saved_model("model_saved")
converter.optimizations = [tf.lite.Optimize.DEFAULT] # dynamic range quantization if weights permit
tflite_model = converter.convert()
with open("model_fp32.tflite", "wb") as f:
f.write(tflite_model)
print("Wrote model_fp32.tflite")
# --- OPTIONAL: Full INT8 quantization with representative dataset ---
do_full_int8 = True
if do_full_int8:
    def rep_data():
        # Yield a few hundred samples to calibrate ranges
        for i in range(500):
            # TFLite expects a batch dimension
            yield [np.expand_dims(x_train[i], 0)]

    converter_int8 = tf.lite.TFLiteConverter.from_saved_model("model_saved")
    converter_int8.optimizations = [tf.lite.Optimize.DEFAULT]
    converter_int8.representative_dataset = rep_data
    # Force int8 I/O where supported
    converter_int8.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter_int8.inference_input_type = tf.int8
    converter_int8.inference_output_type = tf.int8
    try:
        tflite_int8 = converter_int8.convert()
        with open("model_int8.tflite", "wb") as f:
            f.write(tflite_int8)
        print("Wrote model_int8.tflite")
    except Exception as e:
        print("INT8 conversion fell back / failed:", e)
        tflite_int8 = None
# 5) Run inference with the TFLite Interpreter (FP32 model)
from tensorflow import lite as tflite
def tflite_predict(tflite_path, image_28x28):
    interpreter = tflite.Interpreter(model_path=tflite_path)
    interpreter.allocate_tensors()
    in_details = interpreter.get_input_details()
    out_details = interpreter.get_output_details()
    inp = image_28x28
    # Match dtype & shape expected by the model
    if in_details[0]["dtype"] == np.float32:
        inp = inp.astype(np.float32)
    elif in_details[0]["dtype"] == np.int8:
        # Quantized model expects int8; apply quantization params
        scale, zero_point = in_details[0]["quantization"]
        if scale == 0:
            # Safety: if no scale provided (rare), just cast
            inp = inp.astype(np.int8)
        else:
            inp = (inp / scale + zero_point).round().astype(np.int8)
    # Add batch dimension
    inp = np.expand_dims(inp, 0)
    interpreter.set_tensor(in_details[0]["index"], inp)
    interpreter.invoke()
    out = interpreter.get_tensor(out_details[0]["index"])
    # If output is int8, dequantize back to float for softmax/argmax
    if out_details[0]["dtype"] == np.int8:
        scale, zero_point = out_details[0]["quantization"]
        if scale != 0:
            out = (out.astype(np.float32) - zero_point) * scale
    # Convert logits to probabilities and pick class
    probs = tf.nn.softmax(out, axis=-1).numpy()[0]
    pred = int(np.argmax(probs))
    conf = float(probs[pred])
    return pred, conf
# Test on a few MNIST samples with FP32 model
for idx in [0, 1, 2]:
    pred, conf = tflite_predict("model_fp32.tflite", x_test[idx])
    print(f"[FP32] Sample {idx}: pred={pred}, conf={conf:.3f}, true={y_test[idx]}")
# If INT8 model exists, test it as well
if 'tflite_int8' in locals() and tflite_int8 is not None:
    for idx in [0, 1, 2]:
        pred, conf = tflite_predict("model_int8.tflite", x_test[idx])
        print(f"[INT8] Sample {idx}: pred={pred}, conf={conf:.3f}, true={y_test[idx]}")
The code first loads and normalizes the MNIST dataset, defines and trains a small fully connected network, and evaluates its accuracy. The trained model is then saved as a SavedModel, which is the input to the TFLiteConverter. The converter generates two models: the default FP32 LiteRT model and (optionally) a fully quantized INT8 model that uses a representative dataset to calibrate value ranges. Finally, the code defines a helper function, tflite_predict(), which loads a .tflite file, prepares and quantizes/dequantizes the data as needed, executes inference, and returns the predicted digit and its confidence. A few test samples are passed through both the FP32 and INT8 models to confirm correct deployment and show example outputs.
NVIDIA TensorRT is an SDK and runtime for low-latency and high-throughput deployment of neural networks on NVIDIA GPUs. TensorRT can be thought of as a deep learning model compiler: you supply a trained model (usually in ONNX or a framework-specific format) and it performs a series of optimizations to output an optimized inference engine that runs on the GPU.
These optimizations include layer and tensor fusion, reduced-precision execution (FP16/INT8 with calibration), kernel auto-tuning for the target GPU, and efficient memory management for intermediate activations.
The final output is a highly optimized binary that can execute the model’s forward pass significantly faster than standard implementations.
This section details TensorRT’s standout features and the trade-offs you’ll need to manage.
TensorRT is used only at inference time. For example, you train a model with PyTorch or TensorFlow on a GPU, then export it to ONNX (TensorRT supports ONNX as a standard input format). You then use TensorRT APIs or utilities to build an engine from that ONNX model (calibrating with sample data if you use INT8 quantization) and run that engine in your application.
The engine can then be loaded in a C++ server application, or through the Python bindings for smaller-scale setups. In high-throughput scenarios (think serving millions of queries in production), it’s very common to run TensorRT engines inside NVIDIA’s Triton Inference Server, which handles multiple models and concurrency.
Below is a simplified (pseudo-code) example using TensorRT’s Python API for building an engine from an ONNX model and performing inference. This provides an idea of the overall workflow, without API details.
import tensorrt as trt
onnx_file = "model.onnx"
engine_file = "model.plan" # TensorRT engine file
# Set up TensorRT logger and builder
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(flags=1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
# Parse the ONNX model to populate the TensorRT network
with open(onnx_file, "rb") as f:
    parser.parse(f.read())
# (In practice, check parser.num_errors / parser.get_error() for unsupported ops here.)
# Configure builder
config = builder.create_builder_config()
config.max_workspace_size = 1 << 30  # 1 GB workspace for optimization
config.set_flag(trt.BuilderFlag.FP16)  # enable FP16 precision if supported
# (Newer TensorRT releases replace max_workspace_size and build_engine with
#  config.set_memory_pool_limit and builder.build_serialized_network.)
# Build the TensorRT engine
engine = builder.build_engine(network, config)
with open(engine_file, "wb") as f:
f.write(engine.serialize()) # save the engine to file
# Use the engine for inference
runtime = trt.Runtime(logger)
with open(engine_file, "rb") as f:
engine_bytes = f.read()
engine = runtime.deserialize_cuda_engine(engine_bytes)
context = engine.create_execution_context()
# Assuming a single input and single output for simplicity
input_shape = engine.get_binding_shape(0)
output_shape = engine.get_binding_shape(1)
# Allocate device memory for inputs and outputs (using PyCUDA or similar)
# ... (omitted for brevity)
# Execute inference
context.execute_v2(bindings=[d_input_ptr, d_output_ptr])
# Copy results from device memory to host and use the output
The code above outlines the steps required to convert an ONNX model to a TensorRT engine. We start by constructing a Builder and an OnnxParser, which are used to read the model graph. Next, some builder configuration is set (such as the workspace size and enabling FP16 support). The build_engine call performs the TensorRT optimization passes and outputs an engine (serialized to “model.plan”). We can later deserialize the engine and create an execution context that can be used to run inference.
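The device-memory handling that the snippet omits typically looks something like the sketch below, which uses PyCUDA to allocate buffers and copy data to and from the GPU; the float32 dtype and the reuse of the shapes queried above are assumptions for illustration.
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # initializes a CUDA context
# Host-side buffers (assuming float32 input/output and the shapes queried above)
h_input = np.random.randn(*tuple(input_shape)).astype(np.float32)
h_output = np.empty(tuple(output_shape), dtype=np.float32)
# Device-side buffers
d_input_ptr = cuda.mem_alloc(h_input.nbytes)
d_output_ptr = cuda.mem_alloc(h_output.nbytes)
# Copy input to the GPU, run the engine, and copy the result back
cuda.memcpy_htod(d_input_ptr, h_input)
context.execute_v2(bindings=[int(d_input_ptr), int(d_output_ptr)])
cuda.memcpy_dtoh(h_output, d_output_ptr)
print("Output shape:", h_output.shape)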
To summarize, using TensorRT for inference involves the following steps: parse the trained ONNX model into a TensorRT network, configure the builder (precision flags, workspace size), build and serialize the engine, deserialize the engine at runtime, create an execution context, allocate input/output buffers on the GPU, and execute the context to obtain predictions.
In real applications, one would also handle dynamic shapes or multiple bindings, but this requires a much deeper dive into TensorRT code. In practice, most developers use a high-level wrapper library or ONNX Runtime with TensorRT backend to avoid writing the low-level code themselves.
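For example, ONNX Runtime can hand execution to TensorRT through its execution-provider mechanism, so you never touch the builder API directly. The sketch below assumes a "model.onnx" file, an ONNX Runtime build that includes the TensorRT execution provider, and an illustrative 1×1×28×28 input.
import numpy as np
import onnxruntime as ort
# Providers are tried in order: TensorRT first, then CUDA, then CPU as a fallback
session = ort.InferenceSession(
    "model.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name
dummy = np.random.randn(1, 1, 28, 28).astype(np.float32)  # assumed input shape
outputs = session.run(None, {input_name: dummy})
print("Output shape:", outputs[0].shape)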
ONNX (Open Neural Network Exchange) is not a training framework; it is an open format for representing ML models, accompanied by a runtime (ONNX Runtime) for executing them.
You can train a model in one framework (e.g., PyTorch or TensorFlow), export it to ONNX format (a computational graph with standard operations), then run it using another tool or even a different hardware backend. The decoupling of frameworks from runtime is very powerful in a production setting, where you can choose the best framework for development, then pick the best runtime for deployment.
The summary below covers areas where the stack excels and its current limitations.
ONNX often sits in the middle of the pipeline. A common scenario is to train in PyTorch and then use torch.onnx.export to export the model to model.onnx. We can then take that ONNX model and deploy it in a production service (using ONNX Runtime from C++ for efficiency, or from Python if that’s suitable).
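Here is a minimal sketch of that handoff, assuming the SimpleNet class and the "model_weights.pth" file from the PyTorch example earlier; the opset version and tensor names are illustrative.
import numpy as np
import torch
import onnxruntime as ort
# Recreate the model and load the trained weights
model = SimpleNet()
model.load_state_dict(torch.load("model_weights.pth", map_location="cpu"))
model.eval()
# Export to ONNX with a dummy input that matches the expected shape
dummy_input = torch.randn(1, 1, 28, 28)
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["logits"],
                  opset_version=17)
# Run the exported model with ONNX Runtime on CPU
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
logits = session.run(None, {"input": dummy_input.numpy()})[0]
print("Predicted digit:", int(np.argmax(logits)))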
Let’s consider that you are working with a TensorFlow model but would like to use TensorRT without going through TensorFlow’s integration. In that case, you could convert the TF model to ONNX and then hand it to TensorRT (since TensorRT also has native ONNX support).
Similarly, ONNX is used in model compression and quantization workflows. For example, you can export a model to ONNX and then perform post-training quantization on it via the tooling provided by ONNX Runtime.
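For instance, a dynamic-range post-training quantization pass with ONNX Runtime’s quantization tooling can be as short as the sketch below; the file names are placeholders.
from onnxruntime.quantization import quantize_dynamic, QuantType
# Quantize weights to INT8 while keeping activations in float (dynamic range quantization)
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
)
print("Wrote model_int8.onnx")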
End-to-end interoperability may be the most crucial consideration for experienced users. None of the tools covered here is used in isolation in a real-world end-to-end pipeline; in practice, you select two or three of them and combine them to meet your training and deployment needs. The table below describes some typical use cases and workflows, and how the pieces come together:
| Pipeline | Target / Environment | Key Steps |
| --- | --- | --- |
| PyTorch → ONNX → TensorRT (GPU Deployment) | NVIDIA GPU servers / edge with CUDA | Train in PyTorch. Export to ONNX (choose opset, simplify graph). Build TensorRT engine (FP16/INT8, profile). Deploy engine & serve requests. |
| TensorFlow → LiteRT (Mobile Deployment) | Android & iOS (on-device) | Train in TF/Keras. Convert with the LiteRT (TFLite) converter to .tflite. Bundle in app; enable delegates. (Optional) QAT via TF Model Optimization. |
| PyTorch → LiteRT (Direct or via ONNX) | Android & iOS (on-device) | Train in PyTorch. Convert directly to .tflite or via ONNX + TF converter. Integrate in mobile apps. |
| PyTorch → ONNX → ONNX Runtime (CPU/GPU) | Windows, Linux, macOS, mobile | Train in the preferred framework. Export to ONNX. Run with ONNX Runtime (select provider per platform). |
| TensorFlow → TensorRT (TF-TRT or ONNX) | NVIDIA GPU servers | Option A: TF-TRT (graph parts replaced by TRT engines; TF Serving friendly). Option B: Export to ONNX → build TRT engine directly. |
PyTorch and TensorFlow are the “front ends” used to build models; ONNX is a common “in-between” format for transferring models between frameworks and runtimes; TensorRT and LiteRT are “end points”, each optimized for particular hardware (GPUs and edge devices, respectively).
PyTorch vs TensorFlow—when should I pick each? PyTorch is great for fast research iteration and Pythonic debugging; TensorFlow is better for end-to-end, production-grade ML pipelines (TFX, TF-Serving, TPU) and smoother enterprise operations.
What is LiteRT (formerly TFLite), and when do I use it? LiteRT is a lightweight, on-device inference runtime built for mobile/edge. Train your model with TensorFlow or PyTorch, convert to .tflite, and run with hardware delegates (NNAPI, Core ML, GPU) for low-latency, low-power inference.
How do ONNX and TensorRT work together? Export your trained model to ONNX, then use TensorRT to compile it into a highly optimized engine for NVIDIA GPUs (FP16/INT8, kernel fusion). ONNX is the bridge; TensorRT is the GPU turbocharger.
Choose tools based on where you run and how you scale: PyTorch or TensorFlow to iterate and train; ONNX to decouple training and serving; TensorRT for best NVIDIA-GPU performance; and LiteRT for small, low-latency, on-device inference. Most winning stacks are multi-framework (e.g., PyTorch → ONNX → TensorRT for GPU serving or TensorFlow → LiteRT for mobile) with a single exported artifact you can benchmark and ship.
A practical way to achieve this is through the DigitalOcean Gradient AI Platform. Boot up managed GPU notebooks, train, and deploy accelerated endpoints without managing infrastructure. This will enable you to use PyTorch/TensorFlow with ONNX, TensorRT, or LiteRT, all in a single streamlined workflow.