If you have ever trained a model in PyTorch, hit production constraints on an embedded device, and then spent a week wrestling with framework-specific deployment tooling — you already understand the problem ONNX exists to solve.
ONNX is not just a model format. It is a deployment philosophy: decouple training from inference. Build in whatever framework fits your research workflow, then deploy anywhere the ONNX Runtime runs. In practice, that list is long enough to cover most production targets you will encounter: server-side CPU, NVIDIA GPU, Intel CPU with OpenVINO, ARM-based edge chips, iOS, Android, and the browser.
This post is a deep dive into ONNX — what it is, how its runtime works, what makes it genuinely versatile, and what I have learned from using it in real production systems.
What ONNX Actually Is
ONNX stands for Open Neural Network Exchange. It started as a collaboration between Facebook and Microsoft in 2017, motivated by a straightforward frustration: model authors needed to move models between frameworks (PyTorch, Caffe2, MXNet) without losing performance or rewriting inference code.
The format is a protobuf-serialized computational graph — a directed acyclic graph (DAG) where nodes are operators and edges carry tensor values. The ONNX specification defines a fixed set of operators (the opset) that covers the vast majority of operations used in modern neural networks: convolutions, attention mechanisms, normalization layers, non-linear activations, pooling, reshape, gather, scatter, and many more.
When you export a PyTorch model to ONNX, what you get is a .onnx file that encodes:
- The computation graph (which ops, in what order, with what connections)
- The operator versions (opset)
- The tensor shapes (static or dynamic)
- The learned weights (embedded or external)
This representation is framework-agnostic. ONNX Runtime does not know or care whether your model was trained in PyTorch, TensorFlow, Keras, or JAX. It sees only the graph and runs it.
The ONNX Ecosystem
Understanding ONNX requires understanding three distinct layers that often get conflated:
1. The ONNX Format (.onnx)
The serialization standard. This is what you export and distribute. It lives in the onnx/onnx repository and is maintained by a broad consortium of hardware and software vendors. If a tool claims ONNX support, it means it can read or write this format.
2. ONNX Runtime (ORT)
Microsoft's high-performance inference engine for ONNX models. This is a separate project from the format itself. ORT handles graph optimization, memory planning, operator kernel selection, and hardware dispatch. It is what you actually run in production.
3. Execution Providers (EPs)
The hardware abstraction layer inside ORT. Each EP is a plugin that dispatches supported operators to a specific hardware backend. If an operator is not supported by the active EP, ORT falls back to the CPU EP automatically.
The available EPs include:
| Execution Provider | Target Hardware |
|---|---|
| CPU EP | All platforms, default fallback |
| CUDA EP | NVIDIA GPUs |
| TensorRT EP | NVIDIA GPUs (layer-fused, FP16/INT8) |
| DirectML EP | Windows GPU (AMD, Intel, NVIDIA via DirectX) |
| OpenVINO EP | Intel CPUs, iGPUs, VPUs |
| CoreML EP | Apple Silicon and iOS Neural Engine |
| NNAPI EP | Android neural accelerators |
| ROCm EP | AMD GPUs |
| XNNPACK EP | ARM/x86 mobile CPUs (WebAssembly-friendly) |
| QNN EP | Qualcomm neural processors |
This table is where ONNX's versatility becomes concrete. You write one inference call — session.run(...) — and ORT routes computation to whatever hardware is available and supported, at whatever precision makes sense. No rewriting, no framework switching.
From Training to Deployment: The ONNX Conversion Pipeline
The typical path from a PyTorch model to a deployed ONNX artifact looks like this:
```python
import torch
import torch.onnx

model = MyModel()
model.load_state_dict(torch.load("weights.pt"))
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=17,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"},
    },
    do_constant_folding=True,
)
```
A few things matter here:
- `opset_version`: The ONNX operator specification is versioned. Higher opsets support more operators and have better coverage for newer architectures. For production, opset 17 or 18 is a safe target as of 2026.
- `dynamic_axes`: By default, ONNX export traces the model with fixed tensor shapes. If you need variable batch size or sequence length at runtime, declare those axes as dynamic. Missing this for a sequence model will produce a graph that silently fails or errors on inputs of unexpected length.
- `do_constant_folding=True`: Folds constant subexpressions at export time, reducing graph size and eliminating redundant computation before the model ever reaches the runtime.
Verifying the Export
Always validate the exported graph:
```python
import onnx

model_proto = onnx.load("model.onnx")
onnx.checker.check_model(model_proto)
print(onnx.helper.printable_graph(model_proto.graph))
```
And numerically validate outputs match the original:
```python
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_np = dummy_input.numpy()
ort_out = sess.run(None, {"input": input_np})
torch_out = model(dummy_input).detach().numpy()

np.testing.assert_allclose(torch_out, ort_out[0], rtol=1e-3, atol=1e-5)
```
Numerical drift after ONNX conversion is rare but happens, especially with custom ops or non-standard layers. Running this check before any optimization pass catches issues early.
Graph Optimization in ONNX Runtime
Once you have a valid .onnx file, ORT applies a multi-level graph optimization pipeline before running any inference:
```python
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

sess = ort.InferenceSession(
    "model.onnx",
    sess_options=sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```
The optimization levels are:
- ORT_DISABLE_ALL: No optimization. Useful for debugging graph correctness.
- ORT_ENABLE_BASIC: Constant folding, redundant node elimination, identity elimination.
- ORT_ENABLE_EXTENDED: Fuses common operator sequences — for example, BatchNorm+Relu, Conv+Bias+Relu, Attention+LayerNorm.
- ORT_ENABLE_ALL: All of the above plus layout optimization and hardware-specific graph rewrites.
You can also save the optimized graph to disk to skip re-optimization on every startup:
```python
sess_options.optimized_model_filepath = "model_optimized.onnx"
```
For production deployments where startup latency matters — edge devices, mobile apps, latency-sensitive APIs — pre-saving the optimized graph and loading it directly is worthwhile.
Quantization: Where the Real Speedups Come From
ONNX's quantization tooling is where most of the meaningful inference performance gains happen in practice. Quantization reduces weight and activation precision from FP32 to INT8 (or INT4), which compresses the model and enables integer arithmetic units on compatible hardware.
There are two approaches:
Post-Training Quantization (PTQ)
```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    "model.onnx",
    "model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```
Dynamic quantization applies only to weights; activations are quantized on the fly, per tensor, at runtime. It requires no calibration data and is the fastest path to a quantized model. It works well for transformer models, whose runtime is dominated by weight-heavy layers (linear projections, embeddings).
Static Quantization with Calibration
For more aggressive optimization, especially on edge hardware that requires fixed quantization parameters:
```python
from onnxruntime.quantization import quantize_static, CalibrationDataReader, QuantType

class MyCalibrationReader(CalibrationDataReader):
    def __init__(self, data_loader):
        self.data_loader = iter(data_loader)

    def get_next(self):
        try:
            batch = next(self.data_loader)
            return {"input": batch.numpy()}
        except StopIteration:
            return None

quantize_static(
    "model.onnx",
    "model_int8_static.onnx",
    calibration_data_reader=MyCalibrationReader(calibration_loader),
    weight_type=QuantType.QInt8,
)
```
Static quantization computes quantization parameters (scale and zero point) from representative data at calibration time. Activations are quantized using fixed parameters, which enables full INT8 pipeline on hardware that supports it — ONNX Runtime's QNN EP, TensorRT, and Android NNAPI can all leverage this for significant speedups.
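The parameter computation itself is simple enough to sketch in a few lines of NumPy. This is an illustrative per-tensor, asymmetric uint8 scheme of the kind a calibrator produces; the real tooling also supports symmetric, per-channel, and histogram-based variants:

```python
import numpy as np

# Per-tensor asymmetric uint8 calibration: derive scale and zero point
# from the observed activation range (the range must include zero).
def calibrate(activations: np.ndarray):
    lo = min(float(activations.min()), 0.0)
    hi = max(float(activations.max()), 0.0)
    scale = (hi - lo) / 255.0
    zero_point = int(round(-lo / scale))
    return scale, zero_point

def quantize(x, scale, zp):
    return np.clip(np.round(x / scale) + zp, 0, 255).astype(np.uint8)

def dequantize(q, scale, zp):
    return (q.astype(np.float32) - zp) * scale

# "Calibration data": stand-in for activations captured on representative inputs.
acts = np.random.default_rng(0).normal(0.0, 1.0, 10_000).astype(np.float32)
scale, zp = calibrate(acts)
err = np.abs(dequantize(quantize(acts, scale, zp), scale, zp) - acts).max()
print(err <= scale)  # True: round-trip error is bounded by one quantization step
```

This also makes the failure mode visible: a single outlier activation inflates the range, which inflates `scale` and coarsens every other value — why calibration data quality matters.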
In my work on medical image segmentation pipelines, the combination of graph optimization and INT8 quantization contributed substantially to bringing inference time from 2-3 minutes down to 4-10 seconds. The 95% reduction was not from any single technique — it was the stack: ONNX conversion, graph-level fusion, quantization, and appropriate execution provider selection working together.
ONNX Runtime on the Edge: Execution Provider Deep Dive
CPU EP: The Universal Baseline
Every deployment starts with CPU EP as the fallback. ORT's CPU kernels are not naive — they use MLAS (Microsoft Linear Algebra Subprograms), a custom BLAS-like library that exploits SIMD instruction sets (AVX2, AVX-512 on x86; NEON on ARM) through runtime dispatch.
For ARM-based edge devices (Raspberry Pi, Jetson Nano, custom embedded boards), the CPU EP with INT8 static quantization often delivers surprisingly capable performance, especially for smaller models. Combined with the XNNPACK EP — which uses Google's XNNPACK library, optimized for mobile and embedded ARM — you get an additional boost on supported operator subsets.
TensorRT EP: Maximum Throughput on NVIDIA Edge Hardware
On devices like the NVIDIA Jetson family (Orin, Xavier, Nano), the TensorRT EP is the right choice for maximum throughput. TensorRT performs layer fusion, kernel auto-tuning, and FP16/INT8 optimization at the graph level, often achieving 2-5x throughput improvement over the CUDA EP alone.
```python
providers = [
    (
        "TensorrtExecutionProvider",
        {
            "trt_max_workspace_size": 2147483648,  # 2 GB
            "trt_fp16_enable": True,
            "trt_engine_cache_enable": True,
            "trt_engine_cache_path": "./trt_cache",
        },
    ),
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]

sess = ort.InferenceSession("model.onnx", providers=providers)
```
The trt_engine_cache_enable flag is important for production: TensorRT's engine compilation is expensive (can take minutes on the first run). Caching the compiled engine to disk means subsequent startups are fast.
CoreML EP: Apple Silicon and iOS Neural Engine
For iOS deployment, the CoreML EP is the path to the Apple Neural Engine (ANE) — the dedicated ML accelerator in iPhone and iPad chips. ORT translates the ONNX graph to CoreML format internally, dispatching supported ops to the ANE while falling back to CPU for unsupported operators.
```python
providers = [
    (
        "CoreMLExecutionProvider",
        {
            "coreml_flags": 0,  # CPU_ONLY=1, ENABLE_ON_SUBGRAPH=2
        },
    ),
    "CPUExecutionProvider",
]
```
On an iPhone 15 Pro, the ANE can deliver substantial inference speedups for vision models compared to CPU, with much better power efficiency — critical for always-on health monitoring use cases like the posture prediction pipeline I worked on.
NNAPI EP: Android Acceleration
For Android deployment, the NNAPI EP routes supported operators to the Android Neural Networks API, which delegates to whatever hardware accelerator is available — a dedicated NPU, the GPU, or a DSP depending on the device SoC. This is relevant for models deployed through React Native or native Android apps.
The NNAPI EP requires static INT8 quantization for maximum hardware utilization. Models quantized with the static pipeline described above typically achieve the best acceleration.
ONNX in the Browser: onnxruntime-web
One aspect of ONNX's versatility that is easy to underestimate: it runs in the browser, via WebAssembly and WebGL backends.
```javascript
import * as ort from 'onnxruntime-web';

const session = await ort.InferenceSession.create('model.onnx', {
  executionProviders: ['webgl'],
});

const inputTensor = new ort.Tensor('float32', inputData, [1, 3, 224, 224]);
const results = await session.run({ input: inputTensor });
```
The webgl backend runs operators as WebGL shaders where GPU acceleration is available, with the wasm (CPU) backend as fallback. A newer webgpu backend — targeting the WebGPU API — is now available in Chromium-based browsers and delivers significantly better performance for compute-heavy models.
This matters for edge-adjacent deployments: running a vision model client-side in a web browser means no server round-trip, better user privacy, and offline capability. The constraint is model size and the performance ceiling of browser sandboxing, but for models under ~50 MB with INT8 quantization, this is a practical deployment target.
Real Experience: What ONNX Made Possible
In my work deploying AI systems in healthcare, ONNX was repeatedly the practical bridge between a model that worked in a research environment and one that shipped in a product.
Medical image segmentation: A TotalSegmentator-based pipeline running in a cloud-connected radiology viewer. The original PyTorch inference was too slow for interactive use. Exporting to ONNX, applying graph optimization with ORT_ENABLE_ALL, and running with the TensorRT EP on a server-side GPU brought segmentation time from minutes to seconds — a difference that directly affected whether radiologists would actually use the tool.
Real-time posture monitoring on mobile: A CNN model consuming IMU sensor data from a wearable device, running locally inside a React Native app via OnnxRuntime React Native bindings. The model ran entirely on-device, with predictions streamed to a cloud backend for workflow orchestration. ONNX provided the path from a PyTorch-trained model to a mobile runtime without needing to rebuild the model in TensorFlow Lite or rewrite inference in CoreML's native SDK.
LLM serving at the edge: Smaller language models like Whisper and quantized variants of instruction-tuned models exported to ONNX and served via ORT on edge servers. The key advantage here was the ability to use INT4/INT8 quantization via tools like Olive (Microsoft's optimization toolkit for ORT) to reduce memory footprint enough to run on devices without large VRAM.
The pattern across these cases is consistent: ONNX was not the only thing that mattered, but it was the enabling layer that made the other optimizations composable.
The ONNX Optimization Toolkit: Beyond Base ORT
Several tools build on ONNX Runtime and extend its optimization capabilities:
Olive (Microsoft)
Olive is Microsoft's hardware-aware model optimization toolkit that targets ORT deployment. It automates the optimization pipeline — graph transformation, quantization, tuning for specific EPs — through a declarative configuration:
```json
{
  "input_model": { "type": "PyTorchModel", "config": { "model_path": "model.pt" } },
  "systems": {
    "local_system": {
      "type": "LocalSystem",
      "config": { "accelerators": [{ "device": "gpu", "execution_providers": ["CUDAExecutionProvider"] }] }
    }
  },
  "passes": {
    "conversion": { "type": "OnnxConversion" },
    "transformers_optimization": { "type": "OrtTransformersOptimization" },
    "quantization": { "type": "OnnxDynamicQuantization" }
  },
  "evaluators": { "common_evaluator": { "metrics": [{ "name": "latency", "type": "latency" }] } }
}
```
Olive handles the tedious parts of the optimization pipeline — finding the best quantization config, applying transformer-specific fusions, tuning for EP-specific constraints — and outputs a deployment-ready artifact.
ONNX Model Zoo
The ONNX Model Zoo provides pre-exported, pre-optimized versions of many standard architectures: EfficientNet, BERT, ResNet, Whisper, YOLO variants, and others. If your deployment target uses one of these architectures, starting from the Zoo version can save significant export and validation time.
Netron
Netron is an indispensable browser-based ONNX model visualizer. Loading a .onnx file gives you a full interactive view of the computation graph — operator types, tensor shapes, weight statistics. Essential for debugging export failures, verifying quantization was applied correctly, and understanding why a particular EP is falling back to CPU.
When ONNX Is the Wrong Choice
Honest deployment engineering requires knowing when not to use a tool.
When you need extreme hardware-specific performance: If your entire stack is NVIDIA hardware and you need every bit of throughput, going directly to TensorRT without the ONNX Runtime intermediary may be faster. The TensorRT EP adds some dispatch overhead, and operators it does not support fall out of the TensorRT subgraph and back to the CUDA or CPU EP, splitting the graph.
When the model uses unsupported custom ops: ONNX has broad operator coverage, but it does not cover everything. Custom CUDA kernels, research-specific ops, or very new architectural components may not have an ONNX op. You can register custom ops with ORT, but this adds engineering overhead and breaks the portability guarantee.
When the target is Apple-first: If you are building exclusively for iOS and macOS and performance on Apple Silicon is the only target, the native CoreML SDK with direct coremltools conversion from PyTorch may outperform the ORT CoreML EP because Apple's compiler has more freedom to optimize end-to-end.
When model architecture changes frequently: The ONNX export-optimize-deploy pipeline has friction. If your model architecture changes every sprint and you need to re-export, re-quantize, and revalidate every time, that pipeline cost accumulates. For rapid iteration environments, staying in PyTorch with TorchScript or dynamic quantization may be more pragmatic until the architecture stabilizes.
Key Lessons from Production ONNX Deployments
The conversion is easy; the validation is not
Exporting to ONNX takes minutes. Validating that numerical outputs are correct across all inputs, that dynamic shapes behave properly, that performance is actually better, and that the model degrades gracefully under quantization — this takes real engineering time. Budget for it.
Execution provider priority order matters
ORT uses the first EP in your priority list that supports each operator. Get the list wrong and you may be running on CPU without realizing it. Always log the active EP after session creation:
```python
print(sess.get_providers())  # actual, not requested
```
Quantization affects accuracy non-uniformly
INT8 quantization has different impacts on different model families. Transformer-based models typically tolerate dynamic quantization well. CNN-based segmentation models may lose meaningful accuracy at INT8 and need per-layer calibration or mixed-precision (some layers at FP32, others at INT8). Test on your actual evaluation set, not just latency benchmarks.
Profile before optimizing
The biggest speedups come from profiling the actual bottleneck. ONNX Runtime has built-in profiling:
```python
sess_options.enable_profiling = True
sess_options.profile_file_prefix = "ort_profile"
```
This generates a Chrome-trace-compatible JSON file showing time spent per operator. Often the bottleneck is not where you expect — preprocessing, I/O, or a single operator consuming most of the time while the rest of the graph is already fast.
The Bigger Picture
ONNX succeeded where earlier interoperability efforts failed because it addressed the right constraint: not framework unification (an unsolvable political and technical problem), but inference-time portability. Training stays in PyTorch or JAX or whatever the research community converges on. Deployment goes through a stable, widely supported artifact format.
The result is a compounding ecosystem effect. Hardware vendors (NVIDIA, Intel, Qualcomm, Apple) invest in ONNX Runtime EPs because the format has broad adoption. That EP investment makes ONNX a better deployment target, which drives more adoption. The opset expands to cover new architectures. More frameworks add ONNX export support.
For a practitioner building AI systems that need to run anywhere from a data center GPU to a wearable device, this ecosystem effect is real and valuable. ONNX Runtime is not always the fastest option on any specific hardware, but it is frequently the most practical — the one you can rely on to work across the full range of targets you need to support, with good-enough performance and a maintainable codebase.
That combination — broad coverage, solid performance, active maintenance, and a stable interface — is what makes ONNX the foundation of serious edge AI deployment.