Blog

Quantizing Whisper Base to INT8 and Deploying It on iOS Without a Framework

whisper, onnx, quantization, ios, flutter, on-device-ml, asr

How I exported Whisper Base from Hugging Face, quantized it to INT8 with onnxruntime.quantization, and built the full inference pipeline in Dart — mel spectrogram, tokenizer, autoregressive decoder, token streaming — without sherpa-onnx or any abstraction layer.


ONNX: The Universal Runtime That Makes Edge AI Real

onnx, edge-ai, deployment, onnxruntime, quantization, inference-optimization

A deep dive into ONNX — the open model format and its runtime ecosystem that decouple training from deployment, so the same exported model can run on CPUs, GPUs, mobile chips, and browsers without rewriting a line of inference code.
