When I started building HealthOS — a local-first personal health record app — one of the first things I needed to get right was voice input.
The use case is simple: a user records a voice note describing a symptom, a medication change, a conversation with their doctor. That note gets transcribed and attached to their health record. It all happens on the device. No audio ever leaves the phone.
That privacy constraint is non-negotiable. It is also the constraint that makes this technically interesting.
This post is about how I deployed Whisper Tiny INT8 natively on iOS — using ONNX, sherpa-onnx, and a background Dart Isolate — and what it took to make transcription feel fast enough to be genuinely useful.
Why Whisper, and Why On-Device
Whisper is OpenAI's automatic speech recognition model, trained on 680,000 hours of multilingual audio. The key property for this use case is accuracy on natural, conversational speech — people describing symptoms, medication names, dates, provider names. General-purpose ASR that fails on "metformin" or "HbA1c" is not useful in a health context.
The model family ranges from Tiny (39M parameters) to Large (1.5B parameters). For on-device mobile deployment, Tiny is the practical starting point. It is fast enough to run on a phone CPU, small enough to bundle in an app asset, and accurate enough for clear speech in a quiet environment.
The alternative — cloud ASR — was never really an option. HealthOS is built on a principle I care deeply about: the user is the custodian of their health data, not a cloud service. Raw audio of someone describing their medical history should not leave their device. Full stop.
The Stack
The app is built in Flutter, targeting iOS first. The ASR pipeline uses:
- record Flutter plugin: audio capture at PCM16, 16 kHz, mono
- sherpa-onnx: the Flutter/Dart wrapper around the sherpa-onnx C++ library, which wraps ONNX Runtime
- Whisper Tiny INT8: split encoder + decoder ONNX models, quantized to INT8
- Dart Isolates: the inference runs in a persistent background isolate, keeping the UI thread free
- sqflite: local SQLite for persisting transcripts alongside health records
The model assets are bundled in the Flutter app, extracted to the application support directory on first launch, and loaded there. Everything runs on the CPU via ONNX Runtime's CPU execution provider.
The Whisper ONNX Model Format
One thing that catches people out with Whisper ONNX exports: the model is not a single graph. It is split into two ONNX files:
- tiny-encoder.int8.onnx — converts the mel spectrogram into hidden state representations
- tiny-decoder.int8.onnx — autoregressively generates tokens from the encoder output
sherpa-onnx handles the full decode pipeline internally — mel feature extraction, tokenizer, beam search or greedy decode, detokenization back to text — so from the application side I only need to pass raw PCM samples and get a string back. That is the right level of abstraction for this use case.
The model files are distributed via the sherpa_onnx pub package and can be downloaded with a helper script:
./scripts/download_sherpa_whisper_tiny.sh
This pulls the INT8-quantized Whisper Tiny assets into assets/models/sherpa_whisper_tiny/, where Flutter bundles them into the app.
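For completeness, the corresponding asset declaration in pubspec.yaml would look something like this (the directory path is taken from the script above; the exact entry is an assumption, not copied from the project):

```yaml
flutter:
  assets:
    - assets/models/sherpa_whisper_tiny/
```

Declaring the directory with a trailing slash bundles every file inside it, so new model files do not require a pubspec change.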
The Isolate Architecture
This is the design decision that matters most for user experience.
Whisper inference on a phone CPU is not instantaneous. Even the Tiny model takes a few hundred milliseconds to a few seconds depending on audio length and hardware. If I run that on the Flutter main isolate, the UI freezes for the duration. Buttons stop responding. Animations stutter. The app feels broken.
The fix is a persistent background Dart Isolate that owns the recognizer and handles all inference requests asynchronously. The main isolate sends audio samples; the worker sends back text. The UI never blocks.
Here is how the worker isolate entry point is structured:
void _asrWorkerEntry(_AsrIsolateConfig cfg) {
  sherpa.initBindings();
  final port = ReceivePort();
  cfg.sendPort.send(port.sendPort); // handshake
  final recognizer = sherpa.OfflineRecognizer(
    sherpa.OfflineRecognizerConfig(
      model: sherpa.OfflineModelConfig(
        whisper: sherpa.OfflineWhisperModelConfig(
          encoder: cfg.encoderPath,
          decoder: cfg.decoderPath,
          language: 'en',
          task: 'transcribe',
        ),
        tokens: cfg.tokensPath,
        numThreads: cfg.numThreads,
        modelType: 'whisper',
        debug: false,
        provider: 'cpu',
      ),
      decodingMethod: 'greedy_search',
    ),
  );
  port.listen((dynamic msg) {
    if (msg is _InferRequest) {
      final stream = recognizer.createStream();
      try {
        stream.acceptWaveform(samples: msg.samples, sampleRate: 16000);
        recognizer.decode(stream);
        final result = recognizer.getResult(stream).text.trim();
        cfg.sendPort.send(_InferResponse(msg.id, result));
      } finally {
        stream.free();
      }
    } else if (msg == 'dispose') {
      recognizer.free();
      Isolate.exit();
    }
  });
}
The recognizer is instantiated once and reused across all requests. This matters because model loading is expensive — you pay that cost once at startup, not on every transcription call.
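The main-isolate side of this protocol is not shown in the post, but it can be sketched as a small service class: spawn the worker, wait for the handshake SendPort, then correlate requests and responses by id. The `AsrService` and `_stubWorkerEntry` names here are mine; the stub worker echoes the sample count instead of running sherpa-onnx, purely so the message-passing plumbing is self-contained.

```dart
import 'dart:async';
import 'dart:isolate';
import 'dart:typed_data';

class _InferRequest {
  final int id;
  final Float32List samples;
  _InferRequest(this.id, this.samples);
}

class _InferResponse {
  final int id;
  final String text;
  _InferResponse(this.id, this.text);
}

// Stand-in for the real sherpa-onnx worker: echoes the sample count so the
// request/response protocol can be demonstrated without the native library.
void _stubWorkerEntry(SendPort toMain) {
  final commands = ReceivePort();
  toMain.send(commands.sendPort); // handshake
  commands.listen((dynamic msg) {
    if (msg is _InferRequest) {
      toMain.send(_InferResponse(msg.id, '${msg.samples.length} samples'));
    } else if (msg == 'dispose') {
      commands.close();
      Isolate.exit();
    }
  });
}

class AsrService {
  late final SendPort _worker;
  final _responses = ReceivePort();
  final _pending = <int, Completer<String>>{};
  var _nextId = 0;

  Future<void> start() async {
    final ready = Completer<SendPort>();
    _responses.listen((dynamic msg) {
      if (msg is SendPort) {
        ready.complete(msg); // the worker's command port
      } else if (msg is _InferResponse) {
        _pending.remove(msg.id)?.complete(msg.text);
      }
    });
    await Isolate.spawn(_stubWorkerEntry, _responses.sendPort);
    _worker = await ready.future;
  }

  // Returns a future that completes when the matching response arrives;
  // the UI isolate never blocks on inference.
  Future<String> transcribe(Float32List samples) {
    final id = _nextId++;
    final completer = Completer<String>();
    _pending[id] = completer;
    _worker.send(_InferRequest(id, samples));
    return completer.future;
  }

  void dispose() {
    _worker.send('dispose');
    _responses.close();
  }
}
```

Correlating by id matters once requests can queue up: responses may be pending for more than one chunk at a time, and the map of Completers keeps each future paired with its own result.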
The thread count is determined at runtime:
static int _recommendedThreadCount() {
  final cores = Platform.numberOfProcessors;
  if (cores < 2) return 2;
  if (cores > 4) return 4;
  return cores;
}
Modern iPhones have high core counts but ONNX Runtime's CPU EP scales well only up to a point for this model size. Capping at 4 threads avoids diminishing returns and thermal pressure on the efficiency cores.
The Chunking Strategy
Whisper was designed around a fixed 30-second audio window. Feed it audio shorter than that and it zero-pads up to 30 seconds, which wastes compute and can introduce hallucinations at the padding boundary. Feed it audio longer than 30 seconds and the decoder runs beyond its effective context, losing accuracy on later content.
The right approach for recordings longer than 30 seconds is chunking — split the audio into overlapping windows, transcribe each window, and merge the results.
I use 28-second chunks with a 2-second overlap:
const int chunkMs = 28000;
const int overlapMs = 2000;
The 28-second window sits comfortably inside Whisper's native context without triggering the worst of the zero-padding behavior. The 2-second overlap gives adjacent chunks shared context at their boundaries, which helps the merge step find the seam correctly.
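The windowing itself is straightforward arithmetic over the sample buffer. This is a sketch under the post's constants (28 s chunks, 2 s overlap, 16 kHz); the `chunkSamples` helper name is mine:

```dart
import 'dart:math';
import 'dart:typed_data';

// Split a PCM sample buffer into overlapping windows. Each window starts
// (chunkMs - overlapMs) after the previous one, so adjacent chunks share
// overlapMs of audio at their boundary.
List<Float32List> chunkSamples(
  Float32List samples, {
  int sampleRate = 16000,
  int chunkMs = 28000,
  int overlapMs = 2000,
}) {
  final chunkLen = sampleRate * chunkMs ~/ 1000;
  final step = sampleRate * (chunkMs - overlapMs) ~/ 1000;
  final chunks = <Float32List>[];
  for (var start = 0; start < samples.length; start += step) {
    final end = min(start + chunkLen, samples.length);
    chunks.add(Float32List.sublistView(samples, start, end));
    if (end == samples.length) break; // last (possibly short) chunk
  }
  return chunks;
}
```

A 60-second recording at 16 kHz (960,000 samples) yields three chunks covering 0–28 s, 26–54 s, and 52–60 s.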
Silence detection is applied before sending each chunk to the recognizer — no point spending inference time on silence:
if (_rms(chunk) >= _minChunkRms) {
  final text = (await _inferInWorker(chunk)).trim();
  if (text.isNotEmpty) parts.add(text);
}
The RMS threshold (0.008) is deliberately low: high enough to skip true silence, but not so aggressive that it drops legitimate low-energy speech from quiet speakers.
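The RMS calculation the threshold is compared against is the usual root-mean-square of the sample values; a sketch (the post does not show `_rms` itself):

```dart
import 'dart:math';
import 'dart:typed_data';

// Root-mean-square energy of a normalized PCM buffer. Values near 0 indicate
// silence; clear speech at normal levels sits well above the 0.008 threshold.
double _rms(Float32List samples) {
  if (samples.isEmpty) return 0.0;
  var sumSquares = 0.0;
  for (final s in samples) {
    sumSquares += s * s;
  }
  return sqrt(sumSquares / samples.length);
}
```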
Merging adjacent chunks uses word overlap: the last few words of the previous chunk are compared against the first few words of the next, and any matching run is emitted only once:
String _mergeByWordOverlap(String a, String b) {
  final leftTokens = a.trim().split(RegExp(r'\s+'));
  final rightTokens = b.trim().split(RegExp(r'\s+'));
  // Cap at the shorter side so sublist() can never go out of range.
  final cappedOverlap = min(min(leftTokens.length, rightTokens.length), 12);
  for (var k = cappedOverlap; k >= 1; k--) {
    final leftSuffix = leftTokens.sublist(leftTokens.length - k);
    final rightPrefix = rightTokens.sublist(0, k);
    if (_tokensMatch(leftSuffix, rightPrefix)) {
      return '${a.trim()} ${rightTokens.skip(k).join(' ')}'.trim();
    }
  }
  return '${a.trim()} ${b.trim()}';
}
This is simple and works well in practice for clean speech. Whisper's greedy decoder tends to reproduce the same words at chunk boundaries when the audio overlaps, so the match rate is high.
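The `_tokensMatch` helper referenced in the merge function is not shown in the post. A plausible implementation compares tokens case-insensitively and ignores trailing punctuation, since Whisper may punctuate the same boundary word differently in adjacent chunks — the normalization choices here are mine:

```dart
// Hypothetical implementation of _tokensMatch: two token runs match if each
// pair is equal after lowercasing and stripping trailing punctuation.
bool _tokensMatch(List<String> a, List<String> b) {
  if (a.length != b.length) return false;
  String norm(String t) => t.toLowerCase().replaceAll(RegExp(r'[^\w]+$'), '');
  for (var i = 0; i < a.length; i++) {
    if (norm(a[i]) != norm(b[i])) return false;
  }
  return true;
}
```

Loosening the comparison this way raises the match rate at chunk seams without risking false merges on clean speech, since a full run of k words still has to agree.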
Model Provisioning on iOS
Models cannot be loaded from the Flutter asset bundle directly — ONNX Runtime needs a filesystem path, not a byte buffer. On first launch, I copy the model files from the asset bundle to the application support directory:
Future<void> _ensureModelFiles() async {
  final supportDir = await getApplicationSupportDirectory();
  final modelDir = Directory(
    p.join(supportDir.path, 'models', 'whisper_tiny_int8_en'),
  );
  await modelDir.create(recursive: true);
  for (final filename in _modelFiles) {
    final dest = File(p.join(modelDir.path, filename));
    if (await dest.exists()) continue;
    final data = await rootBundle.load('$_assetPrefix/$filename');
    await dest.writeAsBytes(
      data.buffer.asUint8List(data.offsetInBytes, data.lengthInBytes),
      flush: true,
    );
  }
  _modelDir = modelDir.path;
}
The if (await dest.exists()) continue check means subsequent launches skip the copy step entirely. The model files are written once and reused.
The three files for Whisper Tiny INT8 via sherpa-onnx:
- tiny-encoder.int8.onnx (~15 MB)
- tiny-decoder.int8.onnx (~27 MB)
- tiny-tokens.txt (~800 KB)
Total model footprint on disk: roughly 43 MB. Acceptable for a health application where the data sensitivity justifies bundling the inference stack locally.
What "Near Real-Time" Actually Means Here
I want to be precise about what I mean by near real-time, because the term gets used loosely.
Whisper is an offline model — it processes a complete audio segment, not a live stream. So "real-time" here does not mean the transcript appears word by word while you speak. What it means is:
The transcription completes within a few seconds of stopping the recording.
In practice, on an iPhone 14 with Whisper Tiny INT8 and 4 threads, a 30-second voice note transcribes in roughly 2-4 seconds. That is fast enough that the user experience feels immediate: record, stop, transcript appears.
For shorter clips (5-10 seconds — a typical voice note for "took ibuprofen 400mg this morning"), transcription is under a second.
The design deliberately moved away from live streaming transcription — where partial text is shown during recording — for this use case. Whisper performs meaningfully better on complete utterances than on short chunks. A health record note needs to be accurate. The few-second post-recording wait is the right trade-off.
The Audio Pipeline
Audio is captured at 16 kHz, mono, PCM16 — exactly what Whisper expects — via the record plugin:
const config = RecordConfig(
  encoder: AudioEncoder.pcm16bits,
  sampleRate: 16000,
  numChannels: 1,
  autoGain: false,
  echoCancel: false,
  noiseSuppress: false,
  streamBufferSize: 4096,
);
I deliberately disable auto-gain, echo cancellation, and noise suppression. These iOS audio processing filters can distort or suppress the kind of speech content that matters in a health context — someone spelling a medication name, reading a lab value, describing a symptom with clinical terminology. Whisper is robust to natural recording conditions; aggressive audio processing can actually hurt accuracy by removing signal it needs.
Raw PCM chunks stream to a StreamController and simultaneously to a file sink. When the user stops recording, the accumulated PCM bytes are wrapped with a standard WAV header written in pure Dart:
Uint8List _pcm16ToWav(Uint8List pcmBytes, {required int sampleRate, required int channels}) {
  final bd = ByteData(44 + pcmBytes.length);
  // Helper to write ASCII chunk tags into the header.
  void writeAscii(int offset, String s) {
    for (var i = 0; i < s.length; i++) {
      bd.setUint8(offset + i, s.codeUnitAt(i));
    }
  }
  // RIFF header
  writeAscii(0, 'RIFF');
  bd.setUint32(4, 36 + pcmBytes.length, Endian.little);
  writeAscii(8, 'WAVE');
  // fmt chunk
  writeAscii(12, 'fmt ');
  bd.setUint32(16, 16, Endian.little); // fmt chunk size
  bd.setUint16(20, 1, Endian.little); // PCM
  bd.setUint16(22, channels, Endian.little);
  bd.setUint32(24, sampleRate, Endian.little);
  // ... data chunk
}
No dependency on any audio encoding library. The WAV file is both the inference input and the stored audio attachment on the health record.
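One step the post elides is converting the captured PCM16 bytes into the normalized Float32 samples the recognizer consumes. A sketch, assuming the standard PCM16 convention of dividing by 32768 to land in [-1.0, 1.0] (the `pcm16ToFloat32` name is mine):

```dart
import 'dart:typed_data';

// Convert little-endian PCM16 bytes into Float32 samples in [-1.0, 1.0],
// the range offline recognizers like sherpa-onnx expect for acceptWaveform.
Float32List pcm16ToFloat32(Uint8List pcmBytes) {
  final view = ByteData.sublistView(pcmBytes);
  final out = Float32List(pcmBytes.length ~/ 2);
  for (var i = 0; i < out.length; i++) {
    out[i] = view.getInt16(i * 2, Endian.little) / 32768.0;
  }
  return out;
}
```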
Lessons from Running This on Real Devices
Model loading latency matters more than I expected. The first transcription after app launch takes longer because the recognizer initializes in the worker isolate. I added a warmup call during app startup (_asrService.warmup()) to front-load this cost while the user is still looking at the home screen.
Greedy decoding is the right choice for Tiny. Beam search improves quality on larger Whisper models but adds meaningful latency on Tiny without a proportional accuracy gain. For a mobile health notes use case, greedy is the right trade-off.
The INT8 quantization is worth it. Whisper Tiny INT8 is noticeably faster than FP32 on ARM CPUs, with negligible accuracy degradation for English speech. The model size reduction also matters for app store bundle size.
Hallucination on silence is real. Whisper famously hallucinates transcriptions on silence or very quiet audio — producing phrases like "Thank you for watching" or "Subtitles by..." The RMS-based silence filter catches most of this, but the clean solution is also the application-level check: if the whole clip is below the RMS threshold, skip inference entirely and return an empty result.
Dart Isolate message passing has a cost. Sending large Float32List buffers between isolates involves a copy. For 28-second audio at 16kHz (448,000 float samples, ~1.75 MB), this copy is measurable. It is not a bottleneck in practice for this use case, but for higher-throughput scenarios it is worth using TransferableTypedData to transfer ownership instead of copying.
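The zero-copy alternative mentioned above can be sketched with two small helpers (names are mine). Note that TransferableTypedData detaches the underlying buffer when it is created, so the original Float32List must not be touched afterwards:

```dart
import 'dart:isolate';
import 'dart:typed_data';

// Wrap the sample buffer for transfer. After this call the original list's
// buffer is detached — ownership moves to the TransferableTypedData.
TransferableTypedData packSamples(Float32List samples) {
  return TransferableTypedData.fromList([samples.buffer.asUint8List()]);
}

// On the receiving side, materialize() reclaims the bytes exactly once,
// without an intermediate copy.
Float32List unpackSamples(TransferableTypedData data) {
  return data.materialize().asFloat32List();
}
```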
Why sherpa-onnx Over Direct ONNX Runtime
I explored direct ONNX Runtime integration in Flutter before settling on sherpa-onnx. The onnxruntime pub package works, but it gives you a raw tensor interface — you are responsible for mel spectrogram extraction, tokenization, the autoregressive decoding loop, and detokenization. For Whisper, that is a non-trivial amount of signal processing and inference orchestration to get right.
sherpa-onnx packages all of that into a single OfflineRecognizer API. You pass PCM samples. You get text. The underlying implementation uses ONNX Runtime with the CPU EP (with CoreML EP support available). The sherpa-onnx iOS plugin ships prebuilt XCFramework binaries, so there is no C++ compilation step in the Flutter build.
For a project where speech recognition is one component in a larger system — rather than the core product — sherpa-onnx is the right abstraction level.
The Bigger Context: HealthOS
This ASR pipeline is the first working slice of HealthOS: a patient-owned, local-first personal health record system.
The vision is straightforward. Health records are scattered across hospitals, specialists, labs, and dental offices — and none of them talk to each other. More importantly, the patient does not own their own data. HealthOS gives users a single private place to collect, understand, and control their lifetime health data: voice notes, scanned records, provider data via FHIR, vitals, and eventually a local LLM that can surface trends and help prepare for provider visits.
The voice pipeline I described here is the ingestion layer for unstructured health notes. It feeds into a data model where the transcript lives on a HealthRecord entity alongside metadata, source audio, and eventually entity-extracted fields for medications, conditions, and dates.
Every component follows the same principle the ASR pipeline embodies: process on-device by default, store locally, never transmit without an explicit user action. The speech recognition case is the clearest expression of why that matters — nobody should need to send an audio recording of their medical history to a server to get it transcribed.
Getting Whisper running natively on an iPhone turned out to be more straightforward than I expected, once I found the right layering. ONNX Runtime handles the hardware abstraction. sherpa-onnx handles the model-specific inference pipeline. Flutter Isolates handle the UI/inference separation. The remaining engineering — chunking, silence detection, merge logic — is application logic that you would have to write regardless of the inference stack.
The result is a transcription pipeline that runs fully offline, completes within seconds, and never touches a server. For a health application, that is not a feature — it is the foundation.