Akash/ai·ml
April 5, 2025 · 6 min read

Computer Vision on the Edge: Lessons from 5 Real Deployments

Running CV models on edge hardware is a completely different problem from running them in the cloud. Here's what five production deployments taught me about latency, quantization, and shipping edge AI that keeps working.

Computer Vision · PyTorch · ONNX · Edge AI

Cloud-based computer vision is straightforward: ship an image to an endpoint, get a prediction back. Edge CV is a different beast entirely. The model has to run on a Jetson, a Raspberry Pi, or a custom ASIC with hard latency requirements, no reliable internet, and hardware that doesn't get updated.

I've shipped CV systems for retail analytics, warehouse automation, agricultural monitoring, and industrial quality control. Here's what I learned the hard way.

Start with ONNX from Day One

The single decision that will save you the most time is committing to ONNX as your interchange format before you write the first line of training code. Frameworks diverge. Hardware runtimes evolve. ONNX stays stable.

import torch

# Your trained detector — YOLOv8DetectionModel is a placeholder class name
model = YOLOv8DetectionModel()
model.eval()  # export in inference mode

dummy_input = torch.randn(1, 3, 640, 640)  # NCHW, 640x640 input

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=17,
    input_names=["images"],
    output_names=["output"],
    dynamic_axes={"images": {0: "batch_size"}},  # allow variable batch size
)

From ONNX you can compile to TensorRT for Jetson, OpenVINO for Intel hardware, or CoreML for Apple Silicon — without touching your training pipeline.

Quantization: INT8 or You're Not Edge-Ready

FP32 inference on a Jetson Nano is almost always too slow. INT8 quantization typically delivers 3–4× speedup with accuracy loss under 1% for detection tasks if you calibrate properly.

The calibration dataset matters more than most people expect. I use 200–500 representative images from the target environment — not the training distribution — as the calibration set. Models calibrated on their training data hallucinate confidence on real-world edge cases.
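Building that calibration set can be as simple as sampling deterministically from the frames the deployed cameras actually capture. A minimal sketch (the directory layout, file extensions, and sample size are assumptions — adjust for your capture pipeline):

```python
import random
from pathlib import Path

def sample_calibration_set(image_dir, n=300, seed=0, exts=(".jpg", ".png")):
    """Pick n representative frames from the *target environment* capture
    directory (not the training set) for INT8 calibration.

    A fixed seed keeps the calibration set reproducible across builds.
    """
    files = sorted(
        p for p in Path(image_dir).rglob("*") if p.suffix.lower() in exts
    )
    if len(files) <= n:
        return files  # fewer frames than requested: use them all
    rng = random.Random(seed)
    return sorted(rng.sample(files, n))
```

Sorting before and after sampling keeps the output stable, so two builds calibrated from the same capture directory produce the same engine.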

# TensorRT INT8 calibration via torch2trt
import tensorrt as trt
from torch2trt import torch2trt

trt_model = torch2trt(
    model,
    [dummy_input],
    fp16_mode=False,
    int8_mode=True,
    # Calibration images drawn from the target environment, not training data
    int8_calib_dataset=CalibrationDataset(calib_images),
    int8_calib_algorithm=trt.CalibrationAlgoType.ENTROPY_CALIBRATION_2,
)

Latency Budgeting Before Architecture Selection

On three of my five deployments, the model was already in training before anyone had written down a latency requirement. This is backwards. The hardware and the required frames-per-second should determine your model architecture, not the other way around.

My process now:

  1. Profile the target hardware (inference runtime benchmark with a dummy model)
  2. Define the latency SLA with the client (e.g. <50ms per frame)
  3. Select the largest model that fits the budget, with 20% headroom for pre/post-processing
  4. Only then start training
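Steps 2 and 3 above reduce to a few lines of arithmetic. Here's a sketch, with hypothetical per-frame latencies standing in for your own device benchmarks:

```python
def model_budget_ms(sla_ms, headroom=0.20):
    """Usable model latency: reserve a fraction of the SLA for
    pre/post-processing and jitter."""
    return sla_ms * (1.0 - headroom)

def pick_model(sla_ms, candidates, headroom=0.20):
    """Return the slowest (i.e. largest) candidate that fits the budget.

    candidates: dict of model name -> measured per-frame latency in ms,
    benchmarked on the actual target hardware.
    """
    budget = model_budget_ms(sla_ms, headroom)
    viable = {name: ms for name, ms in candidates.items() if ms <= budget}
    if not viable:
        return None  # no candidate fits: revisit the SLA or the hardware
    return max(viable, key=viable.get)

# Hypothetical measured latencies on the target device:
latencies = {"yolov8n": 12.0, "yolov8s": 21.0, "yolov8m": 35.0}
pick_model(50, latencies)  # 40 ms budget -> "yolov8m"
```

The point isn't the helper itself — it's that the numbers come from profiling the real device first, so the architecture choice is a lookup, not a debate.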

YOLOv8n runs at ~12ms on a Jetson Xavier NX. YOLOv8m runs at ~35ms. If your SLA is 40ms, you have room for YOLOv8m — but only if you've accounted for preprocessing (decode, resize, normalize), which adds 5–15ms depending on resolution.

The Thermal Problem Nobody Talks About

Edge devices throttle under sustained load. A Jetson AGX running at full capacity in a 35°C warehouse will thermal-throttle within 20–30 minutes, dropping inference speed by 30–40%. I've seen systems that passed QA in an air-conditioned lab fail acceptance testing on the factory floor.

Solutions: active cooling where possible, thermal profiling at ambient temperature, and building a 25% performance buffer into your latency budget for hot environments.
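Thermal profiling doesn't need special tooling on Linux-based devices like Jetsons — the kernel exposes zone temperatures under sysfs. A minimal sketch (the 85°C warning threshold is an assumption; check your device's throttle points):

```python
import glob

THROTTLE_WARN_C = 85.0  # assumed threshold — consult your device's specs

def read_zone_temps(pattern="/sys/class/thermal/thermal_zone*/temp"):
    """Read all thermal zones. Linux reports millidegrees Celsius."""
    temps = {}
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                temps[path] = int(f.read().strip()) / 1000.0
        except OSError:
            pass  # zone may be unreadable; skip it
    return temps

def throttle_risk(temps, warn_c=THROTTLE_WARN_C):
    """Return the zones at or above the warning threshold."""
    return {zone: t for zone, t in temps.items() if t >= warn_c}
```

Log these readings alongside inference latency during your burn-in test — the correlation between rising zone temperature and rising P99 latency is what tells you the throttle point, not the spec sheet.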

Monitoring on the Edge

Edge devices often run in locations with intermittent connectivity. I deploy a lightweight sidecar process on every device that buffers metrics locally and syncs when connectivity is available:

  • Inference latency (P50, P95, P99)
  • Detection confidence distribution (drift indicates scene change or model degradation)
  • CPU/GPU temperature and clock frequency (thermal health)
  • False positive rate from a sample of flagged frames sent for human review
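The buffering side of that sidecar is simple: append metrics as JSON lines on disk, then drain the file when connectivity returns. A minimal sketch, not the production sidecar (which also needs rotation and size caps):

```python
import json
import time
from pathlib import Path

class MetricsBuffer:
    """Local metrics buffer for intermittently connected edge devices:
    append JSON lines on disk, drain them when a sync succeeds."""

    def __init__(self, path):
        self.path = Path(path)

    def record(self, **metric):
        metric.setdefault("ts", time.time())  # timestamp at capture, not sync
        with self.path.open("a") as f:
            f.write(json.dumps(metric) + "\n")

    def drain(self):
        """Return all buffered records and clear the buffer.
        Call this only after the upstream sync has been confirmed."""
        if not self.path.exists():
            return []
        records = [
            json.loads(line)
            for line in self.path.read_text().splitlines()
            if line
        ]
        self.path.unlink()
        return records
```

The important property is that nothing is discarded until the sync is confirmed — a device that's offline for a week still reports its full latency and thermal history when it comes back.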

Confidence distribution drift is the early warning signal for model degradation. If the distribution shifts right (overconfident) or left (underconfident) without a corresponding ground-truth accuracy change, something has changed in the scene — lighting, camera angle, or object appearance — and the model needs retuning.
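One way to quantify that drift is the Population Stability Index between a baseline confidence sample and the current window. This is my framing, not a universal standard for CV monitoring — and the common PSI > 0.2 alert threshold is a rule of thumb, not a guarantee:

```python
import math

def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index between two confidence-score samples.

    Scores are assumed to lie in [0, 1]. Rule of thumb: PSI > 0.2
    suggests meaningful drift worth a human look.
    """
    def hist(scores):
        counts = [0] * bins
        for s in scores:
            counts[min(int(s * bins), bins - 1)] += 1
        total = max(len(scores), 1)
        # eps avoids log(0) for empty bins
        return [c / total + eps for c in counts]

    b, c = hist(baseline), hist(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))
```

Because PSI compares distributions rather than accuracy, it runs on-device with no labels — which is exactly what you need when ground truth only arrives via the occasional human-reviewed sample.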

What I'd Tell Myself on Deployment #1

Prototype in PyTorch, export to ONNX on day one, quantize before you benchmark, and instrument everything. The models are the easy part. The hard part is keeping them working in the real world.

If you're planning an edge CV project, let's talk. Getting the architecture right before you start training saves weeks.


Building something with AI?

I help teams ship production-grade AI systems. Let's talk.

Get in touch