Locally This, Locally That
Last updated: April 23, 2026
I built an AI app that runs completely on device: no servers, no API requests, no data leaving the device. Local models are finally getting good. The recent releases from Qwen and Gemma show that the gap with hosted models is shrinking.
I am excited about what this unlocks for hardware, consumer apps, and privacy.
Robots and edge devices can't rely on perfect signal for their user experience. A robot working on a farm, at a construction site, or even in a living room can't just freeze up the moment it loses connection. Smart glasses and similar devices that run models locally skip the network round trip, which means lower latency and a much better UX.
For consumer apps, local LLMs make free tiers economically viable. Growth at all costs becomes very expensive when every user burns GPUs on your dime. Local inference flips that math.
The third vector is privacy. I have mixed feelings here. Anecdotally, most of my friends don't care where their data goes. My guess is that privacy-sensitive fields like law and medicine will eventually require on-device or on-prem models to stay compliant. Most users won't care, but regulated professionals will.
The hardware is already here. We're all walking around with phones that can comfortably run 2B-parameter models. So I wanted to see how far I could push that with a real app.
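A rough back-of-envelope check on that claim. This sketch counts weight memory only and ignores the KV cache, the vision encoder's activations, and runtime overhead. The 4.8 bits-per-weight figure is an assumed average for a 4-bit K-quant, not a measurement of any specific file.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumed figures for illustration, not measurements.
N_PARAMS = 2e9     # a "2B-parameter" model
FP16_BITS = 16.0   # unquantized half precision
Q4_BITS = 4.8      # assumed average bits per weight for a 4-bit K-quant

fp16_gb = weight_memory_gb(N_PARAMS, FP16_BITS)
q4_gb = weight_memory_gb(N_PARAMS, Q4_BITS)
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB")
```

Under these assumptions the quantized weights take a bit over 1 GB, which is why a 2B model is a plausible fit for a phone while the same model in half precision is much less comfortable.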
Ambrosia
I built Ambrosia, an offline AI journaling app to track how my skin responds to different diets and products. I picked skin specifically for two reasons: I've always wanted a better way to watch conditions and progress over time, and photos are a natural unit of a journal entry.
Two things I love about the app:
- AI-generated labels and trend tracking, all local.
- Everything, including the images, stays on device.
The Base Model
I went with Qwen3-VL-2B: multimodal, small enough to run on my iPhone, and already well-optimized for llama.cpp. I used the Q4_K_M quantized version from Hugging Face, and ran everything through RunAnywhere because their on-device SDKs are the best I've used.
Finetuning the Model Locally
I wanted to prove that I could take a base model and finetune it end-to-end locally on my MacBook (M2 Max).
With Codex, I trained a small LoRA adapter on MedQA, then merged and exported it back into the same Q4_K_M GGUF format so that it would work on my iPhone.
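For readers new to LoRA, here is a minimal sketch of what "merging" an adapter means, using the standard LoRA formulation W' = W + (alpha / r) * B A. It is pure Python on toy matrices, not the tooling used for the actual model, and the values are illustrative only. In a typical workflow the merge is done on the unquantized weights, and quantization to GGUF happens afterwards.

```python
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(w: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    rank = len(lora_a)              # A has shape r x d_in
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # B (d_out x r) times A (r x d_in)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

# Toy 2x3 weight with a rank-1 adapter (illustrative values only).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # r x d_in
B = [[0.5], [-1.0]]     # d_out x r

merged = merge_lora(W, A, B, alpha=2.0)
print(merged)
```

After the merge there is no separate adapter to load at inference time, so the exported file can be used anywhere the base model's format is supported.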
I also experimented with autoresearch, Karpathy's autonomous framework for model training, to maximize performance. I gave Codex the source code and let it sweep configurations. The sweep landed on a lower learning rate (1e-4) as the winner, beating the baseline by ~5 points on the micro-run. Scaled up to the full validation set, it held at 47.88%.
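The core idea of a configuration sweep can be sketched in a few lines. This is a plain grid search, not how autoresearch itself works, and `fake_evaluate` is a hypothetical stand-in for "run a short training job and score it on validation data". Its scores are invented so the example is deterministic.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple

Config = Dict[str, float]

def sweep(grid: Dict[str, Iterable[float]],
          evaluate: Callable[[Config], float]) -> Tuple[Config, float]:
    """Try every combination in the grid; return the best config and its score."""
    names = list(grid)
    best_config, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

def fake_evaluate(config: Config) -> float:
    """Hypothetical scorer. A real one would train briefly and validate.
    It is built to peak at lr=1e-4, rank=8 purely for illustration."""
    return -abs(config["lr"] - 1e-4) * 1e4 - abs(config["rank"] - 8) * 0.01

# Hypothetical search space; the post does not list the swept values.
grid = {"lr": [5e-5, 1e-4, 2e-4], "rank": [4, 8, 16]}
best, score = sweep(grid, fake_evaluate)
print(best)
```

Running each candidate on a small "micro-run" first and only scaling up the winner keeps a sweep like this affordable on a laptop.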
Finetuned model: huggingface.co/amankishore/qwen3-vl-2b-medqa-gguf
Constraining the Task
Initially I built Ambrosia as a chat-first experience. It was bad. I would document a new skin issue, and the model would respond with three long paragraphs about consulting a dermatologist.
I constrained the task. Qwen was much better at generating labels from images and analyzing trends across journal entries. Ambrosia went from a worse medical ChatGPT to a journal with ambient AI trend analysis.
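One way to constrain a labeling task is to give the model a fixed vocabulary and discard anything outside it. The sketch below assumes that approach. The label set, the prompt wording, and the function names are all hypothetical, since the post does not describe Ambrosia's actual prompts or labels.

```python
import json
from typing import List

# Hypothetical label vocabulary; the app's real label set is not given.
ALLOWED_LABELS = ["acne", "redness", "dryness", "texture", "clear"]

def build_prompt(allowed: List[str]) -> str:
    """Ask for a JSON list drawn only from the allowed labels."""
    return ("Look at the photo and return a JSON list of labels. "
            "Use only these labels: " + ", ".join(allowed) + ". "
            "Return JSON only, with no explanation.")

def parse_labels(raw_output: str, allowed: List[str]) -> List[str]:
    """Keep only known labels; return [] if the output is not a JSON list."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    kept: List[str] = []
    for item in data:
        label = str(item).strip().lower()
        if label in allowed and label not in kept:
            kept.append(label)
    return kept

# Simulated model outputs, not real generations.
good = parse_labels('["Acne", "redness", "sunburn"]', ALLOWED_LABELS)
bad = parse_labels("You should consult a dermatologist.", ALLOWED_LABELS)
print(good, bad)
```

With this shape, a rambling free-text answer simply produces no labels instead of ending up in the journal.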
This is why small models are underrated. They should not be compared to general assistants like ChatGPT. They're much better as narrow specialists constrained to a few tasks.
Since this task is disjoint from MedQA, I built a small eval set of journal entries a user might track (acne, texture, redness, etc.) and compared the finetuned model against the base. The finetune gave a small but real lift on label quality, while trend analysis was unchanged. For a 2B model doing a task it wasn't trained for, I'll take it.
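A comparison like this can be scored with a simple set-based F1 per entry, averaged over the eval set. The metric choice is an assumption, since the post does not say how label quality was measured, and the data below is toy data, not the post's results.

```python
from typing import List, Set

def f1(predicted: Set[str], expected: Set[str]) -> float:
    """Set-based F1 between predicted and expected labels for one entry."""
    if not predicted and not expected:
        return 1.0
    overlap = len(predicted & expected)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall)

def mean_f1(predictions: List[Set[str]], gold: List[Set[str]]) -> float:
    """Average per-entry F1 across the eval set."""
    return sum(f1(p, g) for p, g in zip(predictions, gold)) / len(gold)

# Toy eval set and toy predictions from two hypothetical models.
gold = [{"acne"}, {"redness", "dryness"}, {"texture"}]
preds_a = [{"acne"}, {"redness"}, {"acne"}]
preds_b = [{"acne"}, {"redness", "dryness"}, {"acne"}]

score_a = mean_f1(preds_a, gold)
score_b = mean_f1(preds_b, gold)
print(f"model A: {score_a:.3f}, model B: {score_b:.3f}")
```

Running both models over the same fixed set of entries is what makes a "small but real" difference meaningful rather than anecdotal.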
Edge Intelligence
Ambrosia was a small experiment, but it maps to the shape of the future: take a small base model, aggressively constrain the domain, tune for your specific task, ship it. Repeat.
Chips keep getting faster. Small models keep getting smarter. Eventually these lines will converge. When they do, you'll have the power of AGI in the palm of your hand.