Skip to content

how to pick a local AI model for your phone in 2026

guides Fist clutches a glowing green AI chip shooting blue lightning in a graffitied room — local AI model running on your phone.


your phone can now run ai without the internet – no server, no subscription, no one reading your chats. the model lives on your device and works offline. in 2026 this finally works well enough for everyday use. the hardware caught up.

the signs are everywhere. openai is reportedly working on a phone where ai replaces apps entirely. developers are already shipping react native apps with local ai baked in, no cloud required.

but here's the problem: not every model runs well on every phone. pick the wrong one and it crawls, drains your battery in an hour, or simply refuses to load because it doesn't fit in memory. there are real tradeoffs and they're not obvious.

this guide walks you through every criterion you need to make a smart choice – explained in plain language, no assumed knowledge.


checklist by thehype

Three-section table: phone local model requirements, phone requirements, app requirements — covering RAM, storage, NPU, tokens/sec, runtime.

1. ram


ram is your phone's short-term working memory – like a desk where work happens. the model's size in memory must fit alongside the operating system – android or ios, the app that actually runs your ai, and background processes (apps running silently in the background).

a phone showing "8 gb ram" might only have 3-4 gb free, while the optimal free ram is around 4-6 gb. you need to know your actual free ram, not total.


2. storage


models are heavy. you better have several gb free. but also consider: where is the model stored?

internal storage is faster than sd card (a removable memory card you insert into the phone). some apps stream from sd card (read the model piece by piece instead of loading it) which causes stutters.

also factor in that you may want to try multiple models, so headroom (extra free space) matters.


npu / chipset


an npu is neural processing unit – a chip designed specifically to run ai fast.

check whether your runtime (the software engine running the model) actually uses the npu – many apps use the gpu (graphics chip, also good at ai math) instead, and some fall back to cpu (the main processor, slowest for ai) entirely.

cpu-only inference (running the model only on the main processor) on a large model is unusably slow.

metal (apple's system for using the gpu on iphone/ipad) and vulkan (android's equivalent) are the main gpu paths. some snapdragon and dimensity chips (popular phone processor brands) have dedicated ai engines but software support varies.

tokens per second


tokens per second show how fast the model produces text – your main speed indicator.

a token is roughly ¾ of a word – so 10 tok/s means about 7-8 words per second. below ~8 tok/s feels sluggish for chat. above ~20 is comfortable. this depends on model size, quantization (compression level – explained below), chip, and runtime together – not any one factor alone.


time to first token


separate from generation speed. this is how long before the model starts responding.

cold load time (starting from scratch, model not in memory yet) can be 10-30 seconds. warm load (model already loaded and ready) is much faster. matters a lot if you open and close the app frequently.

context window


this is how much text the model can hold in one session – including your messages, its replies, and any documents you paste. it's measured in tokens.

small context (2k-4k tokens, about 1,500-3,000 words) means it forgets earlier parts of long conversations. large context (32k+, about 24,000+ words) is more demanding on ram and speed.

match this to your actual usage: short q&a needs little, document summarization needs a lot.


training data cutoff


determines what the model knows about the world. a model with a cutoff in early 2023 won't know about anything that happened after. for factual queries about recent events this matters a lot. for coding help or reasoning tasks it matters much less.


runtime / app support


the model format (the file type the model is saved in – like gguf, onnx, or executorch) must be supported by your app. not every app supports every format.

iOS App Store listings for PocketPal AI, Liquid Apollo, and Enclave Local AI Assistant — three apps to run local AI models on iPhone.

quantization level


quantization compresses model weights (the numbers that encode everything the model knows) to take up less space and ram, at the cost of some accuracy.

think of it like image compression – a heavily compressed jpeg is smaller but looks worse. lower bit = smaller and faster but more degraded. the degradation is not linear – going from q8 (high quality, large) to q4 (half the size, barely worse) loses little, going from q4 to q2 loses a lot.

also: the type of quantization matters, not just the bit level. k-quants (smarter compression methods that preserve important weights) like k_m and k_s are generally better than naive rounding (compressing everything equally regardless of importance).


license


some models are open weights (the model files are publicly downloadable) but restrict commercial use (using it in a business or product), require attribution (crediting the creators), or prohibit certain applications. for pure personal use this rarely matters. if you're building something on top of it, check carefully.

common licenses you'll encounter:

  • apache 2.0 – most permissive. use freely, commercially, modify, distribute. just keep the license notice. no restrictions worth worrying about for most people.
  • mit – similarly permissive. very short and simple. do almost anything, just credit the author.
  • llama community license – meta's custom license for llama models. free for most uses but has restrictions above 700 million monthly active users, and prohibits using it to train other models.
  • gemma terms of use – google's license for gemma models. allows commercial use but prohibits using outputs to train competing models and restricts certain harmful use cases.
  • cc-by-4.0 (creative commons attribution) – use and modify freely including commercially, but must credit the creator.
  • cc-by-nc-4.0 (creative commons non-commercial) – same but no commercial use at all. common on older or research-focused models.
  • bigscience openrail-m – designed specifically for ai models. permissive but includes an explicit list of prohibited use cases (weapons, surveillance, disinformation, etc.).
  • proprietary / custom – some models have fully custom licenses written by the company. always read these carefully as they vary wildly.

our top 5 local ai models for phones


for your comfort, we've picked some of the best local models to try in 2026 — all in gguf format, all phone-ready.

Table of 6 phone-ready local AI models: Gemma 4 e2b, Qwen3.5 0.8b/4b, LFM2.5 1.2b Thinking, Nemotron 3 Nano 4b, Granite 4.1 3b.

gemma 4 e2b (unsloth) – google's flagship mobile model. handles text, images, and audio in one. the most downloaded phone-ready model right now for a reason.

unsloth/gemma-4-E2B-it-GGUF · Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.

qwen3.5 0.8b (unsloth) – the smallest viable option. if your phone is older or low on ram, start here. surprisingly capable for its size.

unsloth/Qwen3.5-0.8B-GGUF · Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.

qwen3.5 4b (unsloth) – the sweet spot of the qwen family. multimodal, fast, and well-tested by the community.

unsloth/Qwen3.5-4B-GGUF · Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.

lfm2.5 1.2b thinking (liquidai) – tiny but smart. has a built-in reasoning mode that lets it think through complex questions before answering. unusually capable for 1b.

LiquidAI/LFM2.5-1.2B-Thinking-GGUF · Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.

nemotron 3 nano 4b (nvidia) – nvidia's own edge model. solid all-rounder with strong instruction following, backed by a major lab.

nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF · Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.

granite 4.1 3b (unsloth) – ibm's open model. good for structured tasks, reliable, and actively maintained.

unsloth/granite-4.1-3b-GGUF · Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.

https://huggingface.co/bartowski/ibm-granite_granite-4.1-3b-GGUF

all of these run on a modern phone with 6gb+ of free ram at q4 quantization. if your phone has less, stick to the 0.8b or 1.2b options. if you have a flagship with 8gb+ free, any model on this list will run comfortably.

start with gemma 4 e2b if you're unsure – it covers the most ground.

Stay in the loop

Get the latest AI news delivered to your inbox weekly

Thanks for subscribing!