When LLMs first launched, we had to rely on cloud-hosted versions like ChatGPT or Gemini. However, things are changing for the better. We are now seeing a wave of new AI models released every week, and many of them can run locally with good performance. This lets us perform AI inference on edge devices, or even on phones.
I have been running these models on my Windows machine using Ollama for a while, and even on my latest high-end Android phone with apps like Google AI Edge Gallery, PocketPal, and AnythingLLM. Everything has been running smoothly. I've used these models for tasks like describing images or helping me improve my writing, and inference is quite fast thanks to the modern mobile SoCs these devices run on.
BUT…
I wanted to try something different. I had an old Android phone (OnePlus 3T) lying around, and I always wondered if I could find a use for it. So, this weekend, I did a quick proof of concept.
I tested whether I could run these new AI models locally on this old Android device, which is far from powerful hardware. Fortunately, tools like Unsloth, GGUF, and llama.cpp make this possible.
When RAM is limited, quantized GGUF versions of these models let us run them with a much smaller memory footprint while maintaining nearly the same accuracy.
To get started, I first needed to figure out how to run these models. llama.cpp lets you interact with a model through a CLI or a local server. However, pre-built binaries are only published for platforms like Windows and Linux, not Android, so I had to compile it for Android myself.
🛠️Installing Termux
To compile llama.cpp for Android, I needed a shell, and Termux was the solution. The latest version on the Play Store wasn't compatible with my device, but I managed to download and install the APK from F-Droid. With Termux running, I was ready for the next step: compiling llama.cpp.
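For anyone retracing this, the setup inside Termux before building looked roughly like the following. The package names are the standard Termux ones; treat this as a sketch of my setup rather than an exact transcript:

```shell
# Inside Termux: refresh the package index, then install the toolchain
# needed to build llama.cpp (git, CMake, and the clang compiler).
PKGS="git cmake clang"
pkg update
pkg install -y $PKGS
```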
👷🏼Building llama.cpp
This was quite straightforward; all I had to do was follow the steps documented here:
https://github.com/ggml-org/llama.cpp/blob/master/docs/android.md#build-cli-on-android-using-termux
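Condensed, the build boils down to a clone plus a CMake configure and build. The linked doc has the authoritative Termux-specific flags; this is just the overall shape:

```shell
# Clone llama.cpp and do a default CPU-only build. The exact flags for
# Termux are in the Android doc linked above; this is the general shape.
REPO_URL="https://github.com/ggml-org/llama.cpp"
git clone "$REPO_URL"
cd llama.cpp
cmake -B build                              # configure step
cmake --build build --config Release -j 4   # binaries land in build/bin
```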
🏃🏻Running the model
Once llama.cpp was built, the binaries were placed in the bin folder under the build directory. Then, I could start the server using
./build/bin/llama-server -m model-path -c 2048 -n 4096 --host 0.0.0.0 --port 8080
Since I was going to connect to this server from a different device, I had to set the host to 0.0.0.0.
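Because llama-server speaks an OpenAI-compatible API, a quick way to sanity-check it from another machine on the same network is a plain curl request. The IP below is a placeholder for the phone's actual address:

```shell
# Hit the OpenAI-compatible chat endpoint exposed by llama-server.
# PHONE_IP is a placeholder; use your phone's actual LAN address.
PHONE_IP="192.168.1.50"
BODY='{"messages":[{"role":"user","content":"Say hello in five words."}],"max_tokens":32}'
curl -s --max-time 10 \
  -H "Content-Type: application/json" \
  -d "$BODY" \
  "http://$PHONE_IP:8080/v1/chat/completions"
```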
I tried four different models, all with Q4_K_M quantization.
I also had to adjust the -c and -n option values based on the available RAM.
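The -c option sets the context window (and with it the size of the KV cache), while -n caps how many tokens a reply can generate, so shrinking both is the main lever on a RAM-starved device. An illustrative low-memory variant of the command above (the values are examples, not a recommendation):

```shell
# Illustrative low-RAM settings: a smaller -c shrinks the KV cache,
# and a lower -n bounds how long a single response can get.
CTX=1024       # half of the 2048 I used by default
NPREDICT=512   # max tokens per response
./build/bin/llama-server -m model-path -c "$CTX" -n "$NPREDICT" --host 0.0.0.0 --port 8080
```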
Once the server was running, I needed a client to connect to it.
First, I configured the AnythingLLM app. llama.cpp's server is compatible with the generic OpenAI API spec, so it should have worked out of the box. However, the app kept running into issues whenever I tried to use the model, and it simply wouldn't work.
Then, I switched to Open-WebUI as a client, and it was an immediate success. As soon as I entered the details, it detected the model, and starting a chat with the model worked smoothly.
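For reference, if you run Open-WebUI via Docker, pointing it at the phone can be done through its standard OpenAI connection environment variables. The IP here is a placeholder, and while llama-server doesn't check the API key, Open-WebUI expects one to be set:

```shell
# Start Open-WebUI and point its OpenAI-compatible backend at the phone.
# PHONE_IP is a placeholder for the phone's LAN address.
PHONE_IP="192.168.1.50"
docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL="http://$PHONE_IP:8080/v1" \
  -e OPENAI_API_KEY="none" \
  --name open-webui ghcr.io/open-webui/open-webui:main
```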
I tried running each model one by one to understand their performance. As expected, Qwen 3 0.6B was the fastest. The performance of these models was just okay. In some cases, I could get 10-18 tokens per second, while the Thinking models and larger models only produced 4-5 tokens per second. So, while they worked, the results weren't very impressive.
A key point to remember is that the OnePlus 3T launched back in 2016, so I wasn't expecting groundbreaking performance. For comparison, when I run local models on the Snapdragon 8 Elite chipset, I get over 30 tokens per second on the LFM2.5 Thinking model with Q8 quantization, along with an almost instant Time to First Token (TTFT). That performance is far better, and consistent with the other models I've run on the same chipset.
Another downside of running these models on a mobile device is the heat. Extended use will always cause your device to start heating up.
In conclusion, running LLMs on older Android devices is feasible, albeit with limitations. By leveraging tools like Termux and llama.cpp, it's possible to compile and run AI models locally even on hardware that is far from cutting-edge. Performance won't match newer devices, and heat and slow token generation are real drawbacks, but this approach offers a worthwhile way to repurpose older technology. It shows that AI capabilities can be more accessible and versatile than they appear, without relying solely on high-end devices or cloud-based solutions.