When LLMs first launched, we had to rely on cloud-hosted versions like ChatGPT or Gemini. However, things are changing for the better. We are now seeing a wave of new AI models released every week, and many of them can run locally with good performance. This lets us perform AI inference on edge devices, or even on phones.
I have been running these models on my Windows machine using Ollama for a while, and even on my latest high-end Android phone with apps like Google AI Edge Gallery, PocketPal, and AnythingLLM. Everything has been running smoothly. I've used these models for tasks like describing images or helping me improve my writing, and inference is quite fast thanks to the modern mobile SoCs these devices run on.
BUT…
I wanted to try something different. I had an old Android phone (OnePlus 3T) lying around, and I always wondered if I could find a use for it. So, this weekend, I did a quick proof of concept.
I tested whether I could run these new AI models locally on this old Android device, which is far from powerful hardware. Fortunately, tools like Unsloth, GGUF, and llama.cpp make this possible.
When RAM is limited, quantized GGUF versions of these models let us run them with a much smaller memory footprint while maintaining nearly the same accuracy.
To get started, I first needed to figure out how to run these models. llama.cpp lets you interact with a model through a CLI or a local server. However, pre-built binaries are only published for platforms like Windows and Linux, not Android, so I had to compile it for Android myself.
🛠️Installing Termux
To compile llama.cpp for Android, I needed a shell, and Termux was the solution. The latest version on the Play Store wasn't compatible with my device, but I managed to download and install the APK from F-Droid. With Termux running, I was ready for the next step: compiling llama.cpp.
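For anyone retracing this, the setup inside Termux before building looked roughly like the following. The package names are the standard Termux ones; treat this as a sketch of my setup rather than an exact transcript:

```shell
# Inside Termux: refresh the package index, then install the toolchain
# needed to build llama.cpp (git, CMake, and the clang compiler).
PKGS="git cmake clang"
pkg update
pkg install -y $PKGS
```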
👷🏼Building llama.cpp
This was quite straightforward; all I had to do was follow the steps documented here:
https://github.com/ggml-org/llama.cpp/blob/master/docs/android.md#build-cli-on-android-using-termux
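Condensed, the build boils down to a clone plus a CMake configure and build. The linked doc has the authoritative Termux-specific flags; this is just the overall shape:

```shell
# Clone llama.cpp and do a default CPU-only build. The exact flags for
# Termux are in the Android doc linked above; this is the general shape.
REPO_URL="https://github.com/ggml-org/llama.cpp"
git clone "$REPO_URL"
cd llama.cpp
cmake -B build                              # configure step
cmake --build build --config Release -j 4   # binaries land in build/bin
```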
🏃🏻Running the model
Once llama.cpp was built, the binaries were placed in the bin folder under the build directory. Then, I could start the server using
./build/bin/llama-server -m model-path -c 2048 -n 4096 --host 0.0.0.0 --port 8080
Since I was going to connect to this server from a different device, I had to set the host to 0.0.0.0.
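Because llama-server speaks an OpenAI-compatible API, a quick way to sanity-check it from another machine on the same network is a plain curl request. The IP below is a placeholder for the phone's actual address:

```shell
# Hit the OpenAI-compatible chat endpoint exposed by llama-server.
# PHONE_IP is a placeholder; use your phone's actual LAN address.
PHONE_IP="192.168.1.50"
BODY='{"messages":[{"role":"user","content":"Say hello in five words."}],"max_tokens":32}'
curl -s --max-time 10 \
  -H "Content-Type: application/json" \
  -d "$BODY" \
  "http://$PHONE_IP:8080/v1/chat/completions"
```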
I tried four different models, all with Q4_K_M quantization.
I also had to adjust the -c and -n option values based on the available RAM.
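The -c option sets the context window (and with it the size of the KV cache), while -n caps how many tokens a reply can generate, so shrinking both is the main lever on a RAM-starved device. An illustrative low-memory variant of the command above (the values are examples, not a recommendation):

```shell
# Illustrative low-RAM settings: a smaller -c shrinks the KV cache,
# and a lower -n bounds how long a single response can get.
CTX=1024       # half of the 2048 I used by default
NPREDICT=512   # max tokens per response
./build/bin/llama-server -m model-path -c "$CTX" -n "$NPREDICT" --host 0.0.0.0 --port 8080
```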
Once the server was running, I needed a client to connect to it.
First, I configured the AnythingLLM app. llama.cpp's server is compatible with the generic OpenAI API spec, so it should have worked out of the box. However, the app kept running into issues whenever I tried to use the model, and it simply wouldn't work.
Then, I switched to Open-WebUI as a client, and it was an immediate success. As soon as I entered the details, it detected the model, and starting a chat with the model worked smoothly.
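For reference, if you run Open-WebUI via Docker, pointing it at the phone can be done through its standard OpenAI connection environment variables. The IP here is a placeholder, and while llama-server doesn't check the API key, Open-WebUI expects one to be set:

```shell
# Start Open-WebUI and point its OpenAI-compatible backend at the phone.
# PHONE_IP is a placeholder for the phone's LAN address.
PHONE_IP="192.168.1.50"
docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL="http://$PHONE_IP:8080/v1" \
  -e OPENAI_API_KEY="none" \
  --name open-webui ghcr.io/open-webui/open-webui:main
```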
I tried running each model one by one to understand their performance. As expected, Qwen 3 0.6B was the fastest. The performance of these models was just okay. In some cases, I could get 10-18 tokens per second, while the Thinking models and larger models only produced 4-5 tokens per second. So, while they worked, the results weren't very impressive.
A key point to remember is that the OnePlus 3T launched back in 2016, so I wasn't expecting groundbreaking performance. For comparison, when I run local models on the Snapdragon 8 Elite chipset, I get over 30 tokens per second on the LFM2.5 Thinking model with Q8 quantization, along with an almost instant Time to First Token (TTFT). That performance is far better, and consistent with the other models I've run on the same chipset.
Another downside of running these models on a mobile device is the heat. Extended use will always cause your device to start heating up.
In conclusion, running LLMs on older Android devices is feasible, albeit with limitations. By leveraging tools like Termux and llama.cpp, it's possible to compile and run AI models locally even on hardware that is far from cutting-edge. Performance won't match newer devices, and heat and slow token generation are real drawbacks, but this approach offers a worthwhile way to repurpose older technology. It shows that AI capabilities can be more accessible and versatile than they appear, without relying solely on high-end devices or cloud-based solutions.