Kotlin-LlamaCpp
Implementing GGUF Local Inference on Arm-based Android Devices with Ease
Native AI inference for Arm-based Android devices
Run GGUF models directly on your Arm-powered Android device with optimized performance and zero cloud dependency!
This is an Android binding for llama.cpp written in Kotlin, designed specifically for native Android applications running on Arm architecture. Built from the ground up to leverage Arm CPU capabilities, this library brings efficient large language model inference to mobile devices. The project is inspired by cui-llama.rn and llama.cpp (inference of LLaMA models in pure C/C++), and is tailored specifically for Arm-based Android development in Kotlin.
This is a very early alpha version, and the API may change in the future.
News
- ContentResolver support has been implemented for newer versions of Android to allow local file access
- The library has been updated to comply with the 16 KB page size alignment now enforced on newer Android devices
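On devices where the model lives in shared storage, a typical way to obtain it is through the Storage Access Framework together with the ContentResolver mentioned above. The snippet below is a minimal sketch using only standard Android APIs; ModelPickerActivity, onModelPicked, and openPicker are illustrative names, not part of this library.

import android.net.Uri
import androidx.activity.ComponentActivity
import androidx.activity.result.contract.ActivityResultContracts

// Minimal sketch: letting the user pick a local GGUF model through the Storage
// Access Framework. `onModelPicked` is a placeholder for your own code.
class ModelPickerActivity : ComponentActivity() {

    private val pickModel = registerForActivityResult(
        ActivityResultContracts.OpenDocument()
    ) { uri: Uri? ->
        uri?.let { onModelPicked(it) }
    }

    private fun onModelPicked(uri: Uri) {
        // Hand the picked Uri over to your ViewModel; the ContentResolver passed
        // to LlamaHelper is what ultimately reads the file contents.
    }

    fun openPicker() {
        // GGUF files have no dedicated MIME type, so accept any document and
        // validate the file name or header yourself if needed.
        pickModel.launch(arrayOf("*/*"))
    }
}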
Why On-Device AI on Arm?
Most modern Android devices run on Arm processors, making Arm the dominant architecture for mobile AI applications. Kotlin-LlamaCpp is built specifically for this ecosystem, enabling:
- True On-Device AI: Run large language models entirely on your Arm-based phone or tablet—no internet required, complete privacy
- Arm-Optimized Performance: Automatic detection and utilization of Arm CPU features (i8mm, dotprod) for hardware-accelerated inference
- Mobile-First Design: Built from the ground up for Arm’s power-efficient architecture, balancing performance with battery life
- Real-World Usability: Context management and batch interruption designed for the constraints of mobile Arm processors
The vast majority of Android devices today are powered by Arm processors (Snapdragon, MediaTek, Exynos, Tensor). This library is optimized specifically for this architecture, bringing desktop-class AI capabilities to the devices already in your users’ pockets.
Features
- Native Arm Architecture Support: Built for arm64-v8a with automatic CPU feature detection (i8mm and dotprod flags)
- Hardware-Accelerated Inference: Leverages Arm-specific instruction sets for optimized matrix operations
- Efficient Mobile Inference: Context Shift support (from kobold.cpp) enables longer conversations without overflowing the context window
- Kotlin-First Design: Helper class to handle initialization and context management seamlessly
- Flexible Control: Support for stopping prompt processing between batches, crucial for responsive mobile UIs
- Progress Monitoring: Real-time callback support for tracking inference progress
- Tokenizer Support: Vocabulary-only mode with synchronous tokenizer functions
- Seamless Android Integration: Works naturally with Android development workflows and lifecycle management
Demo App
You can find a complete, ready-to-build demo application in the /app directory of this repository. The demo showcases how to integrate the library into a standard Android app, including model loading from local storage, handling inference in a ViewModel, and displaying generated text in a Jetpack Compose UI.
Installation
Add the following to your project’s build.gradle:
dependencies {
    implementation 'io.github.ljcamargo:llamacpp-kotlin:0.2.0'
}
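If your project uses the Kotlin Gradle DSL (build.gradle.kts), the equivalent declaration is:

dependencies {
    implementation("io.github.ljcamargo:llamacpp-kotlin:0.2.0")
}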
Model Requirements
You’ll need a GGUF model file to use this library. You can:
- Download pre-converted GGUF models from HuggingFace
- Convert your own models following the llama.cpp quantization guide
Quantized models (Q4, Q5, Q8) work particularly well on Arm mobile processors, providing an excellent balance between model quality and inference speed.
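As an optional sanity check before loading, you can verify that a chosen file really is in GGUF format: every GGUF file begins with the ASCII magic bytes "GGUF". The helper below is an illustrative sketch, not part of the library:

import java.io.File

// Illustrative helper, not part of the library: GGUF files always start with the
// four ASCII magic bytes 'G' 'G' 'U' 'F', so a cheap header check catches obvious
// mistakes (e.g. passing an old ggml .bin file) before a costly model load.
fun looksLikeGguf(path: String): Boolean {
    val header = ByteArray(4)
    File(path).inputStream().use { stream ->
        if (stream.read(header) != 4) return false
    }
    return header.contentEquals(
        byteArrayOf('G'.code.toByte(), 'G'.code.toByte(), 'U'.code.toByte(), 'F'.code.toByte())
    )
}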
Usage
The example ViewModel below shows basic usage of the LlamaHelper class:
class MainViewModel(private val contentResolver: ContentResolver) : ViewModel() {

    private val viewModelJob = SupervisorJob()
    private val scope = CoroutineScope(Dispatchers.IO + viewModelJob)

    private val _llmFlow = MutableSharedFlow<LlamaHelper.LLMEvent>(
        replay = 0,
        extraBufferCapacity = 64,
        onBufferOverflow = BufferOverflow.DROP_OLDEST
    )
    val llmFlow: SharedFlow<LlamaHelper.LLMEvent> = _llmFlow.asSharedFlow()

    private val _generatedText = MutableStateFlow("")
    val generatedText = _generatedText.asStateFlow()

    private val llamaHelper by lazy {
        LlamaHelper(
            contentResolver = contentResolver, // recent Android versions require a resolver to access local files
            scope = scope,
            sharedFlow = _llmFlow,
        )
    }

    // Load a GGUF model into memory
    fun loadModel() {
        llamaHelper.load(
            path = "/sdcard/Download/model.gguf",
            contextLength = 2048,
        ) {
            // MODEL SUCCESSFULLY LOADED (it: context id)
            // TODO: Update your UI to allow prompts
        }
    }

    // The model must be loaded before submitting a prompt, or an exception will be thrown
    fun generate(prompt: String) {
        scope.launch {
            llamaHelper.predict(prompt)
            llmFlow.collect { event ->
                when (event) {
                    is LlamaHelper.LLMEvent.Started -> {
                        // Update your UI to show that generation has started
                    }
                    is LlamaHelper.LLMEvent.Ongoing -> {
                        // A new token has been generated, update your UI accordingly,
                        // e.g. _generatedText.value += event.word
                    }
                    is LlamaHelper.LLMEvent.Done -> {
                        // Update your UI to show that generation has completed
                        llamaHelper.stopPrediction()
                    }
                    is LlamaHelper.LLMEvent.Error -> {
                        // Update your UI to show the generation error
                        llamaHelper.stopPrediction()
                    }
                    else -> {}
                }
            }
        }
    }

    // You can abort a model load or a prediction in progress
    fun abort() {
        llamaHelper.abort()
    }

    // Don't forget to release resources when the ViewModel is destroyed
    override fun onCleared() {
        super.onCleared()
        llamaHelper.abort()
        llamaHelper.release()
    }
}
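For reference, a Jetpack Compose screen observing the generatedText state from the ViewModel above could look like the following sketch; ChatScreen and its layout are illustrative, and only standard Compose APIs are used.

import androidx.compose.foundation.layout.Column
import androidx.compose.foundation.layout.Row
import androidx.compose.foundation.layout.padding
import androidx.compose.material3.Button
import androidx.compose.material3.Text
import androidx.compose.material3.TextField
import androidx.compose.runtime.*
import androidx.compose.ui.Modifier
import androidx.compose.ui.unit.dp

// Illustrative sketch, not part of the library: a Compose screen driven by the
// MainViewModel defined above.
@Composable
fun ChatScreen(viewModel: MainViewModel) {
    // generatedText is exposed as a StateFlow, so it can be collected as Compose state.
    val generatedText by viewModel.generatedText.collectAsState()
    var prompt by remember { mutableStateOf("") }

    Column(modifier = Modifier.padding(16.dp)) {
        Text(text = generatedText)
        TextField(value = prompt, onValueChange = { prompt = it })
        Row {
            Button(onClick = { viewModel.generate(prompt) }) { Text("Generate") }
            Button(onClick = { viewModel.abort() }) { Text("Abort") }
        }
    }
}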
You can also use LlamaContext.kt directly to handle several contexts or other complex features.
Performance on Arm Architecture
Kotlin-LlamaCpp is optimized specifically for arm64-v8a, the architecture powering the vast majority of modern Android devices:
- Arm CPU Extensions: Automatic detection and utilization of i8mm (integer matrix multiplication) and dotprod instructions provide significant performance improvements for AI workloads
- Memory Efficiency: Designed to work within mobile device constraints while maintaining responsive performance
- Batch Interruption: Critical for Arm mobile processors, allowing the UI to remain responsive during inference
- Power Efficiency: Native Arm optimizations help balance inference performance with battery life—essential for mobile use cases
- 64-bit Optimization: The arm64-v8a platform is recommended for better memory allocation and performance
The library currently supports arm64-v8a Android devices.
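For the curious, the detection the library performs natively boils down to checking the Arm64 CPU feature flags exposed by the Linux kernel. The sketch below is purely illustrative (the library handles this automatically); asimddp is the flag for dotprod and i8mm the flag for int8 matrix multiplication.

import java.io.File

// Illustrative only: Kotlin-LlamaCpp performs its own feature detection natively.
// On Arm64 Linux kernels, supported CPU extensions are listed on the "Features"
// line of /proc/cpuinfo ("asimddp" for dotprod, "i8mm" for int8 matrix multiply).
object CpuFeatures {
    private val features: Set<String> by lazy {
        runCatching {
            File("/proc/cpuinfo").readLines()
                .firstOrNull { it.startsWith("Features") }
                ?.substringAfter(":")
                ?.trim()
                ?.split(" ")
                ?.toSet()
        }.getOrNull() ?: emptySet()
    }

    val hasDotProd: Boolean get() = "asimddp" in features
    val hasI8mm: Boolean get() = "i8mm" in features
}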
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT
Acknowledgments
This project builds upon the work of several excellent projects:
- llama.cpp by Georgi Gerganov
- cui-llama.rn
- llama.rn