
How to Run a Local Vision Model using llama.cpp on Linux

June 6, 2025, by LLM Hard Drive Store
Tags: vision, local AI, llama.cpp

In this article, we will learn how to run a vision model with llama.cpp on Linux.

Prerequisites

Before you begin, ensure you have the following:

  • A Linux distribution (e.g., Ubuntu 20.04 or later, Debian).
  • Basic familiarity with the terminal.
  • Git, a C++ compiler (e.g., g++), and CMake installed.
  • A compatible GPU with appropriate drivers for GPU acceleration (you can check your driver setup with the command shown after this list).
  • At least 8GB of RAM (more for larger models).
  • Sufficient disk space for the model files.
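
If you plan to use an NVIDIA GPU, a quick way to confirm the driver is working is the check below (AMD and Intel GPUs have their own equivalent tools):

# Prints the driver version, GPU model, and available VRAM
nvidia-smi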

Step 1: Install llama.cpp with GPU build

Follow our earlier installation article or the official llama.cpp build guide to install llama.cpp with GPU support.
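
If you prefer to build from source directly, a minimal sketch of a CUDA-enabled build looks like this (the CMake flag assumes an NVIDIA GPU with the CUDA toolkit installed; other backends such as Vulkan or HIP use different flags):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Configure with the CUDA backend enabled
cmake -B build -DGGML_CUDA=ON
# Compile the tools, including llama-server, using all available cores
cmake --build build --config Release -j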

Step 2: Obtain a model

To run vision inference with llama.cpp, you need a compatible model in GGUF format together with its multimodal projector (mmproj) GGUF file. You can download pre-trained models from sources like Hugging Face.

For example, to run a quantized Qwen2.5-VL-7B you would download the following model GGUF and mmproj GGUF files:

Qwen2.5-VL-7B-Instruct-Q5_K_M.gguf

Qwen2.5-VL-7B-Instruct-mmproj-f16.gguf

To load these files, your combined system RAM and GPU VRAM must be larger than the files listed above: roughly 5.44 GB for the Q5_K_M quant plus 1.35 GB for the mmproj file.

Place the model files in the llama.cpp directory.
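
One way to fetch both files is with the Hugging Face CLI. The sketch below assumes you have Python available, and <repo-id> is a placeholder for the actual model repository on Hugging Face (check the model page for the exact repository name):

pip install -U "huggingface_hub[cli]"
# Download the quantized model and its mmproj file into the current directory
huggingface-cli download <repo-id> Qwen2.5-VL-7B-Instruct-Q5_K_M.gguf --local-dir .
huggingface-cli download <repo-id> Qwen2.5-VL-7B-Instruct-mmproj-f16.gguf --local-dir .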

Step 3: Run llama.cpp server

Once the model is downloaded and llama.cpp is built, you can start the inference server.

Example Command

To run the server:

./build/bin/llama-server -m Qwen2.5-VL-7B-Instruct-Q5_K_M.gguf --mmproj Qwen2.5-VL-7B-Instruct-mmproj-f16.gguf --n-gpu-layers 10
  • -m: Path to the GGUF model file.
  • --mmproj: Path to the mmproj GGUF model file.
  • --n-gpu-layers: Number of model layers to offload to VRAM; adjust this based on the VRAM available.
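
If your GPU has enough VRAM for the whole model, you can offload every layer. In this sketch the value 99 is simply a number larger than the model's layer count, which makes llama.cpp offload all layers:

./build/bin/llama-server -m Qwen2.5-VL-7B-Instruct-Q5_K_M.gguf --mmproj Qwen2.5-VL-7B-Instruct-mmproj-f16.gguf --n-gpu-layers 99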

Step 4: Test an image in the web UI at http://localhost:8080/

[Screenshot: llama.cpp web UI with an uploaded reptile image and the model's response]

You can see that the model has analyzed the reptile image.
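
You can also confirm the server is responding from the terminal before opening the browser; assuming the default host and port, a quick check looks like this:

# Returns a small JSON status once the model has finished loading
curl http://localhost:8080/health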

Step 5: Run an API request using Node.js (Optional)

First, install the axios HTTP client:

npm install axios

Then create a script with the following contents:
import axios from 'axios';
import fs from 'fs';

// Read an image from disk and encode it as a base64 string
function encodeImage(imagePath) {
  const image = fs.readFileSync(imagePath);
  return Buffer.from(image).toString('base64');
}

// Replace with the path to your image file
const base64Image = encodeImage('Path to your image file');

async function analyzeImage() {
  const response = await axios.post(
    'http://localhost:8080/v1/chat/completions',
    {
      // llama-server answers with whatever model it was started with,
      // so the model name here is only a placeholder
      model: 'gpt-3.5-turbo',
      messages: [
        {
          role: 'system',
          content: 'You are a helpful assistant that analyzes images',
        },
        {
          role: 'user',
          content: [
            {
              type: 'image_url',
              image_url: {
                // Send the image inline as a base64 data URL
                url: `data:image/jpeg;base64,${base64Image}`,
              },
            },
          ],
        },
      ],
      temperature: 0,
    },
    {
      headers: {
        // llama-server does not require an API key by default
        Authorization: 'Bearer no-key',
        'Content-Type': 'application/json',
      },
    }
  );
  console.log(response.data.choices[0].message.content);
}

await analyzeImage();
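
Because the script uses ES module imports and top-level await, save it with an .mjs extension (the name analyze.mjs below is just an example) and run it with a recent Node.js version:

node analyze.mjs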
Example output:

The image shows a close-up of a crocodile's eye. The eye is large and has a vertical slit pupil, which is characteristic of crocodiles. The skin around the eye is textured with scales, and the eye itself appears to be slightly open. The background is blurred, focusing attention on the eye.

For detailed help, consult the llama.cpp documentation or ask an AI chat assistant for troubleshooting.

Conclusion

You’ve now set up llama.cpp and run a local vision model on Linux! This powerful setup allows you to experiment with vision-based AI efficiently. Explore different vision models, fine-tune parameters, and integrate them into your projects.