
How to Install and Run Local AI using llama.cpp on Linux

June 5, 2025, by LLM Hard Drive Store

llama.cpp is a lightweight, efficient framework for running large language models on local machines with minimal dependencies. It’s optimized for performance, especially on CPUs, and supports GPU acceleration. This guide will walk you through the steps to install and run llama.cpp on a Linux system, making it accessible even for those new to the process.

Prerequisites

Before you begin, ensure you have the following:

  • A Linux distribution (e.g., Ubuntu 20.04 or later, Debian).
  • Basic familiarity with the terminal.
  • Git, a C++ compiler (e.g., g++), and CMake installed (you can verify these with the quick check after this list).
  • Optional: A compatible GPU with appropriate drivers for GPU acceleration. This guide uses an NVIDIA GPU.
  • At least 8GB of RAM (more for larger models).
  • Sufficient disk space for the model files (e.g., 5GB+ for smaller models like LLaMA 7B).
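
A quick way to verify these prerequisites from the terminal (the nvidia-smi check only applies if you plan to use an NVIDIA GPU):

git --version
g++ --version
cmake --version
nvidia-smi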

Step 1: Install Dependencies

First, install the necessary tools and libraries. Open a terminal and run the appropriate commands for your Linux distribution.

For Ubuntu/Debian-based systems:

sudo apt update
sudo apt install git build-essential cmake libopenblas-dev
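
For Fedora/RHEL-based systems, the equivalent packages should be the following (package names may differ slightly between releases):

sudo dnf install git gcc-c++ cmake openblas-devel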

Step 2: Clone the llama.cpp Repository

Clone the llama.cpp repository from GitHub:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
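
If you later want to pick up new features and bug fixes, you can update the clone and rebuild (the build step is covered next):

cd llama.cpp
git pull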

Step 3: Build llama.cpp

The build process depends on whether you want to use CPU-only or GPU acceleration.

CPU-Only Build

To build llama.cpp for CPU usage:

cmake -B build
cmake --build build --config Release
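
The build can take a few minutes. You can speed it up by building on all CPU cores, then confirm the binaries were produced:

cmake --build build --config Release -j $(nproc)
ls build/bin/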

GPU Build (NVIDIA)

Make sure the CUDA toolkit is installed. For NVIDIA GPU support with CUDA:

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
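
If CMake cannot find CUDA, first confirm that the toolkit and driver are visible (nvcc ships with the CUDA toolkit, nvidia-smi with the driver). Also note that older llama.cpp releases used the -DLLAMA_CUDA=ON flag, while current releases expect -DGGML_CUDA=ON:

nvcc --version
nvidia-smi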

If you encounter issues, check the llama.cpp documentation or ask an AI chat assistant for troubleshooting help.

Step 4: Obtain a Model

To run llama.cpp, you need a compatible model in GGUF format. You can download pre-trained models from sources like Hugging Face. For example, to download a quantized LLaMA model:

  • Visit a model repository such as TheBloke on Hugging Face.
  • Download a GGUF model file (e.g., llama-7b.Q4_0.gguf).
  • Place the model file in the llama.cpp directory.

A command-line alternative is shown below. Note: Model sizes vary (e.g., 7B, 13B, 70B), and larger models require more RAM and storage. A 7B model typically needs 4-6 GB of RAM for inference.
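
As an alternative to downloading through the browser, you can fetch a GGUF file with the Hugging Face CLI; a minimal sketch, where the repository and file names are examples and should be replaced with the model you actually want:

pip install -U "huggingface_hub[cli]"
huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_0.gguf --local-dir .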

Step 5: Run llama.cpp

Once the model is downloaded and llama.cpp is built, you can run inference with the llama-cli binary in build/bin; the same binary is used for both CPU and GPU builds.

Example Command

To run the model interactively:

./build/bin/llama-cli -m llama-7b.Q4_0.gguf -p "Hello, how can I assist you today?"

  • -m: Path to the GGUF model file.
  • -p: The prompt to start the conversation.

For GPU builds, add --n-gpu-layers (or -ngl) to offload model layers to the GPU. Experiment to find how many layers fit in your GPU's VRAM.

./build/bin/llama-cli -m llama-7b.Q4_0.gguf -p "Hello, how can I assist you today?" --n-gpu-layers 8

Follow the prompts to interact with the model.
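
Recent builds of llama-cli also provide a chat-style conversation mode for models that include a chat template; a minimal sketch, assuming the -cnv (--conversation) flag is available in your version:

./build/bin/llama-cli -m llama-7b.Q4_0.gguf -cnv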

Step 6: Optimize Performance (Optional)

To improve performance, you can tweak llama.cpp settings:

  • Quantization: Use smaller quantized models (e.g., Q4_0, Q5_0) to reduce memory usage.
  • Thread Count: Adjust the number of CPU threads with the -t flag (e.g., -t 8 for 8 threads).

Example with optimizations:

./build/bin/llama-cli -m llama-7b.Q4_0.gguf -p "Hello, how can I assist you today?" --n-gpu-layers 8 -t 8

Troubleshooting

  • Build Errors: Ensure all dependencies are installed and your compiler supports C++17.
  • Out of Memory: Use a smaller model or increase system RAM/swap space (an example of adding swap follows this list).
  • GPU Issues: Verify CUDA installation and compatibility with your GPU.
  • Model Not Found: Double-check the model file path and ensure it’s in GGUF format.
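
For the out-of-memory case, one option is to add a swap file; for example, to create and enable an 8 GB swap file (adjust the size to your system):

sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile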

For detailed help, consult the llama.cpp documentation or ask an AI chat assistant for troubleshooting.

Conclusion

You’ve now installed and run llama.cpp on Linux! This powerful tool allows you to experiment with LLMs efficiently, whether on CPU or GPU. Explore different models, fine-tune parameters, and integrate llama.cpp into your projects. For advanced usage, check the official documentation for features like fine-tuning or server mode.
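
As a taste of server mode, the llama-server binary (built alongside llama-cli) exposes the model over an HTTP API; a minimal sketch, assuming the default build produced it:

./build/bin/llama-server -m llama-7b.Q4_0.gguf --port 8080

Happy AI tinkering!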