Running Local AI (LLMs) on Unraid with Ollama: A System Admin's Guide

Welcome, fellow seeker of local computational enlightenment! In this guide, we will embark on a journey to harness the power of Large Language Models (LLMs) directly on your Unraid server using Ollama. This allows for private, offline AI experimentation, avoiding the costs and data privacy concerns associated with cloud-based services. We will walk through the process step-by-step, covering everything from pre-requisites to troubleshooting.

Why Ollama?

Ollama simplifies the process of running and managing LLMs. It handles the complexities of model downloads, dependencies, and resource allocation, allowing you to focus on actually using the AI. Think of it as Docker, but specifically for LLMs. It offers a simple, easy-to-use CLI, Docker compatibility, and a rapidly expanding model library.
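
To give you a sense of that simplicity, this is roughly the entire workflow once Ollama is installed (we'll set all of this up below; llama2 is just one of many available models):

ollama pull llama2          # download the model
ollama run llama2           # chat with it interactively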

Why Unraid?

Unraid provides a flexible and cost-effective platform for home servers. Its ability to manage diverse storage devices, its Docker support, and its user-friendly interface make it an ideal choice for running computationally intensive tasks like local AI.

Prerequisites

Before we begin, ensure you have the following in place:

  1. A Functioning Unraid Server: Naturally! Ensure you have a stable Unraid installation. I'm assuming you've already configured your storage arrays and are comfortable with the basic Unraid interface.
  2. Docker Enabled: Docker is fundamental to this setup. Verify that Docker is enabled within the Unraid settings. Go to Settings > Docker and confirm that Docker is enabled. If it's not, enable it and set appropriate Docker image and volume paths.
  3. Sufficient Resources: Running LLMs is resource-intensive. This means a decent CPU, ample RAM, and crucially, fast storage. While you can technically run models from your main array, you'll be much happier with the performance if you use a cache pool based on NVMe SSDs. Models will be downloaded, loaded, and accessed from this location.
  4. Optional: A Dedicated GPU (Nvidia Recommended): While Ollama can run on the CPU, you'll experience significantly faster inference times with a dedicated GPU, particularly an Nvidia GPU. Ollama automatically detects and utilizes CUDA-enabled GPUs once they are exposed to the container. If you have an AMD GPU, the setup is more complex (it requires ROCm) and is beyond the scope of this guide. A quick way to check that Unraid can see your GPU is shown just after this list.
  5. Basic Linux Command Line Familiarity: While I'll provide detailed instructions, a basic understanding of navigating and executing commands in a Linux terminal will be helpful. You'll primarily interact with the Unraid server through its web terminal.
  6. Ollama Docker Image: This guide walks through installing it step by step, but keep in mind that this image is the core component of the entire setup.
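
Before diving in, it's worth running a couple of quick sanity checks from the Unraid web terminal. This is a minimal sketch; it assumes the Nvidia Driver plugin is already installed (for nvidia-smi) and that your cache pool is mounted at /mnt/cache, so adjust paths to match your system:

nvidia-smi                      # should list your Nvidia GPU and driver version
df -h /mnt/cache                # confirm there is enough free space for models
docker info | grep -i runtime   # confirm Docker is running and which runtimes it offers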

Step-by-Step Instructions

Let's get our hands dirty! Follow these steps carefully.

Step 1: Install the Ollama Docker Container

We'll use the "Community Applications" plugin (CA) to install the Ollama Docker container. If you don't have CA installed, go to the Plugins tab, search for "Community Applications," and install it.

  1. Open the "Community Applications" tab.

  2. Search for "Ollama". You should find several community-contributed Docker templates. While any should work, look for one that has a good number of downloads and positive reviews. Select that one and click "Install".

  3. The "Add Container" window will appear. Pay close attention to the settings:

    • Name: Give the container a descriptive name, such as "ollama".

    • Repository: This should already be populated with the Docker image name, typically ollama/ollama:latest (the official image on Docker Hub) or ghcr.io/jmorganca/ollama:latest (an older address for the same project that some templates still reference). Verify it looks correct before proceeding.

    • Network Type: Leave this as "Bridge".

    • Port Mapping: By default, Ollama listens on port 11434. It's generally best to leave this as 11434:11434. If another service on your Unraid server already uses 11434, change the host port (the left-hand side), not the container port; Ollama still listens on 11434 inside the container.

    • Volume Mappings: This is critical. You need to map a host path on your Unraid server to the container's Ollama data directory. The container path is /root/.ollama. This is where the models will be stored. Here's how to do it:

      • Click "Add Path".
      • In the "Container Path" field, enter /root/.ollama.
      • In the "Host Path" field, enter a directory on your cache pool (preferably). For example, you could create a folder named ollama_models on your cache pool and enter /mnt/cache/ollama_models. This is where your downloaded LLMs will reside.
    • Optional: GPU Passthrough (Nvidia): If you have an Nvidia GPU, you need to expose it to the container so Ollama can use it. This step can be a bit tricky. The usual approach on Unraid is as follows (a roughly equivalent docker run command is shown after this list for reference):

      • First, install the "Nvidia Driver" plugin from the Apps tab (Community Applications) and reboot if prompted.
      • In the container template, enable "Advanced View" and add --runtime=nvidia to the "Extra Parameters" field.
      • Add a container variable named NVIDIA_VISIBLE_DEVICES and set it to all (or to your GPU's UUID, which the driver plugin displays).
      • Some templates instead expose the GPU through device mappings: click "Add Device" and enter /dev/nvidia0 and /dev/nvidiactl, plus /dev/nvidia-uvm and /dev/nvidia-uvm-tools for full functionality. Consult the template's and the driver plugin's documentation if unsure.
    • Click "Apply".

  4. The Docker container will now be downloaded and installed. This may take some time depending on your internet connection.
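
For reference, the template settings above map roughly onto the following docker run command. This is only a sketch to show how the pieces fit together; the container name, host path, image, and GPU flags are the example values used above, so adjust them to your setup (on Unraid you would normally let the template manage the container rather than running this by hand):

docker run -d --name ollama \
  --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all \
  -p 11434:11434 \
  -v /mnt/cache/ollama_models:/root/.ollama \
  ollama/ollama:latest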

Step 2: Start the Ollama Docker Container

  1. Go to the "Docker" tab in the Unraid interface.
  2. You should see your newly created "ollama" container.
  3. Click the "Start" button to start the container, or manage it from the terminal as shown below.
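
If you prefer the terminal, the container can also be managed with standard Docker commands from the Unraid web terminal (this assumes you named the container "ollama" as suggested earlier):

docker start ollama        # start the container
docker ps                  # confirm it is running
docker logs -f ollama      # follow the container logs (Ctrl+C to stop)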

Step 3: Test Ollama via the Command Line

  1. Click the "Console" icon for the "ollama" container. This will open a terminal window into the container.

  2. Execute the following command to verify that Ollama is running and can access models:

    ollama run llama2 "The best way to learn is..."
    
    • This command tells Ollama to download and run the llama2 model (if it's not already downloaded) and then provide the prompt "The best way to learn is...".
    • The first time you run this, it will download the model, which can take a significant amount of time depending on your internet speed.
    • Subsequent runs will be much faster.
    • If you have a GPU enabled, the inference will be significantly faster.
    • You can substitute llama2 with other available models, such as mistral or codellama. The full catalog is in the model library on the Ollama website; the ollama list command only shows models you have already downloaded. You can also exercise Ollama's HTTP API directly, as shown just after this list.
  3. Observe the output. You should see the model generating text based on your prompt. If you see an error, refer to the troubleshooting section below.
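
Because the container publishes port 11434, you can also talk to Ollama's REST API directly, either from the Unraid terminal or from any machine on your network. A minimal check, assuming the default port mapping (replace localhost with your server's IP when testing from another machine):

curl http://localhost:11434/api/tags          # list the models available locally
curl http://localhost:11434/api/generate -d '{"model": "llama2", "prompt": "Why is the sky blue?", "stream": false}'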

Step 4: Explore Other Models and Commands

Now that you have Ollama up and running, experiment with different models and commands:

  • ollama list: Lists the models you have downloaded locally.
  • ollama pull <model_name>: Downloads a specific model. For example, ollama pull llama2:13b will download the 13 billion parameter version of Llama 2.
  • ollama create <model_name> -f <Modelfile>: Allows you to create your own customized models from a Modelfile (more advanced; a minimal example follows this list).
  • ollama serve: Starts the Ollama API server (usually already running within the Docker container).
  • ollama cp <source> <destination>: Copies an existing model to a new name.
  • ollama rm <model_name>: Deletes a locally downloaded model.
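
As an illustration of ollama create, here is a minimal Modelfile sketch. The model name unraid-helper and the system prompt are just example values; FROM, PARAMETER, and SYSTEM are standard Modelfile instructions:

# Modelfile
FROM llama2
PARAMETER temperature 0.7
SYSTEM You are a concise assistant that answers questions about Unraid administration.

Build and run the customized model from the container console:

ollama create unraid-helper -f Modelfile
ollama run unraid-helper "How do I check the health of my array?"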

Step 5: Using Ollama in your Applications

Ollama exposes an HTTP API that you can use from your own applications. The easiest way to get started is with the official Python or JavaScript client libraries. Here's a minimal Python example (install the client library first with pip install ollama):

from ollama import Client

# Use your Unraid server's IP instead of localhost if this script runs on another machine
client = Client(host='http://localhost:11434')
response = client.generate(model='llama2', prompt='Why is the sky blue?')
print(response['response'])
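
The same client library also offers a chat-style interface, which is what you'll usually want for multi-turn conversations. A short sketch, again assuming the llama2 model has already been pulled:

# Continue from the client created above; pass the conversation history as a list of messages
messages = [{'role': 'user', 'content': 'Give me one tip for organizing Docker containers on Unraid.'}]
response = client.chat(model='llama2', messages=messages)
print(response['message']['content'])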

Troubleshooting

  • "Ollama: command not found" Error: This typically means the Ollama executable is not in your container's PATH. Ensure you're executing the command from within the container's terminal (using the "Console" button in the Unraid Docker interface).
  • Slow Inference Times: This is usually a sign that inference is running on the CPU. Ensure you have a dedicated GPU, that the Nvidia Driver plugin is correctly installed on Unraid, and that the container actually has access to the GPU. Check the Ollama container logs for errors related to GPU or CUDA initialization (a few useful commands for these checks are shown after this list). You may also need to update your Nvidia drivers.
  • "No space left on device" Error: This means your Docker image or the volume where you're storing the models is full. Increase the size of your Docker image or ensure you're storing the models on a volume with sufficient free space (e.g., your cache pool). Consider cleaning up unused Docker images and containers.
  • Model Download Issues: If you're having trouble downloading models, check your internet connection. Also, verify that the model name is correct. Try downloading a smaller model first to rule out network issues.
  • Container Won't Start: Examine the container logs. Click the "Logs" icon in the Unraid Docker interface. This will provide valuable clues about why the container is failing to start. Common causes include port conflicts, incorrect volume mappings, or issues with the Nvidia drivers.
  • Unraid Crashing/Freezing: Running LLMs puts a heavy load on your system. Make sure you have enough free RAM for the model you are loading and that your CPU is not overheating; if in doubt, start with a smaller model (for example, a 7B-parameter variant).
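
A few commands that help with the checks above, run from the Unraid web terminal. They assume the container is named "ollama" and that models are stored on /mnt/cache, as in the earlier examples:

docker logs ollama 2>&1 | grep -i -E "cuda|gpu"    # look for GPU/CUDA initialization errors
docker exec -it ollama nvidia-smi                  # confirm the GPU is visible inside the container
df -h /mnt/cache                                   # check free space where models are stored
docker system df                                   # see how much space Docker images and volumes use
docker image prune                                 # remove dangling images to reclaim space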

Conclusion

Congratulations! You've successfully installed and configured Ollama on your Unraid server. You now have a powerful local AI experimentation platform at your fingertips. Explore the available models, experiment with different prompts, and integrate Ollama into your own applications. The possibilities are vast!

Remember, this is just the beginning. As Ollama evolves and the field of local AI continues to advance, be sure to stay up-to-date with the latest developments and best practices. Embrace the power of local computation and unlock the full potential of artificial intelligence within the comfort of your own home.

Now, about that crucial piece of hardware we identified at the beginning: the NVMe SSD for your Unraid cache pool. While Ollama can run on your array, you'll quickly realize the performance bottleneck. Downloading, loading, and running models from spinning rust is painfully slow. Investing in a dedicated NVMe SSD, or even better, a pair of them in a BTRFS RAID1 configuration, will dramatically improve the responsiveness and usability of your local AI setup. Think of it as fueling your AI engine with high-octane performance!

Call to Action:

If you're serious about running local AI models on Unraid with Ollama, don't skimp on storage. Upgrade your cache pool with an NVMe SSD today! You'll thank yourself every time you interact with your local LLMs. Consider a reputable brand like Samsung, Western Digital, or Crucial, and choose a capacity that meets your current and future model storage needs (500GB or larger is recommended). Make sure your motherboard supports NVMe and that you have an available M.2 slot. Go forth and unleash the power of local AI!