Running LLMs Locally with Docker Model Runner
A comprehensive guide to running large language models locally using Docker's new Model Runner feature, including setup, API integration, and building a chat application.
Running large language models (LLMs) locally has become increasingly popular for developers who want privacy, cost control, and offline capabilities. Docker's new Model Runner feature makes this process incredibly simple by providing a unified way to run AI models with just a few commands. In this guide, I'll show you how to set up and run LLMs locally using Docker's Model Runner.
Why Run LLMs Locally?
- 🔒 Privacy - Your data never leaves your machine
- 💰 Cost Control - No per-token API costs
- 🌐 Offline Access - Works without internet connection
- ⚡ Performance - No network latency
- 🛠️ Customization - Full control over model parameters
What is Docker Model Runner?
Docker Model Runner (DMR) is a new feature that allows you to run AI models directly through Docker with a simple command-line interface. It provides:
- Easy Model Management - Pull and run models with simple commands
- OpenAI-Compatible API - Standard REST endpoints for integration
- GPU Support - Automatic GPU acceleration when available
- Model Registry - Access to pre-built models from Docker Hub
Prerequisites
- Docker Desktop (latest version)
- At least 8GB RAM (16GB+ recommended)
- Node.js (v18 or later) and pnpm, if you want to follow along with the example chat application
- Optional: NVIDIA GPU for better performance
Getting Started
Step 1: Download a Model
Docker Model Runner provides access to various pre-built models. Let's start with Google's Gemma 3 model:
docker model pull ai/gemma3
This command downloads the ai/gemma3 model from Docker Hub. You can browse other available models at the Docker Hub Model Registry.
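To confirm the download, recent versions of the Model Runner CLI can list the models available locally:
docker model list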
Step 2: Run the Model
Once downloaded, you can run the model with a single command:
docker model run ai/gemma3
That's it! You now have an LLM running locally. Run without a prompt, this command opens an interactive chat session in your terminal, and the model is also exposed through Docker's Model Runner API.
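If you'd rather get a one-shot answer than an interactive session, you can pass a prompt directly on the command line:
docker model run ai/gemma3 "Explain containers in one sentence."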
Building a Chat Application
Now that you have a running LLM, let's build a practical chat application to demonstrate its capabilities.
Step 3: Enable Host-Side TCP Support
To access the model from your host machine, you need to enable TCP support in Docker Desktop:
- Open Docker Desktop
- Go to Settings → AI
- Enable "Host-side TCP support"
- Set the port (default: 12434)
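Assuming the default port, a quick way to verify that the host-side endpoint is reachable is to query the model management route covered later in this guide; it should list the models you've pulled:
curl http://localhost:12434/models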
Step 4: Clone the Example Application
I've created a complete Node.js chat application that demonstrates how to integrate with Docker Model Runner:
git clone https://github.com/sourabhs701/node_docker_genai.git
cd node_docker_genai
Step 5: Install Dependencies
pnpm install
Step 6: Start the Application
pnpm dev
The application will start on http://localhost:3000 and connect to your running Gemma 3 model.
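The repository has its own structure, but to give you an idea of the pattern, here is a minimal, hypothetical Express route that forwards a user message to Model Runner and returns the reply. The /chat route and port are illustrative rather than taken from the repo, and it assumes Express is installed and Node 18+ for the built-in fetch:

import express from "express";

const app = express();
app.use(express.json());

// Hypothetical route: forward the user's message to Docker Model Runner's
// OpenAI-compatible endpoint and send the model's reply back to the client.
app.post("/chat", async (req, res) => {
  const response = await fetch(
    "http://localhost:12434/engines/llama.cpp/v1/chat/completions",
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: "ai/gemma3",
        messages: [{ role: "user", content: req.body.message }],
      }),
    }
  );

  const data = await response.json();
  res.json({ reply: data.choices[0].message.content });
});

app.listen(3000, () => console.log("Chat server on http://localhost:3000"));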
API Integration
Docker Model Runner provides OpenAI-compatible endpoints, making it easy to integrate with existing applications. Here's how to use the API:
Available Endpoints
The Model Runner exposes several endpoints as documented in the DMR REST API Reference:
- Chat Completions: POST /engines/llama.cpp/v1/chat/completions
- Text Completions: POST /engines/llama.cpp/v1/completions
- Embeddings: POST /engines/llama.cpp/v1/embeddings
- Model Management: GET /models, POST /models/create
Example API Usage
// Using fetch to call the chat completions endpoint
const response = await fetch(
"http://localhost:12434/engines/llama.cpp/v1/chat/completions",
{
method: "POST",
headers: {
"Content-Type": "application/json",
},
body: JSON.stringify({
model: "ai/gemma3",
messages: [
{
role: "user",
content: "Explain Docker in simple terms",
},
],
max_tokens: 150,
temperature: 0.7,
}),
}
);
const data = await response.json();
console.log(data.choices[0].message.content);
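Since the endpoints follow the OpenAI format, you can also point an existing OpenAI client at Model Runner instead of hand-rolling fetch calls. Here's a sketch using the official openai npm package; the package is an extra dependency, and the apiKey value is an arbitrary placeholder because the local server doesn't check it:

import OpenAI from "openai";

// Point the OpenAI client at the local Model Runner endpoint.
// The apiKey is a placeholder; Model Runner does not require one.
const client = new OpenAI({
  baseURL: "http://localhost:12434/engines/llama.cpp/v1",
  apiKey: "not-needed",
});

const completion = await client.chat.completions.create({
  model: "ai/gemma3",
  messages: [{ role: "user", content: "Explain Docker in simple terms" }],
});

console.log(completion.choices[0].message.content);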
Available Models
Docker Hub hosts a growing collection of pre-built models. Some popular options include:
- ai/gemma3 - Google's Gemma 3 model
- ai/llama2 - Meta's Llama 2 model
- ai/mistral - Mistral AI's model
- ai/codellama - Code-specialized Llama model
Browse all available models at the Docker Hub Model Registry.
Advanced Usage
Custom Model Configuration
Model behavior is tuned per request rather than on the command line: generation parameters such as temperature and max_tokens go in the body of the API call, exactly like in the chat completions example above.
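For instance, a request body with adjusted sampling settings might look like this (temperature, max_tokens, and top_p are standard OpenAI-style fields; exact support can vary by model and Model Runner version):

// Sampling parameters are set per request in the API body,
// not as flags on the docker model run command.
const body = {
  model: "ai/gemma3",
  messages: [{ role: "user", content: "Write a haiku about containers" }],
  temperature: 0.8, // higher values produce more varied output
  max_tokens: 200,  // upper bound on generated tokens
  top_p: 0.9,       // nucleus sampling cutoff
};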
Multiple Models
You can pull more than one model and use them side by side. Model Runner serves every pulled model through the same API endpoint and loads each one on demand:
docker model pull ai/gemma3
docker model pull ai/llama2
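Each request then selects a model by name through the model field. A small sketch, assuming both models above have been pulled (the ask helper is defined here purely for illustration):

// Helper: send a single prompt to a given model through Model Runner's
// OpenAI-compatible chat completions endpoint.
async function ask(model, prompt) {
  const response = await fetch(
    "http://localhost:12434/engines/llama.cpp/v1/chat/completions",
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
      }),
    }
  );
  const data = await response.json();
  return data.choices[0].message.content;
}

console.log(await ask("ai/gemma3", "Summarize Docker volumes in one sentence."));
console.log(await ask("ai/llama2", "Summarize Docker volumes in one sentence."));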
Conclusion
Docker Model Runner makes running LLMs locally incredibly simple and accessible. With just a few commands, you can have a fully functional AI model running on your machine with a standard API interface.
The combination of Docker's containerization benefits with easy model management creates a powerful platform for local AI development. Whether you're building chatbots, content generators, or experimenting with AI, Docker Model Runner provides the foundation you need.
Resources
- Docker Model Runner API Reference
- Docker Hub Model Registry
- Gemma 3 Model on Docker Hub
- Example Chat Application
Happy coding and AI experimenting! 🚀