The goal is to run a real LLM behind a REST API on a low-cost machine, understand the serving stack, and measure basic performance.
This tutorial walks through serving a local LLM using Docker, running entirely on CPU (Raspberry Pi 5 + SSD), and ends with latency and throughput measurements.
No cloud. No GPU. Just fundamentals.
Workflow overview
Client (curl / app)
↓ HTTP
LLM Runtime (llama.cpp)
↓
Model (Phi-3 mini, GGUF)
Install Docker
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker
Verify:
docker run hello-world
Choose the runtime
We use llama.cpp, a lightweight LLM runtime optimized for CPU, ARM, and low-memory machines. The key point: the runtime provides the REST API, not the model.
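To make the distinction concrete, this is roughly the command the container will run later (a sketch; the exact flags are wired up in the Docker Compose file below):
# llama-server is the runtime: it loads the GGUF file and serves the HTTP API
./llama-server -m /models/Phi-3-mini-4k-instruct-q4.gguf --host 0.0.0.0 --port 8000 -t 4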
Download a compatible model (GGUF)
For CPU inference, we need a small, quantized model in GGUF format.
Example:
- Phi-3 Mini 4K Instruct
- Quantization:
q4_k_m
Note: Hugging Face repos often contain multiple files. Only .gguf files are valid for llama.cpp.
Download the model:
wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf
Place it in:
./models/Phi-3-mini-4k-instruct-q4.gguf
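For example, assuming the file was downloaded into the current directory:
mkdir -p models
mv Phi-3-mini-4k-instruct-q4.gguf models/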
Build the Docker image
Create a simple Dockerfile:
FROM ubuntu:22.04
ENV DEBIAN_FRONTEND=noninteractive
# System deps
RUN apt update && apt install -y \
git cmake build-essential \
python3 python3-pip \
curl wget \
libopenblas-dev libcurl4-openssl-dev \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Build llama.cpp
RUN git clone https://github.com/ggerganov/llama.cpp.git
WORKDIR /app/llama.cpp
RUN cmake -B build \
-DGGML_BLAS=ON \
-DGGML_BLAS_VENDOR=OpenBLAS \
&& cmake --build build --config Release -j$(nproc)
EXPOSE 8000
ENTRYPOINT ["./build/bin/llama-server"]
Build:
docker build -t llama-pi .
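The build can take a while on a Pi-class CPU. Once it finishes, confirm the image exists:
docker image ls llama-pi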
Run the model with Docker Compose
What does this do? It loads the model, starts an HTTP server, uses 4 CPU threads, and exposes an OpenAI-compatible API.
version: "3.9"
services:
llama-api:
image: llama-pi
container_name: llama-api
ports:
- "8000:8000"
volumes:
- ./models:/models
restart: unless-stopped
command: >
-m /models/Phi-3-mini-4k-instruct-q4.gguf
--host 0.0.0.0
--port 8000
-t 4
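Save this as docker-compose.yml next to the models directory, then start and watch the container (the container name comes from the file above):
docker compose up -d
docker logs -f llama-api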
Test the REST API
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "user", "content": "Explain what an LLM runtime is" }
],
"max_tokens": 100
}'
If this works, it is because the runtime is alive, the model is loaded, and you have a local LLM API. You can also check which model the server has loaded; the output should look something like the listing below.
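The server exposes an OpenAI-compatible model listing endpoint:
curl http://localhost:8000/v1/models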
{
"models": [
{
"name": "Phi-3-mini-4k-instruct-q4.gguf",
"model": "Phi-3-mini-4k-instruct-q4.gguf",
"modified_at": "",
"size": "",
"digest": "",
"type": "model",
"description": "",
"tags": [
""
],
"capabilities": [
"completion"
],
"parameters": "",
"details": {
"parent_model": "",
"format": "gguf",
"family": "",
"families": [
""
],
"parameter_size": "",
"quantization_level": ""
}
}
],
"object": "list",
"data": [
{
"id": "Phi-3-mini-4k-instruct-q4.gguf",
"object": "model",
"created": 1770059503,
"owned_by": "llamacpp",
"meta": {
"vocab_type": 1,
"n_vocab": 32064,
"n_ctx_train": 4096,
"n_embd": 3072,
"n_params": 3821079552,
"size": 2392493568
}
}
]
}
Test asking a question
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi",
"messages": [
{ "role": "user", "content": "What is an LLM runtime?" }
],
"max_tokens": 150,
"temperature": 0.7
}'
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"message": {
"role": "assistant",
"content": " LLM runtime refers to the environment and infrastructure used to execute and manage the operations of a Large Language Model (LLM),
such as GPT-3 or similar advanced AI models developed by organizations like Microsoft. This runtime includes the hardware (like GPUs or TPUs),
software frameworks, and libraries that support the model's functionality, allowing it to process and generate natural language responses
or perform text-based tasks. Here are some key components of an LLM runtime:**Hardware**: LLMs require significant
computational resources, so they are typically run on specialized hardware such as high-end GPUs or TPUs.
These hardware components provide the necessary processing power and memory"
}
}
],
"created": 1770061079,
"model": "Phi-3-mini-4k-instruct-q4.gguf",
"system_fingerprint": "b7914-9f682fb64",
"object": "chat.completion",
"usage": {
"completion_tokens": 150,
"prompt_tokens": 11,
"total_tokens": 161
},
"id": "chatcmpl-Tw8HC3CxvXRGZDljCMf880F60ep4Bblg",
"timings": {
"cache_n": 0,
"prompt_n": 11,
"prompt_ms": 1078.738,
"prompt_per_token_ms": 98.06709090909091,
"prompt_per_second": 10.197100686172174,
"predicted_n": 150,
"predicted_ms": 41860.764,
"predicted_per_token_ms": 279.07176000000004,
"predicted_per_second": 3.5833077485160088
}
}
Measure basic latency
time curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages":[{"role":"user","content":"Hello"}],
"max_tokens": 50
}'
Observe:
- Total request time
- CPU usage
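To watch CPU usage while a request is in flight, docker stats gives a live per-container view (using the container name from the Compose file):
docker stats llama-api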
Time To First Token (TTFT)
curl -N http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"stream": true,
"messages":[{"role":"user","content":"Hello"}]
}'
TTFT, the delay before the first streamed token arrives, directly impacts UX.
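A rough way to measure it from the shell is curl's time_starttransfer variable, which reports when the first response byte arrives; for a streaming request this approximates TTFT:
curl -N -s -o /dev/null \
-w "time to first byte: %{time_starttransfer}s\n" \
http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"stream": true, "messages":[{"role":"user","content":"Hello"}], "max_tokens": 50}'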
Measure throughput
The llama.cpp server logs provide tokens per second.
Example:
eval time = 6.5s, tokens = 120, speed = 18.4 tok/s
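The same numbers are exposed per request in the timings block of the chat response shown earlier, so you can script a quick throughput check (assuming jq is installed):
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"Hello"}], "max_tokens": 50}' \
| jq '.timings.predicted_per_second'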
You can also test concurrency:
for i in {1..3}; do
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"Hi"}]}' &
done
wait
Observe:
- Latency degradation
- CPU saturation
Final thought
Running a local LLM isn’t about replacing cloud APIs — it’s about understanding the system you’re building on top of.
If you can serve a model yourself, you can reason about:
- cost
- latency
- scale
- failures
And that’s where real engineering starts.