🚀 Deployment & Serving¶

Production-ready platforms and tools for deploying and serving LLMs at scale.

📋 Table of Contents¶

Overview
Tools List
Selection Guide
Quick Start

Overview¶

Deployment platforms provide:

- Model Serving: High-performance inference APIs
- Quantization: Reduce model size and cost
- Scalability: Auto-scaling and load balancing
- Edge Deployment: Run on CPUs, mobile, browsers

Tools List¶

Repo	Description
vllm-project/vllm	High-throughput and memory-efficient LLM serving
ollama/ollama	Run LLMs locally with ease
huggingface/text-generation-inference	Production-ready LLM serving from HuggingFace
ggerganov/llama.cpp	LLM inference in C/C++ for CPU and edge
bentoml/OpenLLM	Open platform for operating LLMs in production
skypilot-org/skypilot	Run LLMs on any cloud with one command
neuralmagic/nm-vllm	Optimized vLLM with sparsity support
triton-inference-server/server	NVIDIA's production inference serving

Selection Guide¶

By Use Case¶

☁️ Cloud Production

- vLLM: Best throughput and efficiency
- Text Generation Inference: HuggingFace ecosystem
- BentoML: MLOps platform
- Triton: Multi-model, enterprise

💻 Local Development

- Ollama: Easiest setup, Mac/Windows/Linux
- llama.cpp: Maximum compatibility
- OpenLLM: Python-native

📱 Edge/Mobile

- llama.cpp: CPU optimized
- ONNX Runtime: Cross-platform
- MLC LLM: Mobile and WebGPU

💰 Cost Optimization

- vLLM: PagedAttention memory savings
- llama.cpp: Quantization (4-bit, 8-bit)
- NeuralMagic: Sparse models

By Scale¶

Small (<100 req/min)

- Ollama, llama.cpp, OpenLLM

Medium (100-1000 req/min)

- vLLM, Text Generation Inference

Large (1000+ req/min)

- vLLM + multi-GPU
- Triton + orchestration
- SkyPilot + cloud scaling

Quick Start¶

vLLM - Production Serving¶

# Install
pip install vllm

# Serve model across 2 GPUs
# (a comment after a trailing backslash would break the line continuation)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --tensor-parallel-size 2 \
    --max-model-len 4096

# API compatible with OpenAI
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "What is AI?",
        "max_tokens": 100
    }'

Python Client:

from openai import OpenAI

# Point to vLLM server
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ]
)

print(response.choices[0].message.content)

Ollama - Local Models¶

# Install (Mac)
brew install ollama

# Or Linux
curl -fsSL https://ollama.com/install.sh | sh

# Run model
ollama run llama2

# Or specific model
ollama run codellama:13b

# API server (runs automatically)
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'

Python Client:

import ollama

# Chat
response = ollama.chat(
    model='llama2',
    messages=[
        {'role': 'user', 'content': 'Why is the sky blue?'}
    ]
)
print(response['message']['content'])

# Streaming
for chunk in ollama.chat(
    model='llama2',
    messages=[{'role': 'user', 'content': 'Tell me a story'}],
    stream=True
):
    print(chunk['message']['content'], end='')

Text Generation Inference (TGI)¶

# Docker, sharded across 2 GPUs
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --num-shard 2 \
    --max-total-tokens 4096

# Or with quantization
docker run --gpus all -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id TheBloke/Llama-2-7B-Chat-GPTQ \
    --quantize gptq

Python Client:

from huggingface_hub import InferenceClient

client = InferenceClient(model="http://localhost:8080")

# Generate
text = client.text_generation(
    prompt="What is the capital of France?",
    max_new_tokens=100
)

# Stream
for token in client.text_generation(
    prompt="Write a poem about AI",
    max_new_tokens=200,
    stream=True
):
    print(token, end='')

llama.cpp - CPU Inference¶

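# Build (recent releases use CMake; older checkouts used `make`)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Download GGUF model
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf

# Run (the binary was named ./main in older releases)
./build/bin/llama-cli -m llama-2-7b-chat.Q4_K_M.gguf -p "What is AI?" -n 128

# API server (named ./server in older releases)
./build/bin/llama-server -m llama-2-7b-chat.Q4_K_M.gguf --port 8080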

Python Binding:

from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="./llama-2-7b-chat.Q4_K_M.gguf",
    n_ctx=2048,  # Context window
    n_threads=8  # CPU cores
)

# Generate
output = llm(
    "Q: What is AI? A:",
    max_tokens=100,
    stop=["Q:", "\n"],
    echo=True
)

print(output['choices'][0]['text'])

BentoML - Production Platform¶

# bentofile.yaml
service: "llm_service:LLMService"
python:
  packages:
    - vllm
    - transformers

# llm_service.py
import bentoml
from vllm import LLM

@bentoml.service
class LLMService:
    def __init__(self):
        self.llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

    @bentoml.api
    def generate(self, prompt: str) -> str:
        outputs = self.llm.generate([prompt])
        return outputs[0].outputs[0].text

# Build
bentoml build

# Serve locally
bentoml serve llm_service:latest

# Deploy to cloud: build an image tagged for your registry, then push it
bentoml containerize llm_service:latest -t myregistry/llm_service:latest
docker push myregistry/llm_service:latest

# Or deploy to BentoCloud
bentoml deploy llm_service:latest

SkyPilot - Multi-Cloud¶

# llama.yaml
resources:
  cloud: aws  # or gcp, azure
  accelerators: A100-80GB:2  # 70B in FP16 needs ~140GB, so use the 80GB variant
  disk_size: 512

setup: |
  pip install vllm

run: |
  python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-chat-hf \
    --tensor-parallel-size 2

# Launch
sky launch llama.yaml

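# Auto-scale: SkyPilot scales replicas through SkyServe. Add a `service:` section
# (readiness probe plus a replica_policy with min_replicas/max_replicas) to llama.yaml, then:
sky serve up llama.yaml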

# Spot instances (cheaper)
sky launch llama.yaml --use-spot

Performance Benchmarks¶

Throughput (Llama 2 7B, A100 GPU)¶

Framework	Tokens/sec	Batch Size	Latency (P50)
vLLM	4,200	128	15ms
TGI	3,800	96	18ms
Triton	3,500	64	20ms
BentoML	3,200	64	22ms
Transformers (baseline)	800	8	80ms

Memory Efficiency (Llama 2 70B)¶

Method	GPU Memory	GPUs Needed	Cost/Hour
FP16 (baseline)	140GB	2× A100 (80GB)	$6.00
FP16 + vLLM (PagedAttention)	~140GB weights, less KV-cache waste	2× A100 (80GB)	$6.00
GPTQ 4-bit	40GB	1× A100 (80GB)	$3.00
AWQ 4-bit	38GB	1× A100 (80GB)	$3.00
llama.cpp Q4_K_M	35GB	1× A100 (80GB)	$3.00

CPU Performance (Llama 2 7B)¶

Quantization	Size	RAM	Tokens/sec (16 cores)
FP16	13GB	16GB	2
Q8	7GB	8GB	5
Q4_K_M	4GB	6GB	10
Q4_K_S	3.5GB	5GB	12
Q3_K_M	3GB	4GB	15

Quantization Guide¶

GGUF Formats (llama.cpp)¶

Q2_K - 2.5-3.0 bpw (bits per weight) - Smallest, lowest quality
Q3_K_S - ~3.5 bpw - Small, low quality
Q3_K_M - ~3.9 bpw - Medium quality
Q4_K_S - ~4.0 bpw - Small, reasonable quality
Q4_K_M - ~4.8 bpw - Recommended (best quality/size)
Q5_K_S - ~5.0 bpw - Large, better quality
Q5_K_M - ~5.5 bpw - Large, very good quality
Q6_K - ~6.0 bpw - Very large, excellent quality
Q8_0 - ~8.5 bpw - Huge, best quality

Recommendation: Start with Q4_K_M, and move up to Q5_K_M if you see quality issues.
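The bits-per-weight figures above translate directly into approximate file sizes. The following is a rough sketch, assuming a round 7 billion parameters and ignoring metadata overhead; the helper name `gguf_size_gb` is made up for this illustration.

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate model file size in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Bits-per-weight values taken from the list above
BPW = {"Q4_K_M": 4.8, "Q5_K_M": 5.5, "Q8_0": 8.5}

for name, bpw in BPW.items():
    # 7e9 is an assumed round parameter count for a "7B" model
    print(f"{name}: ~{gguf_size_gb(7e9, bpw):.1f} GB")
```

The same arithmetic explains why a 70B model in FP16 (16 bits per weight) needs about 140GB for weights alone.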

GPTQ (GPU Quantization)¶

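from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"

# Load GPTQ model (4-bit) and its tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=False,
    revision="main"
)

# Inference
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))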

AWQ (Activation-aware Weight Quantization)¶

from awq import AutoAWQForCausalLM

# Load AWQ model
model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-Chat-AWQ",
    fuse_layers=True
)

# AWQ is often reported to run faster than GPTQ at similar quality;
# benchmark both on your own hardware before choosing

Scaling Patterns¶

1. Vertical Scaling (Bigger GPU)¶

# Single large GPU
Model: Llama 2 70B
GPU: A100 80GB
Quantization: 4-bit GPTQ
Batch Size: 32
Throughput: ~1000 tokens/sec
Cost: $3/hour
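The throughput and hourly cost in the profile above can be combined into a cost per 1k tokens. This is a back-of-the-envelope sketch that assumes the GPU is fully utilized for the whole hour; the function name is hypothetical.

```python
def cost_per_1k_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Serving cost per 1,000 generated tokens at full utilization."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1000


# Figures from the profile above: $3/hour at ~1000 tokens/sec
print(f"${cost_per_1k_tokens(3.0, 1000):.5f} per 1k tokens")
```

Real utilization is usually well below 100%, so the effective cost per token will be higher than this lower bound.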

2. Horizontal Scaling (Multi-GPU)¶

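# vLLM with tensor parallelism: split the model across 4 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-chat-hf \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 1

# Model sharded across GPUs
# Scaling is close to linear at best: expect some loss to inter-GPU communication,
# and keep tensor parallelism within one node (typically up to 8 GPUs)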
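To make the idea of sharding a model across GPUs concrete, here is a toy sketch of column-wise tensor parallelism in pure Python. It is not vLLM's implementation, which combines column- and row-parallel layers with collective communication; the helper names and the tiny matrix are invented for illustration.

```python
def matmul(x, w):
    """Multiply a vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, parts):
    """Split the columns of w into `parts` equal shards, one per device."""
    n = len(w[0])
    if n % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    step = n // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(w, parts=2)
partials = [matmul(x, shard) for shard in shards]   # each "GPU" computes its slice
combined = [v for part in partials for v in part]   # gather the slices back together

assert combined == matmul(x, w)  # sharded result matches the unsharded one
print(combined)
```

Each shard holds half of the weight matrix, which is why per-GPU memory drops as the tensor-parallel size grows, at the price of a gather step per layer.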

3. Load Balancing (Multiple Replicas)¶

# Kubernetes deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-serving
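spec:
  replicas: 4  # 4 instances
  selector:
    matchLabels:
      app: llm-serving
  template:
    metadata:
      labels:
        app: llm-serving  # must match the Service selector
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args: ["--model", "meta-llama/Llama-2-7b-chat-hf"]
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1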

---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  type: LoadBalancer
  selector:
    app: llm-serving
  ports:
  - port: 8000

4. Auto-Scaling¶

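# SkyPilot auto-scale via SkyServe: add a service section to llm.yaml
service:
  readiness_probe: /v1/models
  replica_policy:
    min_replicas: 2
    max_replicas: 10
    target_qps_per_replica: 5      # scales on request rate; value is illustrative
    downscale_delay_seconds: 300   # 5 min cooldown

# Then bring the service up
sky serve up llm.yaml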
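Whatever signal an autoscaler watches, the core decision is a division clamped to the configured bounds. The sketch below assumes a queries-per-second target per replica; the function name, the choice of signal, and the numbers are illustrative and not taken from any particular tool.

```python
import math


def desired_replicas(current_qps, target_qps_per_replica, min_replicas, max_replicas):
    """Replicas needed to keep each one at or below its QPS target."""
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    needed = math.ceil(current_qps / target_qps_per_replica)
    return max(min_replicas, min(max_replicas, needed))


for qps in (0, 23, 500):
    print(qps, "->", desired_replicas(qps, 5, min_replicas=2, max_replicas=10))
```

A cooldown before scaling down avoids thrashing when traffic fluctuates around a threshold.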

API Optimization¶

1. Batching¶

# vLLM automatic batching
# Just send concurrent requests!

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1"
)

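async def generate(prompt):
    response = await client.chat.completions.create(
        model="meta-llama/Llama-2-7b-chat-hf",  # must match the served model name
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    # vLLM batches these in-flight requests on the server side
    return await asyncio.gather(*[
        generate(f"Prompt {i}") for i in range(100)
    ])

# `await` is only valid inside a coroutine, so run everything through asyncio.run
results = asyncio.run(main())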
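When sending many concurrent requests, it can help to cap how many are in flight at once so the client does not overwhelm the server or its own connection pool. This is a minimal sketch using `asyncio.Semaphore`; `fake_generate` is a stand-in that returns a canned string, and the limit of 16 is an arbitrary assumption.

```python
import asyncio


async def fake_generate(prompt: str) -> str:
    """Stand-in for a real API call; swap in an actual client call."""
    await asyncio.sleep(0.01)
    return f"echo: {prompt}"


async def generate_all(prompts, max_in_flight=16):
    semaphore = asyncio.Semaphore(max_in_flight)

    async def bounded(prompt):
        async with semaphore:  # at most max_in_flight calls run at once
            return await fake_generate(prompt)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(p) for p in prompts))


results = asyncio.run(generate_all([f"Prompt {i}" for i in range(100)]))
print(len(results), results[0])
```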

2. Streaming¶

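# Better UX with streaming
from openai import OpenAI

# Synchronous client, so a plain for-loop can iterate over the stream
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

for chunk in client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Write an essay"}],
    stream=True
):
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)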

3. Caching¶

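# Application-level response caching for exact-match prompts
from functools import lru_cache
from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

@lru_cache(maxsize=1000)
def cached_generate(prompt: str) -> str:
    # lru_cache keys on the prompt string itself, so no manual hash is needed
    return llm.generate([prompt])[0].outputs[0].text

result = cached_generate("Explain quantum computing")

# For reuse of shared prompt prefixes inside vLLM itself,
# start the server with --enable-prefix-caching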
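A response cache is only safe if the key covers everything that affects the output, not just the prompt text. The sketch below builds the key from the model name, the prompt, and the sampling parameters; the helper names and the in-memory dictionary are assumptions for illustration, and `fake_generate` stands in for a real model call.

```python
import hashlib
import json

_cache = {}


def cache_key(model, prompt, **sampling_params):
    """Stable key covering model, prompt and sampling parameters."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": sampling_params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(generate_fn, model, prompt, **sampling_params):
    key = cache_key(model, prompt, **sampling_params)
    if key not in _cache:
        _cache[key] = generate_fn(prompt)
    return _cache[key]


calls = []


def fake_generate(prompt):
    """Stand-in for a real model call; records each invocation."""
    calls.append(prompt)
    return prompt.upper()


first = cached_call(fake_generate, "demo-model", "hello", temperature=0.0)
second = cached_call(fake_generate, "demo-model", "hello", temperature=0.0)
third = cached_call(fake_generate, "demo-model", "hello", temperature=0.7)
print(len(calls))  # the second call is served from the cache
```

Caching responses makes most sense with deterministic sampling; with a non-zero temperature, callers may expect a fresh answer each time.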

Monitoring & Debugging¶

vLLM Metrics¶

# Prometheus metrics are exposed by the OpenAI-compatible server by default
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf

# Metrics at http://localhost:8000/metrics

# Key metrics (names vary between vLLM versions):
# - vllm:num_requests_running
# - vllm:gpu_cache_usage_perc
# - vllm:avg_generation_throughput_toks_per_s
# - vllm:time_to_first_token_seconds
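The metrics endpoint returns plain text in the Prometheus exposition format, which is easy to inspect without a full Prometheus setup. The following sketch parses a sample of that format; the sample text, including the label `model_name="demo"` and the help string, is made up, and the parser assumes lines carry no trailing timestamp.

```python
def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition lines into {series: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics


# Hypothetical sample of what the endpoint might return
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.42
"""

parsed = parse_metrics(SAMPLE)
for series, value in parsed.items():
    print(series, value)
```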

Ollama Logs¶

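# View server logs (macOS)
cat ~/.ollama/logs/server.log

# View server logs (Linux, systemd install)
journalctl -u ollama --no-pager

# Debug mode: set the variable on the server process
OLLAMA_DEBUG=1 ollama serve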

Health Checks¶

# vLLM health endpoint
import requests

response = requests.get("http://localhost:8000/health")
assert response.status_code == 200

# Check model is loaded (the id is the name the server was started with)
response = requests.get("http://localhost:8000/v1/models")
models = response.json()
assert "meta-llama/Llama-2-7b-chat-hf" in [m["id"] for m in models["data"]]
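Large models can take minutes to load, so a single health check right after startup tends to fail. Below is a small polling sketch that retries a probe until it succeeds or a deadline passes; the function name and defaults are assumptions, and the probe is injected so the example runs without a server.

```python
import time


def wait_until_ready(probe, timeout_s=60.0, interval_s=1.0):
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # e.g. connection refused while the server is still starting
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Fake probe that fails twice before succeeding
attempts = iter([False, False, True])
ready = wait_until_ready(lambda: next(attempts), timeout_s=5, interval_s=0)
print(ready)
```

In practice the probe would wrap a request to the health endpoint and return whether the status code was 200.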

Cost Optimization Strategies¶

1. Right-Sizing¶

Don't use Llama 2 70B when 7B suffices!

Task: Simple Q&A
❌ Llama 2 70B: $0.01/1k tokens
✅ Llama 2 7B: $0.002/1k tokens
Savings: 80%

Task: Complex reasoning
✅ Llama 2 70B or Claude Opus

2. Spot Instances¶

# SkyPilot with spot
sky launch llm.yaml --use-spot

# AWS A100: $3/hr → $0.90/hr (70% savings)
# Automatic failover if preempted

3. Batch Processing¶

# Don't: 1000 sequential requests
for prompt in prompts:
    generate(prompt)  # $10, 10 minutes

# Do: Single batched request
generate_batch(prompts)  # $3, 2 minutes
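Batching usually means splitting a long list of prompts into fixed-size groups and submitting each group in one call. A minimal chunking helper might look like this; the name `chunked` and the batch size are illustrative.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


prompts = [f"Prompt {i}" for i in range(10)]
batches = list(chunked(prompts, 4))
print([len(b) for b in batches])  # [4, 4, 2]
```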

4. Quantization¶

Llama 2 70B FP16: 2× A100 80GB = $6/hour
Llama 2 70B GPTQ 4-bit: 1× A100 80GB = $3/hour

Savings: 50% with minimal quality loss

Best Practices¶

1. Model Selection¶

# Decision tree
def pick_model(task: str) -> str:
    if task == "simple_classification":
        return "distilbert-base"  # Small, fast, cheap
    elif task == "general_chat":
        return "llama-2-7b"  # Good quality, reasonable cost
    elif task == "complex_reasoning":
        return "llama-2-70b"  # or Claude Opus via API
    elif task == "code_generation":
        return "codellama-13b"  # Specialized
    raise ValueError(f"Unknown task: {task}")

2. Context Management¶

# Don't: Send entire chat history every time
messages = all_previous_messages + [new_message]  # ❌ Expensive

# Do: Sliding window or summarization
messages = recent_messages[-5:] + [new_message]  # ✅ Efficient
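A fixed message count ignores how long each message is. One alternative is to trim by a token budget while always keeping the system prompt. The sketch below does that with a crude whitespace word count standing in for a real tokenizer; the function name and the budget are assumptions.

```python
def trim_history(messages, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the system message (if any) plus the newest messages that fit."""
    system = [m for m in messages if m["role"] == "system"][:1]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count_tokens(m["content"]) for m in system)

    kept = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = count_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return system + kept[::-1]  # restore chronological order


history = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "first question here"},
    {"role": "assistant", "content": "first answer here"},
    {"role": "user", "content": "second question"},
]

trimmed = trim_history(history, max_tokens=8)
print([m["content"] for m in trimmed])
```

For production use, the word count should be replaced with the tokenizer of the model being served, since token counts differ from word counts.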

3. Error Handling¶

import logging

from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
def robust_generate(prompt):
    # llm is assumed to be an already-initialized client, e.g. a vllm.LLM instance
    try:
        return llm.generate(prompt)
    except Exception as e:
        logger.error(f"Generation failed: {e}")
        raise
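For readers who prefer not to add a dependency, the retry idea can be sketched with the standard library alone. This is a simplified illustration and does not reproduce tenacity's exact wait schedule; the function name, defaults, and the `flaky` test function are all hypothetical.

```python
import time


def retry_with_backoff(fn, attempts=3, base_delay=1.0, max_delay=10.0, sleep=time.sleep):
    """Call fn(), retrying on any exception with doubling delays."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))


calls = {"n": 0}


def flaky():
    """Fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


delays = []
result = retry_with_backoff(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

Injecting the `sleep` function keeps the example fast and makes the backoff schedule easy to inspect.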

Related Resources¶

Infrastructure - Training and optimization
Observability - Monitor serving metrics
Production AI - Production patterns
Vector Databases - RAG deployments
