⚙️ Infrastructure¶

Training, inference optimization, and systems for scaling AI workloads.

📋 Table of Contents¶

Overview
Tools List
Selection Guide
Quick Start

Overview¶

Infrastructure tools power the training and serving of modern AI models. This collection includes frameworks for: - Training: Distributed training, optimization, fine-tuning - Inference: Efficient model serving, quantization - Scalability: Multi-GPU, distributed systems

Tools List¶

Repo	Description	Stars
pytorch/pytorch	Deep learning framework powering most modern LLMs
Lightning-AI/pytorch-lightning	High-level PyTorch training framework
ggerganov/llama.cpp	LLM inference in pure C/C++ for CPU and edge devices
microsoft/DeepSpeed	Distributed training system for trillion-parameter models
unslothai/unsloth	2-5x faster LLM fine-tuning with 80% less memory
huggingface/trl	Transformer reinforcement learning for RLHF
vllm-project/vllm	High-throughput and memory-efficient LLM serving

Selection Guide¶

By Use Case¶

🎯 Training From Scratch - PyTorch - Industry standard, best ecosystem - Lightning - High-level abstraction, best practices built-in - DeepSpeed - For massive models (>100B parameters)

⚡ Fine-Tuning - Unsloth - Fastest, most memory-efficient (2-5x speedup) - TRL - For RLHF and preference learning - Lightning - For production-grade fine-tuning pipelines

🚀 Inference & Serving - vLLM - Best throughput, production-grade - llama.cpp - CPU/edge deployment, quantization - Lightning - End-to-end deployment

📊 Distributed Training - DeepSpeed - Largest models, ZeRO optimization - Lightning - Multi-GPU, multi-node made easy - PyTorch DDP - Built-in distributed training

Quick Start¶

Training with PyTorch Lightning¶

import pytorch_lightning as pl
from transformers import AutoModelForCausalLM

class LLMFineTuner(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = AutoModelForCausalLM.from_pretrained("model-name")

    def training_step(self, batch, batch_idx):
        outputs = self.model(**batch)
        return outputs.loss

trainer = pl.Trainer(accelerator="gpu", devices=4)
trainer.fit(model)

Fine-Tuning with Unsloth¶

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# 2-5x faster training!
trainer = SFTTrainer(model=model, ...)
trainer.train()

Inference with vLLM¶

pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --tensor-parallel-size 4

CPU Inference with llama.cpp¶

# Build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Run
./main -m models/llama-2-7b.gguf -p "Hello, world!"

Performance Comparison¶

Training Speed (Llama 7B, Single GPU)¶

Framework	Tokens/sec	Memory (GB)
PyTorch (baseline)	1000	24
Lightning	1050	23
Unsloth	2500	12
DeepSpeed ZeRO-3	1200	16

Inference Throughput (Llama 7B)¶

Framework	Tokens/sec	Latency (ms)
Transformers	50	200
vLLM	250	40
llama.cpp (CPU)	30	333
llama.cpp (Metal)	120	83

Foundation Models - Choose a model to train
Deployment & Serving - Production deployment
Observability - Monitor training runs

← Back to Home | Next: Agents & Orchestration →