⚙️ Infrastructure
Training, inference optimization, and systems for scaling AI workloads.
Overview
Infrastructure tools power the training and serving of modern AI models. This collection includes frameworks for:
- Training: Distributed training, optimization, fine-tuning
- Inference: Efficient model serving, quantization
- Scalability: Multi-GPU, distributed systems
Tools List
| Repo | Description |
|---|---|
| pytorch/pytorch | Deep learning framework powering most modern LLMs |
| Lightning-AI/pytorch-lightning | High-level PyTorch training framework |
| ggerganov/llama.cpp | LLM inference in pure C/C++ for CPU and edge devices |
| microsoft/DeepSpeed | Distributed training system for trillion-parameter models |
| unslothai/unsloth | 2-5x faster LLM fine-tuning with 80% less memory |
| huggingface/trl | Transformer reinforcement learning for RLHF |
| vllm-project/vllm | High-throughput and memory-efficient LLM serving |
Selection Guide
By Use Case

🎯 Training From Scratch
- PyTorch - Industry standard, best ecosystem
- Lightning - High-level abstraction, best practices built-in
- DeepSpeed - For massive models (>100B parameters)

⚡ Fine-Tuning
- Unsloth - Fastest, most memory-efficient (2-5x speedup)
- TRL - For RLHF and preference learning
- Lightning - For production-grade fine-tuning pipelines

🚀 Inference & Serving
- vLLM - Best throughput, production-grade
- llama.cpp - CPU/edge deployment, quantization
- Lightning - End-to-end deployment

📊 Distributed Training
- DeepSpeed - Largest models, ZeRO optimization
- Lightning - Multi-GPU, multi-node made easy
- PyTorch DDP - Built-in distributed training
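
The ZeRO optimization listed under Distributed Training saves memory by partitioning model state across data-parallel GPUs instead of replicating it. The sketch below is a back-of-the-envelope estimate, not DeepSpeed code. It assumes mixed-precision Adam at 16 bytes of model state per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state) and ignores activations, buffers, and fragmentation. The function name and the 7B / 8-GPU example are illustrative.

```python
def zero_model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU model-state memory in GB (1 GB = 1e9 bytes)."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")

    params_bytes = 2.0 * num_params   # fp16 weights
    grads_bytes = 2.0 * num_params    # fp16 gradients
    optim_bytes = 12.0 * num_params   # fp32 weight copy + Adam momentum + variance

    if stage >= 1:                    # stage 1 partitions optimizer state
        optim_bytes /= num_gpus
    if stage >= 2:                    # stage 2 also partitions gradients
        grads_bytes /= num_gpus
    if stage >= 3:                    # stage 3 also partitions the weights
        params_bytes /= num_gpus

    return (params_bytes + grads_bytes + optim_bytes) / 1e9


# Illustrative case: a 7B-parameter model on 8 GPUs
for stage in range(4):
    print(f"ZeRO stage {stage}: {zero_model_state_gb(7e9, 8, stage):.2f} GB per GPU")
```

Under these assumptions, stage 3 divides the full 16 bytes per parameter by the number of GPUs, which is why it suits the largest models.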
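Quick Start
Training with PyTorch Lightning

    import pytorch_lightning as pl
    import torch
    from transformers import AutoModelForCausalLM

    class LLMFineTuner(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.model = AutoModelForCausalLM.from_pretrained("model-name")

        def training_step(self, batch, batch_idx):
            # batch must contain `labels` so the model returns a loss
            outputs = self.model(**batch)
            return outputs.loss

        def configure_optimizers(self):
            # Lightning needs an optimizer; this learning rate is illustrative
            return torch.optim.AdamW(self.parameters(), lr=2e-5)

    model = LLMFineTuner()
    trainer = pl.Trainer(accelerator="gpu", devices=4)
    # train_dataloader is a hypothetical DataLoader of tokenized batches
    trainer.fit(model, train_dataloader)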
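Fine-Tuning with Unsloth

    from unsloth import FastLanguageModel
    from trl import SFTTrainer

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/llama-3-8b",
        max_seq_length = 2048,
        load_in_4bit = True,
    )
    # Attach LoRA adapters; a 4-bit base model cannot be trained directly
    model = FastLanguageModel.get_peft_model(model, r = 16)

    # Unsloth reports 2-5x faster training
    trainer = SFTTrainer(model=model, ...)
    trainer.train()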
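
A rough weight-memory estimate shows what `load_in_4bit = True` buys. This sketch assumes an 8-billion-parameter model (inferred from the `llama-3-8b` name) and counts only the bytes needed to store the weights. Real 4-bit formats add some overhead for scaling factors, and training also needs memory for activations, adapters, and optimizer state, so treat the numbers as lower bounds.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: parameter count taken from the model name

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Going from 16-bit to 4-bit weights cuts weight storage by a factor of four, which is what lets a model of this size fit on a single consumer GPU.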
Inference with vLLM

    pip install vllm
    # Start an OpenAI-compatible server; --tensor-parallel-size 4 shards the model across 4 GPUs
    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-2-7b-hf \
        --tensor-parallel-size 4
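
Once the server is running, it can be queried over HTTP through its OpenAI-style completions endpoint. The sketch below uses only the Python standard library. It assumes the server listens on `localhost:8000` (vLLM's default port) and serves the model named in the command above; the helper names and sampling values are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # assumption: default host and port
MODEL = "meta-llama/Llama-2-7b-hf"      # must match the --model the server was started with


def build_completion_request(prompt: str, max_tokens: int = 64,
                             temperature: float = 0.7) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text (requires a running server)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server up, `complete("The capital of France is")` returns the generated continuation as a string.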
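CPU Inference with llama.cpp

    # Build (recent versions use CMake; older checkouts built with `make`)
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build && cmake --build build --config Release
    # Run (the CLI binary was named ./main in older versions)
    ./build/bin/llama-cli -m models/llama-2-7b.gguf -p "Hello, world!"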
Performance Comparison
Training Speed (Llama 7B, Single GPU)
| Framework | Tokens/sec | Memory (GB) |
|---|---|---|
| PyTorch (baseline) | 1000 | 24 |
| Lightning | 1050 | 23 |
| Unsloth | 2500 | 12 |
| DeepSpeed ZeRO-3 | 1200 | 16 |
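
The table is easier to compare as ratios against the PyTorch baseline. The sketch below only recomputes speedup and memory savings from the figures listed in the table; it adds no new measurements, and the dictionary layout and function name are illustrative.

```python
# (tokens/sec, memory in GB), copied from the table above
RESULTS = {
    "PyTorch (baseline)": (1000, 24),
    "Lightning": (1050, 23),
    "Unsloth": (2500, 12),
    "DeepSpeed ZeRO-3": (1200, 16),
}


def relative_to_baseline(results, baseline="PyTorch (baseline)"):
    """Return {name: (speedup factor, fraction of memory saved)} versus the baseline row."""
    base_tps, base_mem = results[baseline]
    return {
        name: (tps / base_tps, 1 - mem / base_mem)
        for name, (tps, mem) in results.items()
    }


for name, (speedup, mem_saved) in relative_to_baseline(RESULTS).items():
    print(f"{name:<20} {speedup:.2f}x speed, {mem_saved:.0%} less memory")
```

By these figures, Unsloth runs at 2.5x the baseline speed with 50% less memory, and DeepSpeed ZeRO-3 at 1.2x with about a third less memory.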
Inference Throughput (Llama 7B)
| Framework | Tokens/sec | Latency (ms) |
|---|---|---|
| Transformers | 50 | 200 |
| vLLM | 250 | 40 |
| llama.cpp (CPU) | 30 | 333 |
| llama.cpp (Metal) | 120 | 83 |
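
Throughput and latency in this table are linked: for a single stream, generating n tokens takes roughly n divided by the tokens-per-second rate. The sketch below checks the listed latencies against that relationship. Reading the latency column as the time to generate 10 tokens is an inference from the numbers, not something the table states, and the names used here are illustrative.

```python
# (tokens/sec, listed latency in ms), copied from the table above
TABLE = {
    "Transformers": (50, 200),
    "vLLM": (250, 40),
    "llama.cpp (CPU)": (30, 333),
    "llama.cpp (Metal)": (120, 83),
}


def generation_time_ms(tokens_per_sec: float, num_tokens: int = 1) -> float:
    """Time in milliseconds to generate num_tokens at a steady rate."""
    return 1000.0 * num_tokens / tokens_per_sec


for name, (tps, listed_ms) in TABLE.items():
    per_token = generation_time_ms(tps)
    ten_tokens = generation_time_ms(tps, num_tokens=10)
    print(f"{name:<18} {per_token:6.1f} ms/token, "
          f"{ten_tokens:6.1f} ms per 10 tokens (listed: {listed_ms} ms)")
```

Each listed latency matches the 10-token figure after rounding, so per-token latency would be one tenth of the value shown.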
Related Resources
- Foundation Models - Choose a model to train
- Deployment & Serving - Production deployment
- Observability - Monitor training runs