📦 Datasets & Data Tools¶
Tools for creating, managing, and curating datasets for LLM training, fine-tuning, and evaluation.
📋 Table of Contents¶
Overview¶
Dataset tools enable: - Synthetic Data Generation: Create training data with LLMs - Data Annotation: Label and curate datasets - Dataset Management: Version, store, and share datasets - Data Quality: Cleaning, deduplication, filtering
Tools List¶
| Repo | Description | Stars |
|---|---|---|
| argilla-io/argilla | Collaboration platform for AI engineers and domain experts | |
| confident-ai/deepeval | LLM evaluation framework with synthetic data generation | |
| gretelai/gretel-synthetics | Generate synthetic datasets with privacy guarantees | |
| microsoft/promptbase | Manage and version prompts and datasets | |
| huggingface/datasets | Fast, efficient dataset library for ML | |
| cleanlab/cleanlab | Find and fix label errors in datasets | |
| bytewax/bytewax | Python stream processing for real-time data |
Selection Guide¶
By Use Case¶
🏷️ Data Annotation - Argilla - Best UI, team collaboration - Label Studio (not listed - proprietary focus) - Prodigy (Spacy ecosystem)
🤖 Synthetic Data - DeepEval - LLM-specific test cases - Gretel Synthetics - Privacy-preserving - GPT-4/Claude - Custom generation scripts
📊 Dataset Management - HuggingFace Datasets - Largest ecosystem - PromptBase - Prompt versioning - DVC - Version control for data
🔍 Data Quality - Cleanlab - Find label errors - Great Expectations - Data validation - Argilla - Human-in-the-loop curation
By Team Size¶
Solo Developer - HuggingFace Datasets - Ready-to-use - DeepEval - Quick synthetic data - Cleanlab - Automated quality
Small Team (2-10) - Argilla - Collaborative annotation - PromptBase - Shared prompts - Gretel - Synthetic data at scale
Enterprise - Argilla Enterprise - Advanced features - HuggingFace Hub - Private datasets - Custom infrastructure - Full control
Quick Start¶
Argilla - Annotation Platform¶
# Install
pip install argilla
# Run server (Docker)
docker run -d --name argilla -p 6900:6900 argilla/argilla-quickstart:latest
# Or use cloud: https://argilla.io
import argilla as rg
# Initialize
rg.init(api_url="http://localhost:6900", api_key="admin.apikey")
# Create dataset for text classification
dataset = rg.FeedbackDataset(
fields=[
rg.TextField(name="text"),
],
questions=[
rg.LabelQuestion(
name="sentiment",
labels=["positive", "negative", "neutral"]
),
]
)
# Add records
dataset.add_records([
rg.FeedbackRecord(
fields={"text": "This product is amazing!"}
),
rg.FeedbackRecord(
fields={"text": "Terrible experience."}
)
])
# Push to Argilla
dataset.push_to_argilla("sentiment-analysis")
# Team annotates via UI
# Then retrieve annotated data
annotated = rg.load("sentiment-analysis")
DeepEval - Synthetic Test Data¶
from deepeval.synthesizer import Synthesizer
from deepeval.dataset import Golden
# Initialize
synthesizer = Synthesizer(
model="gpt-4"
)
# Generate synthetic test cases
goldens = synthesizer.generate_goldens_from_docs(
document_paths=["./docs"],
num_goldens=50,
include_expected_output=True
)
# Each golden contains:
# - input: Test question
# - expected_output: Ground truth answer
# - context: Source document chunks
# Use for evaluation
from deepeval.metrics import AnswerRelevancy
metric = AnswerRelevancy()
for golden in goldens:
score = metric.measure(
output=model_response,
expected_output=golden.expected_output
)
Gretel Synthetics - Privacy-Safe Data¶
from gretel_synthetics.batch import DataFrameBatch
import pandas as pd
# Load sensitive data
df = pd.read_csv("customer_data.csv")
# Configure synthetic model
config = {
"epochs": 100,
"learning_rate": 0.001,
"vocab_size": 20000,
"privacy_filters": True, # Remove PII
"differential_privacy": {
"enabled": True,
"epsilon": 8.0 # Privacy budget
}
}
# Train and generate
batch = DataFrameBatch(df, config)
batch.train()
# Generate synthetic data
synthetic_df = batch.generate(num_records=10000)
# Synthetic data:
# - Maintains statistical properties
# - Removes PII
# - Differential privacy guarantees
HuggingFace Datasets¶
from datasets import load_dataset, Dataset
# Load popular datasets
squad = load_dataset("squad")
glue = load_dataset("glue", "mrpc")
# Load your own
dataset = Dataset.from_dict({
"text": ["Example 1", "Example 2"],
"label": [0, 1]
})
# Preprocess
def tokenize(examples):
return tokenizer(examples["text"], padding=True, truncation=True)
tokenized = dataset.map(tokenize, batched=True)
# Save and share
dataset.push_to_hub("username/dataset-name")
# Version control
dataset.save_to_disk("./my_dataset_v1")
Cleanlab - Find Label Errors¶
from cleanlab.classification import CleanLearning
from sklearn.linear_model import LogisticRegression
# Your noisy labeled data
X_train, y_train = load_data()
# Train with automatic error detection
clf = CleanLearning(LogisticRegression())
clf.fit(X_train, y_train)
# Find label errors
label_errors = clf.get_label_issues()
# Review and fix
for idx in label_errors:
print(f"Sample {idx}:")
print(f" Data: {X_train[idx]}")
print(f" Given label: {y_train[idx]}")
print(f" Suggested: {clf.predict([X_train[idx]])[0]}")
# For LLMs - use embeddings
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(texts)
clf = CleanLearning(LogisticRegression())
clf.fit(embeddings, labels)
errors = clf.get_label_issues()
Synthetic Data Generation Patterns¶
1. Question-Answer Pairs¶
from openai import OpenAI
client = OpenAI()
def generate_qa_pairs(document, num_pairs=10):
prompt = f"""Generate {num_pairs} question-answer pairs from this document:
{document}
Format:
Q: <question>
A: <answer>
"""
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}]
)
# Parse and structure
qa_pairs = parse_qa_response(response.choices[0].message.content)
return qa_pairs
# Use for RAG evaluation
qa_dataset = []
for doc in documents:
qa_dataset.extend(generate_qa_pairs(doc))
2. Few-Shot Examples¶
def generate_training_examples(task_description, seed_examples, num_examples=100):
prompt = f"""Task: {task_description}
Examples:
{format_examples(seed_examples)}
Generate {num_examples} more examples following the same pattern."""
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=0.8 # Higher for diversity
)
return parse_examples(response.choices[0].message.content)
# Example: Sentiment analysis
examples = generate_training_examples(
task_description="Classify sentiment as positive, negative, or neutral",
seed_examples=[
{"text": "I love this!", "label": "positive"},
{"text": "This is terrible", "label": "negative"},
],
num_examples=1000
)
3. Data Augmentation¶
def augment_text(text, methods=["paraphrase", "backtranslation"]):
augmented = []
if "paraphrase" in methods:
prompt = f"Paraphrase this 5 different ways:\n{text}"
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": prompt}]
)
augmented.extend(parse_paraphrases(response.choices[0].message.content))
if "backtranslation" in methods:
# Translate to French and back to English
french = translate(text, "fr")
english = translate(french, "en")
augmented.append(english)
return augmented
# Expand small dataset
expanded_dataset = []
for text, label in original_dataset:
expanded_dataset.append((text, label))
for aug_text in augment_text(text):
expanded_dataset.append((aug_text, label))
4. Adversarial Examples¶
def generate_adversarial(text, label, model):
prompt = f"""Generate text that:
1. Should be labeled as: {label}
2. But is likely to fool this classifier: {model.description}
3. Is similar to: {text}
Generate 5 challenging examples."""
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}]
)
return parse_examples(response.choices[0].message.content)
# Use to find model weaknesses
adversarial_set = []
for text, label in test_set:
adversarial_set.extend(generate_adversarial(text, label, classifier))
Popular Public Datasets¶
LLM Training & Fine-Tuning¶
| Dataset | Description | Size | License |
|---|---|---|---|
| C4 | Common Crawl cleaned | 750GB | ODC-BY |
| The Pile | Diverse text dataset | 825GB | MIT |
| RedPajama | LLaMA recreation | 1.2TB | Apache 2.0 |
| ROOTS | Multilingual (59 langs) | 1.6TB | Various |
| Dolma | Open pre-training corpus | 3TB | ODC-BY |
Instruction Tuning¶
| Dataset | Examples | Type | License |
|---|---|---|---|
| Alpaca | 52K | Instruction-following | CC BY-NC 4.0 |
| ShareGPT | 90K | Conversations | Non-commercial |
| FLAN | 1.8M | Multi-task | Apache 2.0 |
| OpenOrca | 4.2M | GPT-4 augmented | MIT |
| UltraChat | 1.5M | Multi-turn dialogues | MIT |
Evaluation¶
| Benchmark | Tasks | Focus | Link |
|---|---|---|---|
| MMLU | 57 | Knowledge | HF: hendrycks/mmlu |
| BBH | 23 | Reasoning | HF: lukaemon/bbh |
| TruthfulQA | 817 | Truthfulness | HF: truthful_qa |
| HumanEval | 164 | Coding | HF: openai_humaneval |
| GSM8K | 8.5K | Math | HF: gsm8k |
RAG & Knowledge¶
| Dataset | Documents | Domain | Link |
|---|---|---|---|
| Wikipedia | 6.4M | General | HF: wikipedia |
| arXiv | 2.3M | Academic | Kaggle |
| PubMed | 35M | Medical | NIH |
| GitHub Code | 115M | Programming | Google BigQuery |
Data Quality Checklist¶
1. Deduplication¶
from datasets import load_dataset
dataset = load_dataset("your_dataset")
# Exact duplicates
deduplicated = dataset.unique("text")
# Near duplicates (using MinHash)
from datasketch import MinHash, MinHashLSH
lsh = MinHashLSH(threshold=0.9, num_perm=128)
# ... implement near-duplicate detection
2. PII Removal¶
import presidio_analyzer
import presidio_anonymizer
analyzer = presidio_analyzer.AnalyzerEngine()
anonymizer = presidio_anonymizer.AnonymizerEngine()
def remove_pii(text):
results = analyzer.analyze(text, language='en')
anonymized = anonymizer.anonymize(text, results)
return anonymized.text
# Apply to dataset
dataset = dataset.map(lambda x: {"text": remove_pii(x["text"])})
3. Quality Filtering¶
def quality_filter(example):
text = example["text"]
# Minimum length
if len(text) < 100:
return False
# Language detection
if detect_language(text) != "en":
return False
# No excessive repetition
if has_repetition(text, threshold=0.3):
return False
# Toxicity check
if is_toxic(text):
return False
return True
filtered_dataset = dataset.filter(quality_filter)
4. Balance Check¶
from collections import Counter
# Check label distribution
label_counts = Counter(dataset["label"])
print(label_counts)
# Rebalance if needed
from imblearn.over_sampling import SMOTE
# For classification datasets
X_resampled, y_resampled = SMOTE().fit_resample(X, y)
Dataset Versioning¶
DVC (Data Version Control)¶
# Install
pip install dvc
# Initialize
dvc init
# Track dataset
dvc add data/training_set.csv
# Commit
git add data/training_set.csv.dvc .gitignore
git commit -m "Add training dataset v1"
# Store remotely
dvc remote add -d storage s3://mybucket/dvc-storage
dvc push
# Later: Retrieve specific version
git checkout v1.0
dvc pull
HuggingFace Hub Versioning¶
from datasets import load_dataset
# Upload with version tag
dataset.push_to_hub(
"username/dataset-name",
commit_message="v2.0: Added 10k examples",
revision="v2.0"
)
# Load specific version
dataset_v1 = load_dataset("username/dataset-name", revision="v1.0")
dataset_v2 = load_dataset("username/dataset-name", revision="v2.0")
Annotation Workflows¶
Active Learning with Argilla¶
import argilla as rg
from transformers import pipeline
# Initial model
classifier = pipeline("text-classification", model="distilbert-base")
# Predict on unlabeled pool
predictions = []
for text in unlabeled_pool:
pred = classifier(text)[0]
predictions.append({
"text": text,
"prediction": pred["label"],
"confidence": pred["score"]
})
# Send low-confidence samples to Argilla
uncertain = [p for p in predictions if p["confidence"] < 0.6]
dataset = rg.FeedbackDataset.from_argilla("annotation-queue")
dataset.add_records([
rg.FeedbackRecord(fields={"text": item["text"]})
for item in uncertain
])
dataset.push_to_argilla("annotation-queue")
# Humans annotate uncertain cases
# Then retrain model with new labels
Cost Optimization¶
Synthetic Data vs Human Annotation¶
| Method | Cost per Example | Quality | Scale |
|---|---|---|---|
| GPT-4 Generation | $0.02 | Good | Unlimited |
| GPT-3.5 Generation | $0.001 | Moderate | Unlimited |
| Expert Annotation | \(5-\)20 | Excellent | Limited |
| Crowd Annotation | \(0.10-\)0.50 | Variable | High |
| Hybrid (Synthetic + Review) | $0.20 | Very Good | High |
Recommendation: Generate with GPT-4, review with humans (10% sample)
Best Practices¶
1. Diverse Data Sources¶
# Mix multiple sources
combined = concatenate_datasets([
web_crawl,
books,
academic_papers,
conversations,
code_repos
])
2. Held-Out Test Sets¶
# Never train on test data!
train_test_split = dataset.train_test_split(test_size=0.1, seed=42)
train = train_test_split["train"]
test = train_test_split["test"]
# Save test set separately
test.save_to_disk("./test_set_final")
# Never modify after creation!
3. Document Everything¶
# dataset_card.yaml
dataset_name: "My Dataset"
version: "1.0.0"
creation_date: "2026-04-17"
size: "100k examples"
languages: ["en"]
license: "MIT"
sources:
- "Web crawl (30%)"
- "Synthetic GPT-4 (50%)"
- "Human annotations (20%)"
preprocessing:
- "Deduplication"
- "PII removal"
- "Quality filtering"
splits:
train: 80000
validation: 10000
test: 10000
Related Resources¶
- Evaluation & Testing - Test with your datasets
- Observability - Monitor data quality
- Agents & Orchestration - Automate data pipelines
- Production AI - Deploy data pipelines