Generate scaled synthetic datasets for RAG evaluation
The Problem with RAG Evaluation
- Data pollution: Benchmark datasets are already in the training data. Foundation models have seen MS MARCO and BeIR. You’re not testing retrieval, you’re testing memorization.
- High-fidelity filtering: Production RAG needs complex metadata filters: date ranges, nested categories, numerical thresholds. Existing datasets give you a category field and maybe some tags.
So I built this. Generate complete RAG evaluation datasets from a single text prompt. Fresh synthetic data at any scale you need.
This lets you test what actually matters:
- → RAG systems without training data contamination
- → How vector databases handle complex filters
- → Pre-filter vs post-filter performance
- → Retrieval quality degradation with corpus size
- → Metadata selectivity edge cases
Generate a Dataset
$ dataset-factory generate \
--prompt "A gold rush town in the Yukon during the 1890s" \
--documents 1000 \
--queries 100 \
--output output/goldrush
1M+ documents supported
7-25 document types
$0.46 per 1K docs (Llama 3.1 8B via Groq)
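The output streams to JSONL (see Features below), so a generated dataset can be inspected with a few lines of Python. A minimal sketch, assuming the output directory contains files along the lines of documents.jsonl and queries.jsonl; those file names are illustrative, so check the actual layout the tool writes:

import json
from pathlib import Path

def read_jsonl(path):
    # One record per line; JSONL keeps memory flat even for huge corpora.
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

out = Path("output/goldrush")
documents = list(read_jsonl(out / "documents.jsonl"))  # illustrative file name
queries = list(read_jsonl(out / "queries.jsonl"))      # illustrative file name
print(len(documents), "documents,", len(queries), "queries")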
How It Works
1. Config Generation
An LLM analyzes your prompt and creates a schema: document types, metadata fields, value distributions.
2. World Building
Generates 2,000 words of domain context: history, entities, terminology, relationships. This shared context is used when generating every document.
3. Document Generation
Each document gets metadata sampled at random from the config, then the LLM generates unique content. No templates. Five prompt variations keep the writing diverse.
4. Query Generation
Analyzes dataset statistics, picks a filter selectivity, and generates queries from actual document content, so ground truth is known by construction (a hypothetical record is sketched below).
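Concretely, each generated query ends up paired with its filter and the IDs of the documents known to satisfy it. A hypothetical record, with illustrative field names rather than the tool's exact schema:

query = {
    "query": "Assay reports filed on Bonanza Creek claims after June 1898",
    "filters": {
        "doc_type": "assay_report",            # categorical
        "location": "Bonanza Creek",           # categorical / hierarchical
        "filed_date": {"gte": "1898-06-01"},   # temporal range
    },
    "ground_truth_ids": ["doc_00142", "doc_00387"],  # documents known to match
    "selectivity": 0.002,                      # fraction of the corpus the filter hits
}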
Features
Variable Length
400-40,000 token documents. Short reports to comprehensive audits. Realistic length distributions.
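A quick sanity check on the lengths of a generated corpus, reusing the documents list from the loading sketch above; whitespace tokens are a cheap proxy for real tokens, and the text field name is an assumption:

lengths = sorted(len(doc["text"].split()) for doc in documents)
print("shortest:", lengths[0],
      "median:", lengths[len(lengths) // 2],
      "longest:", lengths[-1])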
Rich Metadata
Temporal, categorical, numerical, hierarchical fields. Zipfian and uniform distributions. Precise filter control.
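The distinction matters for filter testing: a Zipfian field concentrates mass on a few values, so some filters match almost everything and others almost nothing. A minimal sketch of the sampling idea, with made-up category values, not the tool's internal sampler:

import random

doc_types = ["claim_filing", "assay_report", "supply_ledger", "saloon_incident"]

# Uniform: every value equally likely.
uniform_pick = random.choice(doc_types)

# Zipfian: weight by 1/rank, so the head value dominates and the tail is rare.
weights = [1 / rank for rank in range(1, len(doc_types) + 1)]
zipf_pick = random.choices(doc_types, weights=weights, k=1)[0]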
Selective Queries
Control selectivity from 0.1% (ultra-specific) to 10%+ (broad). Test pre-filter vs post-filter performance.
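Selectivity is just the fraction of the corpus a filter matches, which you can verify directly against the generated documents (reusing the documents list from the loading sketch; the metadata field names are illustrative):

def selectivity(documents, predicate):
    # Fraction of the corpus a metadata filter actually matches.
    matches = sum(1 for doc in documents if predicate(doc["metadata"]))
    return matches / len(documents)

# A narrow category + year filter should land near the ultra-specific end.
sel = selectivity(documents,
                  lambda m: m.get("doc_type") == "assay_report" and m.get("year") == 1898)
print(f"{sel:.1%} of the corpus matches")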
Cost Tracking
Real-time cost monitoring per phase. Detailed breakdowns. Works across resume sessions.
Resumable
Pause and resume at any time. Streams to JSONL. Memory efficient at any scale.
Multi-LLM
Groq, Gemini, OpenAI, Anthropic. Smart rate limiting. Auto-concurrency adjustment.
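The concurrency idea in isolation, as a hedged sketch rather than the tool's internals: cap in-flight LLM calls with a semaphore so a provider's rate limit isn't blown through; call_llm stands in for whatever async client you use.

import asyncio

async def generate_all(prompts, call_llm, max_in_flight=8):
    # Bound concurrency so requests queue up instead of tripping provider rate limits.
    sem = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with sem:
            return await call_llm(prompt)

    return await asyncio.gather(*(one(p) for p in prompts))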
Risks
Using an LLM to generate eval data for LLM systems is weird. Hallucination is the goal here, which feels like an anti-pattern.
Biggest risk: Similar documents. Even with high temperature and prompt variations, you can end up with semantically near-identical documents: a hundred prospector journals that all sound the same. (A rough audit for this is sketched at the end of this section.)
Internal consistency: No guarantee the LLM maintains coherent facts across thousands of documents. It might contradict itself.
But: As long as you use the same dataset to compare multiple systems, the comparison is still fair. Weird artifacts affect all systems equally. You’re measuring relative performance, not absolute quality on some perfect benchmark.
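A rough audit for the near-duplicate risk, assuming the documents list from the loading sketch and a text field on each record (an assumption). TF-IDF cosine is a crude proxy for semantic similarity, but it catches the worst offenders; at large scale, audit a random sample instead of the full pairwise matrix:

from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [doc["text"] for doc in documents]
sims = cosine_similarity(TfidfVectorizer(max_features=5000).fit_transform(texts))

# Flag suspiciously similar pairs; the 0.9 threshold is a judgment call.
dupes = [(i, j) for i, j in combinations(range(len(texts)), 2) if sims[i, j] > 0.9]
print(f"{len(dupes)} near-duplicate pairs above 0.9 similarity")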
Example Datasets
$ --prompt "Yukon gold rush town during the 1890s"
$ --prompt "Dystopian tech megacorp with surveillance and AI incidents"
$ --prompt "Biomedical research papers and clinical trials"
$ --prompt "Legal contracts and case law from various jurisdictions"
$ --prompt "Product listings with reviews and specifications"
Get Started
$ uv pip install dataset-factory
$ echo "GROQ_API_KEY=your_key" > .env
$
$ dataset-factory generate \
--prompt "your domain description" \
--documents 1000 \
--queries 100 \
--output output/my_dataset
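Once the dataset exists, it plugs into whatever retriever you're evaluating. A minimal recall@k loop, where search is your own retrieval function and the query fields follow the hypothetical record shown earlier:

def recall_at_k(queries, search, k=10):
    # Average fraction of ground-truth documents that appear in the top-k results.
    total = 0.0
    for q in queries:
        retrieved = {doc_id for doc_id, _score in search(q["query"], q["filters"], k=k)}
        relevant = set(q["ground_truth_ids"])
        total += len(retrieved & relevant) / max(len(relevant), 1)
    return total / len(queries)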