Stochastic Variational Inference LDA in C++
This repository contains an optimized C++ implementation of Latent Dirichlet Allocation (LDA) using Stochastic Variational Inference (SVI) [1]. Designed for scale, it uses multithreading (OpenMP) and careful memory reuse, avoids wasteful allocations, and follows cache-friendly data paths for fast training on large corpora. This implementation was tested on the Wikipedia dataset with over 1 billion tokens; training takes only a few minutes (using 200 topics on a 32-core Xeon 2.10 GHz machine with 512 GB of RAM).
The benchmarking framework trains LDA models using SVI, exports model snapshots in a format compatible with MALLET [2], and then uses MALLET to compute log-likelihood and perplexity metrics for model evaluation.
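For context, SVI as described in [1] optimizes the variational objective by sampling minibatches of documents and taking noisy natural-gradient steps on the global topic parameters λ. A sketch of the core update (η is the Dirichlet prior on topics, φ the local variational word-topic assignments, and τ, κ the learning-rate parameters):

```latex
% Intermediate estimate from a minibatch S of documents, treating the
% corpus as if it contained D/|S| copies of each sampled document:
\hat{\lambda}_{kw} = \eta + \frac{D}{|S|} \sum_{d \in S} \sum_{n} \phi_{dn}^{k}\, \mathbb{1}[w_{dn} = w]

% Stochastic update with step size \rho_t = (t + \tau)^{-\kappa}:
\lambda^{(t+1)}_{kw} = (1 - \rho_t)\,\lambda^{(t)}_{kw} + \rho_t\,\hat{\lambda}_{kw}
```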
Quick Start Guide
Downloading UCI Datasets
bash scripts/get_uci_datasets.sh
This downloads and prepares standard LDA benchmark datasets (KOS, NYTIMES, PUBMED).
Formatting the data in the format required by SVI
To convert them to the SVI format, use the provided conversion tool (you may need to adjust the paths to match where your data is): extras/lda_svi_formatting/src/ConvertDataToSVIFormat.cpp
Running the Experiments
1. Compile the SVI LDA Code
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ ..
make clean && make
2. Execute Benchmarks
Run the automated benchmarking script:
bash scripts/run_svi_benchmark.sh
This script orchestrates the complete experimental pipeline, including model training, and evaluation across multiple datasets and configurations. You may modify this script to adjust the experiments to your needs.
Output
Results are saved in the benchmarks/csv/ directory. Metrics include:
loglik.csv: Log-likelihood over iterations
perplexity.csv: Perplexity scores
plotdata.csv: Perplexity over time in milliseconds
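For reference, perplexity as typically reported by MALLET-style evaluators is the exponentiated negative average per-token held-out log-likelihood (lower is better; exact details depend on the evaluator configuration):

```latex
\mathrm{perplexity} = \exp\!\left( -\,\frac{\sum_{d} \log p(\mathbf{w}_d)}{\sum_{d} N_d} \right)
```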
Project Structure
.
├── build/ # Build artifacts and compiled executable
├── benchmarks/ # Benchmark results and logs
│ ├── csv/ # Performance metrics (log-likelihood, perplexity)
│ └── logs/ # Execution time logs
├── conf/ # Configuration files
├── data/ # Training and test datasets
├── src/ # C++ source code
│ ├── main.cpp # Entry point with argument parsing
│ ├── Svi_ldaP.cpp/h # Multithreaded core SVI algorithm
│ ├── ReadSparseData.cpp/h # Data loading utilities
│ └── ...
├── scripts/ # Shell and Python scripts for benchmarking
├── external/ # External dependencies (MALLET)
├── extras/ # Additional tools and utilities
└── CMakeLists.txt # CMake build configuration
References
[1] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. J. Mach. Learn. Res. 14, 1 (January 2013), 1303–1347.
[2] Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu