Stochastic Variational Inference LDA in C++
This repository contains an optimized C++ implementation of Latent Dirichlet Allocation (LDA) using Stochastic Variational Inference (SVI) [1]. Designed for scale, it uses multithreading (OpenMP) and careful memory reuse, avoids wasteful allocations, and follows cache-friendly data paths for fast training on large corpora. This implementation was tested on the Wikipedia dataset with over 1 billion tokens; training takes only a few minutes (using 200 topics on a 32-core Xeon 2.10 GHz machine with 512 GB of RAM).
The benchmarking framework trains LDA models using SVI, exports model snapshots in a format compatible with MALLET [2], and then uses MALLET to compute log-likelihood and perplexity metrics for model evaluation.
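For context, SVI as described in [1] optimizes the variational objective by sampling minibatches of documents and taking noisy natural-gradient steps on the global topic parameters λ. A sketch of the core update (η is the Dirichlet prior on topics, φ the local variational word-topic assignments, and τ, κ the learning-rate parameters):

```latex
% Intermediate estimate from a minibatch S of documents, treating the
% corpus as if it contained D/|S| copies of each sampled document:
\hat{\lambda}_{kw} = \eta + \frac{D}{|S|} \sum_{d \in S} \sum_{n} \phi_{dn}^{k}\, \mathbb{1}[w_{dn} = w]

% Stochastic update with step size \rho_t = (t + \tau)^{-\kappa}:
\lambda^{(t+1)}_{kw} = (1 - \rho_t)\,\lambda^{(t)}_{kw} + \rho_t\,\hat{\lambda}_{kw}
```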
Quick Start Guide
Downloading UCI Datasets
bash scripts/get_uci_datasets.sh
This downloads and prepares standard LDA benchmark datasets (KOS, NYTIMES, PUBMED).
Formatting the data in the format required by SVI
To convert them to the SVI format, use the provided conversion tool (you may need to adjust the paths to match where your data is): extras/lda_svi_formatting/src/ConvertDataToSVIFormat.cpp
Running the Experiments
1. Compile the SVI LDA Code
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ ..
make clean && make
2. Execute Benchmarks
Run the automated benchmarking script:
bash scripts/run_svi_benchmark.sh
This script orchestrates the complete experimental pipeline, including model training, and evaluation across multiple datasets and configurations. You may modify this script to adjust the experiments to your needs.
Output
Results are saved in the benchmarks/csv/ directory. Metrics include:
loglik.csv: Log-likelihood over iterations
perplexity.csv: Perplexity scores
plotdata.csv: Perplexity over time in milliseconds
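For reference, perplexity as typically reported by MALLET-style evaluators is the exponentiated negative average per-token held-out log-likelihood (lower is better; exact details depend on the evaluator configuration):

```latex
\mathrm{perplexity} = \exp\!\left( -\,\frac{\sum_{d} \log p(\mathbf{w}_d)}{\sum_{d} N_d} \right)
```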
Project Structure
.
├── build/ # Build artifacts and compiled executable
├── benchmarks/ # Benchmark results and logs
│ ├── csv/ # Performance metrics (log-likelihood, perplexity)
│ └── logs/ # Execution time logs
├── conf/ # Configuration files
├── data/ # Training and test datasets
├── src/ # C++ source code
│ ├── main.cpp # Entry point with argument parsing
│ ├── Svi_ldaP.cpp/h # Multithreaded core SVI algorithm
│ ├── ReadSparseData.cpp/h # Data loading utilities
│ └── ...
├── scripts/ # Shell and Python scripts for benchmarking
├── external/ # External dependencies (MALLET)
├── extras/ # Additional tools and utilities
└── CMakeLists.txt # CMake build configuration
References
[1] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. J. Mach. Learn. Res. 14, 1 (January 2013), 1303–1347.
[2] Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu