GitHub - purijs/fasttfidf: High-performance TF-IDF vectorization for large-scale text datasets that exceed available memory.

fasttfidf

fasttfidf is a Python library that provides TF-IDF vectorization with automatic memory management and SIMD acceleration. It processes datasets larger than RAM using memory-mapped files and a streaming architecture, making it practical to work with multi-gigabyte text corpora on commodity hardware.
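To make the memory-mapped, streaming idea concrete, here is a minimal standard-library sketch. It is not fasttfidf's API or implementation: the `document_frequencies` helper, the `TOKEN` pattern, and the one-document-per-line file layout are all assumptions chosen for illustration. It maps a corpus file with `mmap` and counts document frequencies one line at a time, so only the running counts stay in Python memory.

```python
import mmap
import os
import re
import tempfile
from collections import Counter

# Hypothetical tokenizer for illustration: runs of lowercase ASCII letters/digits.
TOKEN = re.compile(rb"[a-z0-9]+")


def document_frequencies(path):
    """Count, for each term, how many lines (documents) contain it.

    Assumes one document per line and a non-empty file
    (mmap cannot map an empty file).
    """
    df = Counter()
    n_docs = 0
    with open(path, "rb") as fh, \
            mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # readline() returns b"" at end of file; the OS pages data in on demand,
        # so the whole corpus is never copied into a Python object at once.
        for line in iter(mm.readline, b""):
            terms = set(TOKEN.findall(line.lower()))
            if not terms:
                continue  # skip blank lines
            df.update(terms)
            n_docs += 1
    return df, n_docs


# Tiny demo corpus written to a temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"the cat sat\nthe dog sat\nthe bird flew\n")
    corpus_path = tmp.name

df, n_docs = document_frequencies(corpus_path)
os.remove(corpus_path)
```

The same loop works unchanged on a file far larger than RAM, because memory use is bounded by the vocabulary size rather than by the corpus size.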

Key Features

  • Memory-efficient processing: Handles datasets larger than available RAM through streaming and automatic memory management
  • SIMD optimization: Leverages AVX2 (x86_64) and NEON (ARM) instruction sets for accelerated text processing
  • Multiprocessing: Parallel vocabulary building across CPU cores with automatic load balancing
  • Batch training support: Train models incrementally without loading the full dataset into memory
  • Vocabulary exploration
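
The batch training idea in the list above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not fasttfidf's interface: the `IncrementalTfidf` class, its `partial_fit` and `transform` methods, and the smoothed IDF formula are hypothetical choices made for this example. The model keeps only document-frequency counts, so each batch can be discarded once it has been counted.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer for illustration: runs of lowercase ASCII letters/digits.
TOKEN = re.compile(r"[a-z0-9]+")


class IncrementalTfidf:
    """Hypothetical incremental TF-IDF model (not the fasttfidf API)."""

    def __init__(self):
        self.df = Counter()  # term -> number of documents containing it
        self.n_docs = 0

    def partial_fit(self, batch):
        """Update document frequencies from one batch of strings."""
        for doc in batch:
            self.df.update(set(TOKEN.findall(doc.lower())))
            self.n_docs += 1
        return self

    def transform(self, doc):
        """Return an L2-normalized {term: weight} mapping for one document."""
        tf = Counter(TOKEN.findall(doc.lower()))
        weights = {
            # Smoothed IDF (an assumption): ln((1 + N) / (1 + df)) + 1
            term: count * (math.log((1 + self.n_docs) / (1 + self.df[term])) + 1.0)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        return {term: w / norm for term, w in weights.items()}


model = IncrementalTfidf()
model.partial_fit(["the cat sat", "the dog sat"])  # first batch
model.partial_fit(["the bird flew"])               # second batch, seen later
vec = model.transform("the cat")
```

Because "the" appears in every document and "cat" in only one, "cat" receives the larger weight in `vec`. A parallel vocabulary build could merge per-worker `Counter` objects in the same way, since document-frequency counts are additive across disjoint batches.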
