Introducing: FTS5 ICU Tokenizer for Better Multilingual Text Search

FTS5 ICU Tokenizer for SQLite

This project provides custom FTS5 tokenizers for SQLite. They use the International Components for Unicode (ICU) library for robust word segmentation across many languages.

It is written in C for maximum stability and performance, making it suitable for high-availability systems. The target locale is configurable at build time, with support for both universal and locale-specific tokenizers.

Prerequisites

Before you begin, ensure you have the following installed on your system:

  • CMake (version 3.10 or higher)
  • A C Compiler (GCC, Clang, or MSVC)
  • SQLite3 development libraries (libsqlite3-dev on Debian/Ubuntu, sqlite-devel on Fedora/CentOS)
  • ICU development libraries (libicu-dev on Debian/Ubuntu, `libicu-dev…
