# FTS5 ICU Tokenizer for SQLite
This project provides custom FTS5 tokenizers for SQLite that use the International Components for Unicode (ICU) library for robust word segmentation across many languages.

It is written in C for maximum stability and performance, making it suitable for high-availability systems. The target locale is configurable at build time, with support for both universal and locale-specific tokenizers.
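Under the hood, each tokenizer walks ICU word-boundary positions over the input text. The standalone sketch below illustrates that core idea using ICU4C's `UBreakIterator` C API; it is a simplified demonstration, not this project's actual source, and the hard-coded locale and ASCII-only printing are assumptions made for brevity.

```c
/*
 * Minimal sketch of the segmentation idea behind the tokenizer:
 * ICU's UBreakIterator reports word boundaries, and each run whose
 * rule status marks it as a real word becomes a token. The actual
 * extension wires this logic into SQLite's fts5_tokenizer interface.
 * Build (Linux): gcc demo.c -licuuc -o demo
 */
#include <stdio.h>
#include <unicode/ubrk.h>
#include <unicode/ustring.h>

int main(void) {
    UErrorCode status = U_ZERO_ERROR;
    UChar text[64];
    u_uastrcpy(text, "hello FTS5 world");  /* ASCII demo input */

    /* The project selects the locale at build time; "en" is a stand-in. */
    UBreakIterator *bi = ubrk_open(UBRK_WORD, "en", text, -1, &status);
    if (U_FAILURE(status)) return 1;

    int32_t start = ubrk_first(bi);
    for (int32_t end = ubrk_next(bi); end != UBRK_DONE;
         start = end, end = ubrk_next(bi)) {
        /* Statuses below UBRK_WORD_NONE_LIMIT are spaces/punctuation. */
        if (ubrk_getRuleStatus(bi) >= UBRK_WORD_NONE_LIMIT) {
            char token[64];
            u_austrncpy(token, text + start, end - start);
            token[end - start] = '\0';  /* safe here: ASCII-only demo */
            printf("token: %s\n", token);
        }
    }
    ubrk_close(bi);
    return 0;
}
```

For languages written without spaces (Thai, Japanese, Chinese), the same API applies ICU's dictionary- and rule-based segmentation, which is what makes ICU attractive as an FTS5 tokenizer backend.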
## Prerequisites
Before you begin, ensure you have the following installed on your system:

- CMake (version 3.10 or higher)
- A C compiler (GCC, Clang, or MSVC)
- SQLite3 development libraries (`libsqlite3-dev` on Debian/Ubuntu, `sqlite-devel` on Fedora/CentOS)
- ICU development libraries (`libicu-dev` on Debian/Ubuntu, `libicu-devel` on Fedora/CentOS)
## Building and Installing
This project uses a standard CMake build process. The target locale can be specified using the `LOCALE` variable.
### Quick Build and Test
For convenience, this project includes scripts to build and test all supported locales:
```sh
# Build all tokenizers
./scripts/build_all.sh

# Test all tokenizers
./scripts/test_all.sh
```
For detailed information about building and testing, see `docs/BUILD_TEST_README.md`.
### Building Individual Locales
You can build individual locales using CMake directly:
```sh
# Create build directory
mkdir build
cd build

# Configure for a specific locale (e.g., Japanese)
cmake .. -DLOCALE=ja

# Build
cmake --build .

# Install the library (optional)
sudo cmake --build . --target install
```
(Alternatively, on Linux/macOS, you can just run `make` and `sudo make install`.)
### Building on Windows
This project can be built on Windows using Visual Studio and CMake. Here’s how:
#### Prerequisites for Windows
- Visual Studio 2022 with C++ development tools
  - Download from https://visualstudio.microsoft.com/
  - Ensure you install the “Desktop development with C++” workload
- CMake 3.10 or higher
  - Download from https://cmake.org/download/
  - Ensure CMake is added to your system PATH during installation
- SQLite pre-compiled binaries and source code
- ICU4C pre-compiled binaries
#### Step 1: Download and Extract Dependencies
##### SQLite

Download the pre-compiled SQLite binaries:

- Visit https://www.sqlite.org/download.html
- Download the “Precompiled Binaries for Windows” (`sqlite-dll-win64-x64-*.zip`)
- Extract to a directory of your choice (e.g., `C:\sqlite`)

Download the SQLite source code:

- From the same page, download “Source Code” (`sqlite-src-*.zip`)
- Extract to a directory of your choice (e.g., `C:\sqlite-src`)
##### ICU4C

Download the pre-compiled ICU4C binaries:

- Visit https://github.com/unicode-org/icu/releases
- Download the latest Windows binaries (e.g., `icu4c-*-Win64-msvc.zip`)
- Extract to a directory of your choice (e.g., `C:\icu`)
#### Step 2: Generate SQLite Header Files

1. Open “Developer PowerShell for VS 2022” (from the Start Menu); this is the recommended shell.
2. Navigate to your SQLite source directory:
   ```sh
   cd C:\sqlite-src
   ```
3. Generate the `sqlite3.h` header file:
   ```sh
   nmake /f Makefile.msc sqlite3.h
   ```
4. Create an include directory in your SQLite binaries folder:
   ```sh
   mkdir C:\sqlite\include
   ```
5. Copy the generated header files:
   ```sh
   copy sqlite3.h C:\sqlite\include\sqlite3.h
   copy src\sqlite3ext.h C:\sqlite\include\sqlite3ext.h
   ```
#### Step 3: Generate SQLite Import Library

1. In the same PowerShell window, navigate to your SQLite binaries directory:
   ```sh
   cd C:\sqlite
   ```
2. Generate the import library from the DEF file:
   ```sh
   lib /def:sqlite3.def /out:sqlite3.lib /machine:x64
   ```
#### Step 4: Build the FTS5 ICU Tokenizer

1. Clone or download this repository to a directory of your choice.
2. Create a build directory:
   ```sh
   mkdir build
   cd build
   ```
3. Configure with CMake (replace paths with your actual paths):
   ```sh
   cmake -G "Visual Studio 17 2022" -T host=x64 -A x64 .. `
     -DICU_ROOT="C:\icu" `
     -DSQLite3_INCLUDE_DIR="C:\sqlite\include" `
     -DSQLite3_LIBRARY="C:\sqlite\sqlite3.lib"
   ```
4. Build the project:
   ```sh
   cmake --build . --config Release
   ```
To build for a specific locale (e.g., Thai), add the `LOCALE` parameter during CMake configuration:

```sh
cmake -G "Visual Studio 17 2022" -T host=x64 -A x64 .. `
  -DICU_ROOT="C:\icu" `
  -DSQLite3_INCLUDE_DIR="C:\sqlite\include" `
  -DSQLite3_LIBRARY="C:\sqlite\sqlite3.lib" `
  -DLOCALE=th
```

This will create `fts5_icu_th.dll` and register the tokenizer as `icu_th`.
#### Step 5: Using the Extension
After successful compilation, you’ll find `fts5_icu.dll` (or `fts5_icu_xx.dll` for locale-specific builds) in the `build\Release` directory. To use it with SQLite:
##### Method 1: Simple Load (Windows)
The easiest and most reliable method on Windows is to copy the built `fts5_icu.dll`, together with `icudt77.dll` and `icuuc77.dll` from the pre-compiled ICU4C package (ICU 77 in this example), into your current directory, then run:

```sql
.load fts5_icu.dll
CREATE VIRTUAL TABLE documents USING fts5(
  content,
  tokenize = 'icu'
);
```
This method is recommended over loading by full path because of how Windows resolves DLL dependencies. When you load the extension with a full path (e.g., `.load ./build/Release/fts5_icu.dll`), Windows may not automatically search for the required ICU DLLs (`icudt77.dll` and `icuuc77.dll`) in the same directory. Instead, it follows the Windows DLL search order, which typically looks in:

1. The directory where the application (SQLite) is located
2. The system directory
3. The Windows directory
4. The current directory
5. The directories listed in the PATH environment variable

By copying the DLLs to the current directory and using the simple load command, you ensure that all required DLLs are found correctly.
## Usage
After compiling, you can load the specific tokenizer you built into SQLite.
Example for the Thai Tokenizer:
```sql
-- Provide the path to the specific library in your build directory.
.load ./build/libfts5_icu_th.so

-- Create a virtual table using the correctly named tokenizer
CREATE VIRTUAL TABLE documents_th USING fts5(
  content,
  tokenize = 'icu_th'
);

-- Insert and query Thai text
INSERT INTO documents_th(content) VALUES ('การทดสอบภาษาไทยในระบบค้นหา');
SELECT * FROM documents_th WHERE documents_th MATCH 'ภาษา';
```
Example for the Universal Tokenizer:
```sql
-- Provide the path to the specific library in your build directory.
.load ./build/libfts5_icu.so

-- Create a virtual table using the correctly named tokenizer
CREATE VIRTUAL TABLE documents USING fts5(
  content,
  tokenize = 'icu'
);

-- Insert and query text
INSERT INTO documents(content) VALUES ('甜蜜蜜,你笑得甜蜜蜜-หวานปานน้ำผึ้ง,ยิ้มของคุณช่างหวานปานน้ำผึ้ง');
SELECT * FROM documents WHERE documents MATCH 'หวาน';
SELECT * FROM documents WHERE documents MATCH '甜蜜蜜';
```
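The `.load` examples above use the sqlite3 command-line shell. From application code you can load the extension through SQLite's C API instead. The sketch below uses the documented `sqlite3_enable_load_extension`/`sqlite3_load_extension` calls; the library path is taken from the Thai example above, so adjust it to your own build.

```c
#include <stdio.h>
#include <sqlite3.h>

int main(void) {
    sqlite3 *db;
    char *err = NULL;
    if (sqlite3_open(":memory:", &db) != SQLITE_OK) return 1;

    /* Extension loading is disabled by default; enable it first. */
    sqlite3_enable_load_extension(db, 1);

    /* A NULL entry point lets SQLite derive the default init symbol,
       just as the CLI's .load command does. */
    if (sqlite3_load_extension(db, "./build/libfts5_icu_th.so",
                               NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "load failed: %s\n", err);
        sqlite3_free(err);
        return 1;
    }

    /* Once loaded, the tokenizer is available to FTS5 as usual. */
    if (sqlite3_exec(db,
            "CREATE VIRTUAL TABLE documents_th USING fts5("
            "content, tokenize='icu_th');",
            NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "exec failed: %s\n", err);
        sqlite3_free(err);
        return 1;
    }

    sqlite3_close(db);
    return 0;
}
```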
## Locale Name Mappings
For compatibility with common usage, this project supports alternative locale codes:
**Mapped with warnings** (the alias is converted to the standard code):

- Chinese: `cn` → `zh` (with warning)
- Japanese: `jp` → `ja` (with warning)
**Direct support** (both codes work without mapping or warnings):

- Korean: `kr` ↔ `ko`
- Hebrew: `iw` ↔ `he`
- Greek: `gr` ↔ `el`
When using the mapped alias codes (`cn` or `jp`), you will see a warning message during the build process informing you of the mapping, but the resulting library will use the standard ICU locale code in its name and functionality.
Examples:
```sh
# This will show a warning and build libfts5_icu_zh.so
cmake .. -DLOCALE=cn

# This will build libfts5_icu_zh.so directly without warning
cmake .. -DLOCALE=zh

# These will both build libfts5_icu_ko.so without any warnings
cmake .. -DLOCALE=kr
cmake .. -DLOCALE=ko
```
## Testing
This project includes comprehensive tests for all supported locales. You can run individual tests or all tests at once:
### Running All Tests
```sh
# Build all tokenizers
./scripts/build_all.sh

# Test all tokenizers
./scripts/test_all.sh
```
### Running Individual Locale Tests
```sh
# Test a specific locale (example with Japanese)
sqlite3 < tests/test_ja_tokenizer.sql

# Test the universal tokenizer
sqlite3 < tests/test_universal_tokenizer.sql
```
### Supported Locales and Test Files
Each supported locale has a corresponding test file in the `tests/` directory:

- ar (Arabic): `tests/test_ar_tokenizer.sql`
- el (Greek): `tests/test_el_tokenizer.sql`
- he (Hebrew): `tests/test_he_tokenizer.sql` (also supports the `iw` alias)
- ja (Japanese): `tests/test_ja_tokenizer.sql`
- ko (Korean): `tests/test_ko_tokenizer.sql` (also supports the `kr` alias)
- ru (Russian): `tests/test_ru_tokenizer.sql`
- th (Thai): `tests/test_th_tokenizer.sql`
- zh (Chinese): `tests/test_zh_tokenizer.sql` (also supports the `cn` alias)
- Universal: `tests/test_universal_tokenizer.sql`
Note: For locales with aliases (`cn` → `zh`, `jp` → `ja`, `kr` → `ko`, `iw` → `he`, `gr` → `el`), the same test file works for both the standard code and its aliases, as they all map to the same tokenizer functionality.
## Testing the ICU Transliterator
This project includes a simple C program to test the enhanced ICU transliterator rules with various language samples.
### Building and Running the Test
You can build and run the test program using the provided scripts:
```sh
# Build and run the test
./scripts/run_test.sh

# Or just build the test
./scripts/build_test.sh
```
The test program demonstrates that the transliterator correctly converts various scripts to Latin/ASCII form, including:
- Arabic
- Cyrillic
- Hebrew
- Greek
- Chinese (both Traditional and Simplified)
- Japanese
- Text with diacritics
The transliterator uses the following rule chain:

```
NFKD; Arabic-Latin; Cyrillic-Latin; Hebrew-Latin; Greek-Latin; Latin-ASCII; Lower; NFKC; Traditional-Simplified; Katakana-Hiragana
```
This ensures comprehensive script conversion and normalization for effective text search and indexing.
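If you want to experiment with this chain outside the bundled test program, the string above is a valid compound transliterator ID and can be passed directly to ICU4C's transliteration API. The sketch below is illustrative only (the bundled test program may be structured differently); compile with `-licuuc -licui18n` and save the source file as UTF-8.

```c
#include <stdio.h>
#include <unicode/utrans.h>
#include <unicode/ustring.h>

int main(void) {
    UErrorCode status = U_ZERO_ERROR;

    /* The rule chain doubles as a compound transliterator ID. */
    UChar id[256];
    u_uastrcpy(id, "NFKD; Arabic-Latin; Cyrillic-Latin; Hebrew-Latin; "
                   "Greek-Latin; Latin-ASCII; Lower; NFKC; "
                   "Traditional-Simplified; Katakana-Hiragana");

    UTransliterator *t = utrans_openU(id, -1, UTRANS_FORWARD,
                                      NULL, 0, NULL, &status);
    if (U_FAILURE(status)) return 1;

    /* Mixed Cyrillic/Greek sample; transliterated in place. */
    UChar buf[256];
    int32_t len = 0, limit = 0;
    u_strFromUTF8(buf, 256, &len, "Привет, Κόσμε!", -1, &status);
    limit = len;
    utrans_transUChars(t, buf, &len, 256, 0, &limit, &status);

    char out[512];
    u_strToUTF8(out, sizeof out, NULL, buf, len, &status);
    if (U_SUCCESS(status)) printf("%s\n", out);  /* roughly: privet, kosme! */

    utrans_close(t);
    return 0;
}
```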
### Locale-Specific Transliterator Rules
For optimized performance with specific languages, see `docs/BUILD_TEST_README.md` for information about locale-specific rules. The locale-specific tests can be found in the `tests/` directory.
## Why Locale-Specific Tokenizers Are More Efficient
When you build a tokenizer for a specific locale (e.g., `-DLOCALE=ja` for Japanese), the resulting library uses transliterator rules that are optimized for that language:
1. **Reduced Processing Overhead**: Instead of applying transformations for all possible scripts (Latin, Cyrillic, Arabic, Chinese, etc.), the locale-specific tokenizer only applies the transformations relevant to that language.
2. **Language-Appropriate Normalization**: Each locale uses normalization rules that are appropriate for that language’s characteristics:
   - Japanese (`ja`): `NFKD; Katakana-Hiragana; Lower; NFKC`
   - Chinese (`zh`): `NFKD; Traditional-Simplified; Lower; NFKC`
   - Thai (`th`): `NFKD; Lower; NFKC`
   - Korean (`ko`): `NFKD; Lower; NFKC`
   - Arabic (`ar`): `NFKD; Arabic-Latin; Lower; NFKC`
   - Russian (`ru`): `NFKD; Cyrillic-Latin; Lower; NFKC`
   - Hebrew (`he`): `NFKD; Hebrew-Latin; Lower; NFKC`
   - Greek (`el`): `NFKD; Greek-Latin; Lower; NFKC`
3. **Faster Text Processing**: By eliminating unnecessary script conversions, locale-specific tokenizers can process text significantly faster than the universal tokenizer.
4. **More Accurate Results**: Language-specific rules provide more accurate normalization and transliteration for the target language.
In contrast, the universal tokenizer uses a comprehensive rule set that includes transformations for all supported scripts:

```
NFKD; Arabic-Latin; Cyrillic-Latin; Hebrew-Latin; Greek-Latin; Latin-ASCII; Lower; NFKC; Traditional-Simplified; Katakana-Hiragana
```
While the universal tokenizer can handle text in any supported language, it has higher processing overhead because it must check and potentially apply all transformations for every piece of text.
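One plausible way to implement this build-time selection is a per-locale preprocessor define that picks the rule string at compile time. The macro names below are hypothetical, chosen for illustration; the project's actual CMake variables and defines may differ.

```c
/* Hypothetical build-time rule selection: CMake could add a define
 * such as -DTOKENIZER_LOCALE_JA when configured with -DLOCALE=ja.
 * These macro names are illustrative, not the project's actual ones. */
#if defined(TOKENIZER_LOCALE_JA)
#  define TRANSLIT_RULES "NFKD; Katakana-Hiragana; Lower; NFKC"
#elif defined(TOKENIZER_LOCALE_ZH)
#  define TRANSLIT_RULES "NFKD; Traditional-Simplified; Lower; NFKC"
#elif defined(TOKENIZER_LOCALE_AR)
#  define TRANSLIT_RULES "NFKD; Arabic-Latin; Lower; NFKC"
#else  /* universal build */
#  define TRANSLIT_RULES \
     "NFKD; Arabic-Latin; Cyrillic-Latin; Hebrew-Latin; Greek-Latin; " \
     "Latin-ASCII; Lower; NFKC; Traditional-Simplified; Katakana-Hiragana"
#endif
```

Because the chain is fixed at compile time, a locale-specific library never pays for transliterators it will not use.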
### When to Use Each Approach
- Use locale-specific tokenizers when you know the primary language of your text data and performance is important.
- Use the universal tokenizer when dealing with mixed-language content or when the language of the text is unknown at build time.