RT by @AravSrinivas: We're open-sourcing the Unigram tokenizer we rebuilt to reduce CPU utilization by 5-6x.
Small rerankers and embedders run in single-digit milliseconds on GPU, making CPU tokenization a meaningful share of total latency.