Introduction
We trained and open-sourced a bilingual Semantic Highlight model that achieves state-of-the-art performance on both English and Chinese. The model automatically identifies and highlights the sentences in retrieved documents that are semantically relevant to a query, rather than relying on keyword matching.
Model Release:
- HuggingFace: zilliz/semantic-highlight-bilingual-v1
- License: MIT (commercial-friendly)
- Architecture: 0.6B Encoder-Only model based on BGE-M3 Reranker v2
- Context Window: 8192 tokens
- Supported Languages: English and Chinese
In this article, we’ll share our technical approach.
The Problem: RAG Token Cost and Quality
In production RAG systems, a typical query retrieves 10 documents with several thousand tokens each, consuming tens of thousands of tokens per query. The problem: only a few dozen sentences actually contain relevant information, while the rest is noise that increases costs and degrades answer quality.
This creates an urgent need for a targeted highlight model that keeps only the contextually relevant sentences by highlighting them and prunes away the irrelevant noise, a technique also widely known as context pruning.
Traditional keyword-based highlighting can’t solve this problem. When a user asks, "How to improve Python code execution efficiency?", traditional systems can only highlight words like "Python" and "efficiency." But the truly useful content—"Use numpy vectorized operations instead of loops"—contains none of the query terms and gets ignored.
This problem becomes even more severe in AI Agent scenarios, where queries are complex instructions produced by reasoning and task decomposition. Traditional highlighting mechanically marks matching words but misses the truly valuable analytical conclusions.
Semantic Highlighting solves this problem. It identifies sentences that semantically answer the query, even without keyword matches. This approach offers:
- 70-80% token cost reduction by sending only highlighted sentences to the LLM
- Improved answer quality as the LLM focuses on relevant content
- System interpretability showing why documents were retrieved and which sentences matter
- Easier debugging for engineers to trace retrieval issues
What we need is a lightweight, fast, and cost-effective small model (hundreds of MB, millisecond-level inference) deployable on search servers for real-time computation.
The Dilemma of Existing Models
We investigated existing solutions but found they didn’t quite meet our requirements.
OpenSearch’s Model: Limited Context Window
OpenSearch released opensearch-semantic-highlighter-v1, a model specifically for semantic highlighting.
However, it’s based on the BERT architecture with a 512-token limit—roughly 400-500 English words, which is not enough for real-world scenarios.
Provence/XProvence: Multilingual Trade-offs
Naver’s Provence model series was trained for Context Pruning—a task technically similar to Semantic Highlighting.
Provence is a monolingual English model with strong performance. XProvence extends this to over a dozen languages, but multilingual models typically show performance degradation compared to their monolingual counterparts.
There’s also a licensing consideration: both use the CC BY-NC 4.0 license, which restricts commercial use.
Open Provence: Open-source but Only English and Japanese
Open Provence is an outstanding open-source project that fully reproduces Provence’s training pipeline.
It includes training scripts, data processing tools, evaluation frameworks, and pre-trained models at different scales—all under an MIT license.
However, it currently supports only English and Japanese.
Our Choice: Train a Bilingual Model
No existing model can meet all our needs:
- Supports both English and Chinese
- Large enough context window
- Good out-of-domain generalization
- Good performance in Semantic Highlight scenarios
- Friendly license (MIT or Apache 2.0)
Since no suitable model exists on the market, we decided to train one ourselves.
Our Technical Approach
Training a model in this scenario isn’t inherently difficult; what’s challenging is training a good model that overcomes all the above problems and achieves near-SOTA performance. Our approach:
On the model side, we use the classic Encoder-Only small model architecture for fast inference performance.
On the data side, data quality determines model quality. We use reasoning LLMs to generate high-quality annotations and leverage local model inference frameworks to accelerate and scale data generation.
Model Architecture: The Provence Approach
We adopted the Provence approach, which uses a lightweight Encoder-Only model that frames context pruning as a token-level scoring task.
Why Encoder-Only?
Although BERT-like Encoder-Only architectures are no longer the latest technology, they are significantly faster and more efficient than modern decoder-based LLMs. Their key property is that a single parallel forward pass produces a score for every token position, both during training and at inference time.
Inference Process:
The inference process is straightforward; see the code sketch after these steps:
- Concatenate inputs as [BOS] + Query + Context
- Score each token in the context (between 0 and 1)
- Average token scores within each sentence to obtain sentence scores
- Highlight sentences with high scores while removing those with low scores
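Here is a minimal sketch of this pipeline in Python, assuming the checkpoint loads as a standard Hugging Face token-classification model with a single relevance label per token; check the model card for the exact loading code.

```python
# A minimal inference sketch. Assumptions: the checkpoint loads via
# AutoModelForTokenClassification and emits one relevance logit per token.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL = "zilliz/semantic-highlight-bilingual-v1"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(MODEL)

def highlight(query: str, sentences: list[str], threshold: float = 0.5):
    """Score each sentence against the query and keep the high scorers."""
    context = " ".join(sentences)
    # [BOS] + Query + Context, exactly as described above
    enc = tokenizer(query, context, return_tensors="pt", truncation=True,
                    max_length=8192, return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    seq_ids = enc.sequence_ids(0)  # 0 = query tokens, 1 = context tokens
    with torch.no_grad():
        token_scores = torch.sigmoid(model(**enc).logits[0, :, 0]).tolist()

    # Average token scores within each sentence via character offsets
    results, start = [], 0
    for sent in sentences:
        end = start + len(sent)
        scores = [s for s, (cs, ce), sid in zip(token_scores, offsets, seq_ids)
                  if sid == 1 and cs >= start and ce <= end and ce > cs]
        mean = sum(scores) / len(scores) if scores else 0.0
        results.append((sent, mean, mean >= threshold))
        start = end + 1  # skip the joining space
    return results
```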
Base Model: BGE-M3 Reranker v2
We selected BGE-M3 Reranker v2 as our base model for several reasons:
- It employs an Encoder architecture suitable for token and sentence scoring
- Supports multiple languages with optimization for both English and Chinese
- Provides an 8192-token context window appropriate for longer RAG documents
- Maintains 0.6B parameters—strong enough without being computationally heavy
- Ensures sufficient world knowledge in the base model
- Trained for reranking, which closely aligns with relevance judgment tasks
Training Data: LLM Annotation with Reasoning Process
The key to our success was data construction. We had the LLM (Qwen3 8B) output its complete reasoning process during annotation. The annotation workflow is as follows:
Each training sample includes not just Query, Context, and Sentence Spans fields, but also an important Think Process field to record the reasoning process of the LLM.
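For illustration, a single training sample looks roughly like this; the field names and values are descriptive, not a guaranteed schema of the released dataset:

```python
# Illustrative layout of one training sample (hypothetical field names).
sample = {
    "query": "How to improve Python code execution efficiency?",
    "context": "Use numpy vectorized operations instead of loops. "
               "Python was first released in 1991.",
    "sentence_spans": [(0, 50), (51, 85)],  # character offsets (illustrative)
    "labels": [1, 0],                       # 1 = highlight, 0 = prune
    "think_process": "The first sentence gives a concrete way to speed up "
                     "Python code, so it answers the query; the second is "
                     "background and irrelevant.",
}
```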
Why include the reasoning process?
This approach provides several benefits:
- Higher annotation quality: Writing the reasoning process serves as self-verification, reducing errors
- Observable and debuggable: We can see why specific sentences were selected, and whether incorrect annotations stem from prompt issues or knowledge gaps
- Data reusability: Provides reference explanation patterns for future re-annotation with different models
Why Qwen3 8B?
We used Qwen3 8B for annotation because it naturally supports a thinking mode with <think> outputs. The 8B size strikes the right balance—smaller models lack stability, while larger ones are too slow and expensive.
We ran annotation using a local vLLM service rather than cloud APIs, trading local GPU time for API token costs while gaining high concurrent throughput.
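A simplified sketch of one annotation call against vLLM's OpenAI-compatible endpoint follows; the prompt wording and output parsing are illustrative, not our exact production setup.

```python
# Sketch of one annotation call against a local vLLM server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def annotate(query: str, sentences: list[str]) -> dict:
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(sentences))
    prompt = (f"Query: {query}\n\nSentences:\n{numbered}\n\n"
              "Think step by step about which sentences answer the query, "
              "then output the relevant sentence numbers as a JSON array "
              "on the last line.")
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-8B",
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content
    # Qwen3's thinking mode wraps reasoning in <think>...</think>; with some
    # vLLM configs it arrives in a separate reasoning field instead.
    think, _, answer = text.partition("</think>")
    return {"think_process": think.replace("<think>", "").strip(),
            "labels": answer.strip().splitlines()[-1]}
```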
Dataset Scale:
Ultimately, we constructed nearly 5 million bilingual training samples, split evenly between English and Chinese.
- English data came from MS MARCO, Natural Questions, and GooAQ
- Chinese data came from DuReader, Chinese Wikipedia, and mmarco_chinese
Some data came from Open Provence and similar sources with re-annotation, while other portions were generated from raw corpora through query and context generation, followed by annotation.
All annotated training data is also available on HuggingFace for community development and training reference: https://huggingface.co/zilliz/datasets
Training Process
With the model architecture and dataset prepared, we trained on 8 A100 GPUs for 3 epochs over approximately 9 hours.
The training focused on the Pruning Head for the Semantic Highlight task without training the Rerank Head, which helped us achieve better performance on this specific task.
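Conceptually, the Pruning Head objective is token-level binary cross-entropy over context tokens, along the lines of this sketch; our actual training follows the Open Provence pipeline, so treat this as illustrative only.

```python
# Conceptual sketch of the token-level objective: BCE over context-token
# relevance labels, with query and special tokens masked out.
import torch
import torch.nn.functional as F

def pruning_loss(logits: torch.Tensor,       # (batch, seq_len) raw head outputs
                 token_labels: torch.Tensor, # 1 if token is in a relevant sentence
                 context_mask: torch.Tensor  # 1 only for context tokens
                 ) -> torch.Tensor:
    per_token = F.binary_cross_entropy_with_logits(
        logits, token_labels.float(), reduction="none")
    return (per_token * context_mask).sum() / context_mask.sum()
```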
Evaluation Results: Achieving SOTA Performance
We compared different models’ performance across multiple datasets, including:
- English multi-span QA dataset (multispanqa)
- Wikipedia out-of-domain dataset (wikitext2)
- Chinese multi-span QA dataset (multispanqa_zh)
- Chinese version of Wikipedia out-of-domain dataset (wikitext2_zh)
Evaluated models include the Open Provence series, Naver’s Provence/XProvence series, OpenSearch’s semantic-highlighter, and our trained bilingual model.
Key findings:
- Our model ranks first across all four evaluation datasets
- It’s the only model that demonstrates strong performance on both English and Chinese
- Other models either support only English or show significant performance degradation on Chinese text
Real-World Case Study: Precisely Identifying Core Sentences
Beyond benchmark scores, let’s examine a more interesting example to intuitively demonstrate our model’s performance in practical applications.
Question: "Who wrote The Killing of a Sacred Deer?"
Text (5 sentences total):
1. The Killing of a Sacred Deer is a 2017 psychological horror film directed by Yorgos Lanthimos, with a screenplay by Lanthimos and Efthymis Filippou.
2. The film stars Colin Farrell, Nicole Kidman, Barry Keoghan, Raffey Cassidy, Sunny Suljic, Alicia Silverstone, and Bill Camp.
3. The story is based on the ancient Greek playwright Euripides' play Iphigenia in Aulis.
4. The film tells the story of a cardiac surgeon (Farrell) who secretly befriends a teenager (Keoghan) connected to his past.
5. He introduces the boy to his family, who then mysteriously fall ill.
Correct Answer: Sentence 1 (explicitly states "screenplay by Lanthimos and Efthymis Filippou")
This example has a trap: Sentence 3 mentions that "Euripides" wrote the original play. But the question asks "who wrote the film The Killing of a Sacred Deer," and the answer should be the film’s screenwriters, not the Greek playwright from thousands of years ago.
Model Performance:
| Model | Found Correct Answer | Prediction |
|---|---|---|
| Our Model | ✓ | Selected sentences 1 (correct) and 3 |
| XProvence v1 | ✗ | Only selected sentence 3, missed correct answer |
| XProvence v2 | ✗ | Only selected sentence 3, missed correct answer |
Key Sentence Score Comparison:
| Sentence | Our Model | XProvence v1 | XProvence v2 |
|---|---|---|---|
| Sentence 1 (film screenplay, correct answer) | 0.915 | 0.133 | 0.081 |
| Sentence 3 (original play, distractor) | 0.719 | 0.947 | 0.802 |
The results are revealing:
XProvence’s Problem:
- Strongly attracted to "Euripides" and "play," giving sentence 3 very high scores (0.947 and 0.802)
- Completely ignores the actual answer (sentence 1), giving extremely low scores (0.133 and 0.081)
- Even when lowering the threshold from 0.5 to 0.2, it still can’t find the correct answer
Our Model’s Performance:
- Gives the correct answer (sentence 1) a high score of 0.915, clearly identifying the film screenwriters
- Also gives sentence 3 some score (0.719) since it mentions information related to the play
- The distinction is clear: 0.915 vs 0.719, with a gap of nearly 0.2
This example demonstrates our model’s key strength: understanding the true intent of questions.
In the context of a film encyclopedia, "Who wrote The Killing of a Sacred Deer" clearly asks about the film’s screenwriters. Although the text contains both screenplay and original play information, our model accurately identifies what the user is looking for.
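You can re-run the two contested sentences through the highlight() sketch from earlier in a few lines; exact scores depend on the released checkpoint, and the tables above show what our model produces.

```python
# Reproducing the case study with the highlight() sketch defined above.
query = "Who wrote The Killing of a Sacred Deer?"
sentences = [
    "The Killing of a Sacred Deer is a 2017 psychological horror film "
    "directed by Yorgos Lanthimos, with a screenplay by Lanthimos and "
    "Efthymis Filippou.",
    "The story is based on the ancient Greek playwright Euripides' play "
    "Iphigenia in Aulis.",
]
for sent, score, keep in highlight(query, sentences):
    print(f"{score:.3f} {'KEEP' if keep else 'DROP'} {sent[:60]}...")
```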
Standing on the Shoulders of Giants
This model’s development builds on significant prior work, and we want to acknowledge the contributions that made our work possible:
- Provence’s theoretical foundation: Proposed an elegant approach of using lightweight Encoder models for context pruning
- Open Provence codebase: Provided well-implemented training pipelines, data processing, and model heads with open-source licensing
Building on these foundations, we contributed several innovations:
- LLM annotation with reasoning processes to improve data quality
- Nearly 5 million bilingual training samples covering English and Chinese scenarios aligned with practical needs
- Selection of a base model more suitable for RAG scenarios (BGE-M3 Reranker v2)
- Focused training on the Pruning Head for the Semantic Highlight task
We sincerely thank the Provence team and Open Provence project contributors for their foundational work.
Open Source Release and Getting Started
We’re now open-sourcing our model under the MIT license, making it safe for commercial use.
Model Download:
- HuggingFace: zilliz/semantic-highlight-bilingual-v1
Training Data:
- HuggingFace: https://huggingface.co/zilliz/datasets
Additionally, we’re working on serving model inference and integrating it into Milvus as a Semantic Highlight interface. This will be available soon.
Conclusion
In this article, we shared our journey from identifying the token cost problem in production RAG systems to building a state-of-the-art bilingual Semantic Highlighting model:
- We analyzed the limitations of traditional keyword-based highlighting across different scenarios
- We evaluated existing solutions and identified their shortcomings
- We developed a novel training methodology using LLM annotation with reasoning processes
- We achieved SOTA performance on both English and Chinese datasets
- We open-sourced our model and training data under the MIT license for the community
This model addresses multiple real-world production requirements: strong bilingual performance, a sufficient context window, good generalization, and a commercially friendly open-source license.
We hope this model helps developers build better RAG/Agent systems at lower cost and higher quality while improving debuggability and interpretability. It can also extend to any text retrieval system, such as recommendation systems, as a semantic highlighting feature. Feel free to try it out and share your feedback anytime.
Related Links
- Model Download: zilliz/semantic-highlight-bilingual-v1
- Open Provence Project: hotchpotch/open_provence
- Provence Paper: arXiv:2501.16214
- Provence Official Introduction: Provence: efficient and robust context pruning for retrieval-augmented generation
- XProvence Model: naver/xprovence-reranker-bgem3-v1
- Milvus: milvus.io
- Zilliz Cloud: zilliz.com