Introduction
The Pre-Qin period represents one of the most intellectually vibrant eras in Chinese history, when the Hundred Schools of Thought flourished and developed grand ideological systems encompassing philosophy, politics, ethics, and conceptions of nature. The terminologies of major schools such as Confucianism, Daoism, Mohism, and Legalism not only embody profound philosophical significance but also reveal divergent interpretations of key concepts such as “天” (Heaven), “道” (Dao), and “禮” (Ritual)1,2. These terms constitute both the foundation of Chinese intellectual history and critical entry points for understanding the spiritual essence of traditional Chinese thought. However, compiling dictionaries of Pre-Qin philosophical terms remains challenging due to polysemy, overlapping doctrines, and the large volume of classical texts written in archaic and context-dependent language3.
In response, computational linguistic techniques have been increasingly adopted for historical and classical text analysis. Tools such as ParaConc and statistical models including TF-IDF and topic modeling have been applied to terminology identification, semantic clustering, and thematic classification4,5. While these approaches improve efficiency, they often fall short when applied to Pre-Qin texts: the highly abstract language, complex syntactic structures, and doctrinal polysemy make it difficult to capture semantic nuance. Statistical or alignment-based methods tend to oversimplify relationships and cannot adequately handle contextual variability. These limitations underscore the need for more semantically robust approaches.
In recent years, natural language processing (NLP) has undergone a paradigm shift with the emergence of large language models (LLMs), which offer enhanced capabilities in semantic understanding and contextual modeling6,7. In the domain of classical Chinese text processing, LLMs have been applied to a wide range of tasks, leading to methodological breakthroughs8,9. Pretrained models have achieved substantial success in word segmentation, named entity recognition, and automatic annotation, thereby improving the infrastructure for computational philology10,11,12. In text generation, LLMs have advanced the quality of translations between classical Chinese, modern Chinese, and English13. Specialized adaptations—such as Redundancy-Aware Tuning (RAT) and retrieval-augmented modeling—have been integrated into frameworks like TongGu-LLM, enhancing model sensitivity to syntactic redundancy and abstract conceptual structures14. Domain adaptation strategies have also improved their applicability to specialized knowledge extraction15. Beyond general text processing, LLMs have shown promise in computational lexicography, supporting definition generation, bilingual example construction, and entry template filling16. In high-resource languages, LLMs can produce dictionary entries comparable to those written by experts. Yet unresolved challenges persist for low-resource, high-abstraction domains such as Pre-Qin philosophy, including semantic precision, contextual disambiguation, stylistic fidelity17,18 and consistent handling of receptive tasks such as polysemy resolution19,20. Moreover, most research has focused on general-purpose or learner-oriented lexicography, with limited exploration of abstract philosophical terminologies21.
To better foreground our contribution, Fig. 1 contrasts the traditional expert-driven lexicographic workflow with our LLM-assisted design across key stages. Traditional compilation relies on sequential, manual operations—from corpus selection and term identification to school attribution, definition writing/translation, and entry structuring—which ensures scholarly rigor but limits scalability. In contrast, our approach organizes the work into three modules (corpus and dataset construction; model training and evaluation; dictionary construction and visualization) and, within the technical implementation module, executes four parallel semantic tasks—term identification, school classification, definition generation, and context translation. Human oversight is retained at validation points, while model-assisted generation and structured aggregation support higher efficiency and consistency. This arrangement clarifies how the proposed framework differs from a linear pipeline and how it translates into concrete system outputs showcased in the dictionary interface.
Fig. 1
Comparison of traditional lexicographic workflow and the proposed LLM-based automated framework.
Motivated by the limitations of traditional methods and the opportunities enabled by recent advances in large language models, this study proposes a systematic framework for the automated compilation of a Pre-Qin philosophical terminology dictionary. The central objective is to construct an intelligent digital dictionary system that integrates term identification, school classification, definition generation, and context translation, while addressing the key challenges of semantic drift, ambiguity in school attribution, and the difficulty of receptive task generalization. Specifically, we compile a large-scale corpus of Pre-Qin texts and historical commentaries from platforms such as Chinese Text Project and Guoxue Dashi, and extract structured definitional data from authoritative sources such as the Encyclopedia of Chinese Philosophy. At the model level, we employ DeepSeek-R1-Distill-Qwen, Qwen3, and Llama3, trained within the LLaMA Factory framework with LoRA adaptation. Continued pretraining sensitizes the models to classical stylistic and philosophical nuances, while few-shot fine-tuning enhances task performance under low-resource conditions. Task-specific prompts then guide inference across the four semantic tasks. Collectively, these strategies converge in a unified approach that emphasizes both semantic precision and contextual fidelity. Building upon this framework, we construct a structured terminology database and an interactive Streamlit-based interface, thereby realizing a scalable paradigm for automated lexicography in the study of early Chinese philosophy.
Methods
This study aims to automate the compilation of a dictionary of philosophical terms from Pre-Qin thinkers; the overall technical architecture is designed around the capabilities of large language models (LLMs), as illustrated in Fig. 2. The technical workflow consists of three interconnected functional modules: (1) Corpus and Dataset Construction, (2) Pretraining, Fine-tuning and Evaluation Strategies, and (3) Dictionary Construction and Visualization. These three modules are interdependent and collectively constitute a complete pathway from data construction to knowledge generation and system output.
Fig. 2
Research Framework.
Corpus and dataset construction
The first component focuses on corpus collection and dataset construction. The acquisition of high-quality textual resources is essential for ensuring the semantic depth and epistemic coverage of the dictionary. Primary source materials were collected from authoritative digital platforms such as Chinese Text Project, Guoxue Dashi, and Shidian Ancient Books, encompassing 36 representative classical texts spanning diverse Pre-Qin philosophical traditions, including but not limited to Confucianism, Daoism, and other major schools of thought. These texts amount to approximately 1.6 million traditional Chinese characters. To prepare the corpus for computational processing, comprehensive data cleaning was conducted—standardizing formats, punctuation, and sentence segmentation—followed by the extraction of key textual elements for semantic modeling. This resulted in a well-structured source corpus that underpins both domain-adaptive training and dictionary compilation.
In addition to primary philosophical texts, this study incorporated annotated commentaries spanning the Han to Qing dynasties, sourced primarily from Shidian Ancient Books. A total of 37 exegetical works were integrated, including canonical annotations such as Wang Bi’s Commentary on the Daodejing and Zhu Xi’s Collected Annotations on the Four Books. These commentaries offer diachronic semantic interpretations and doctrinal exegesis, providing valuable semantic scaffolding for both model training and inference. By integrating both primary texts and subsequent annotations, the constructed corpus captures the linguistic, conceptual, and interpretive richness of pre-Qin philosophy.
For the construction of the fine-tuning dataset, the study takes the Encyclopedia of Chinese Philosophy (edited by Zhang Dainian) as its foundational reference. Using OCR (Optical Character Recognition) technology, entries pertaining to Pre-Qin philosophical terms and their definitions were extracted, encompassing both shared and school-specific terminology. After manual verification and structural processing, a preliminary dataset was compiled containing 430 data triples in the form of “philosophical term – affiliated school – textual instance.” Considering the need for representativeness and typicality in few-shot fine-tuning scenarios, 350 entries were selected to form a refined training dataset aimed at enhancing the model’s semantic reasoning performance in low-resource conditions. To enrich the contextual dimension of the dataset, supplementary definitional resources were incorporated from Hanyu Cidian Wang and Gushiwen Wang. The former aggregates multiple high-quality lexicographical tools such as the Comprehensive Dictionary of Chinese Words and the Encyclopedia of Chinese History, offering semantic annotations for some philosophical terms in historical contexts. The latter is a comprehensive classical literature platform that includes not only the original texts but also their modern Chinese translations, thereby facilitating the construction of a parallel corpus linking ancient and modern Chinese expressions of philosophical discourse. Ultimately, the constructed dataset for model fine-tuning consists of five structured fields: (1) Philosophical Term, (2) Affiliated School, (3) Contextual Excerpt, (4) Term Definition, and (5) Context Translation (into modern Chinese). This corpus serves as a semantically rich and culturally grounded resource for training models on tasks such as term identification, school classification, definition generation, and machine translation.
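The five-field entry structure can be sketched as a simple record type. This is a minimal illustration: the English field names are our own renderings of the five fields listed above, not a schema specified by the source.

```python
from dataclasses import dataclass, asdict

@dataclass
class DictionaryEntry:
    term: str         # (1) Philosophical Term, e.g. "道"
    school: str       # (2) Affiliated School, e.g. "Daoism"
    excerpt: str      # (3) Contextual Excerpt from a Pre-Qin source
    definition: str   # (4) Term Definition
    translation: str  # (5) Context Translation into modern Chinese

entry = DictionaryEntry(
    term="道",
    school="Daoism",
    excerpt="道可道，非常道。",
    definition="The cosmological origin and natural law underlying all things.",
    translation="A Dao that can be spoken of is not the eternal Dao.",
)
```

Structuring each sample this way makes it straightforward to derive the four task-specific training formats from a single underlying record.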
The selection of philosophical terms followed a rigorous multi-dimensional principle to ensure the dataset’s theoretical soundness and training efficacy. From the 430 term entries extracted from the Encyclopedia of Chinese Philosophy, this study selects 350 representative terms to form a refined dataset for few-shot fine-tuning, with 250 entries allocated for training and 100 for testing. The selection process places particular emphasis on three dimensions: philosophical representativeness, semantic complexity, and coverage within the corpus, to ensure that the dataset provides sufficient depth and discriminative power for the training tasks. First, in terms of philosophical representativeness, the study prioritizes core concepts that occupy foundational positions within various schools of thought. These terms not only appear frequently in the primary sources but also play fundamental roles in the theoretical systems of the respective traditions. For example, “仁” (benevolence) and “禮” (ritual) in Confucianism, “道” (Dao) and “無為” (non-action) in Daoism, and “法” (law) and “術” (technique) in Legalism are all considered cornerstone concepts within their schools’ ideological architectures22. Second, semantic ambiguity is a critical criterion for term selection. Pre-Qin philosophical terms are often shared across schools and reused in different contexts, exhibiting significant polysemy. For instance, the term “道” (Dao) in Daoist discourse typically refers to the cosmological origin or natural law; in Confucian contexts, it leans toward political ideals or moral principles; whereas in Legalist writings, it transforms into a designation for administrative methods or institutional norms23. Terms like these, which embody multiple semantic evolution paths, serve as ideal material for training disambiguation models, thereby enhancing the model’s ability to make accurate distinctions and generate contextually appropriate interpretations in complex linguistic environments. 
Additionally, coverage in the primary corpus and exegetical literature is another crucial dimension in the selection process. This study conducts a statistical analysis of the frequency of all 430 terms across the 36 Pre-Qin classics and 37 post-Qin exegetical texts. Terms that appear frequently across multiple sources and are accompanied by extensive annotations are prioritized, as they offer a broader array of training contexts and facilitate multi-layered semantic interpretation. To this end, the research team employs a combination of methods including frequency retrieval, semantic comparison, and expert manual evaluation to finalize the selection of the 350 terms used in the training set. These terms broadly span the major philosophical schools—Confucianism, Daoism, Legalism, Logicians, and Mohism—and exhibit cross-school semantic intersections and diverse evolutionary trajectories.
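The frequency-retrieval step described above can be illustrated with a minimal counting sketch; the terms and documents here are toy placeholders, not the actual 430-term candidate list or the 73-text corpus.

```python
from collections import Counter

def term_frequencies(terms, documents):
    """Count raw occurrences of each candidate term across a document collection."""
    counts = Counter()
    for doc in documents:
        for term in terms:
            counts[term] += doc.count(term)
    return counts

# Toy corpus: one Daoist and one Confucian excerpt.
docs = ["道可道，非常道。", "大學之道，在明明德。"]
freq = term_frequencies(["道", "德"], docs)
```

In practice such raw counts would only be the first filter, followed by the semantic comparison and expert evaluation stages described above.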
Notably, the high semantic ambiguity and doctrinal polysemy of many selected terms also make them particularly well-suited to the few-shot learning paradigm. In recent studies on large language models (LLMs), small but semantically rich training sets have been shown to be highly effective in guiding model behavior, especially when combined with prior domain adaptation24,25,26,27. While the manually annotated data in this study is relatively limited in scale, its impact is amplified through careful curation and the inherent generalization capabilities of LLMs. Pretrained on massive corpora, LLMs possess broad linguistic competence and latent domain knowledge. Continued pretraining on domain-specific corpora further sensitizes the models to the stylistic and conceptual patterns of Pre-Qin philosophy. Within this framework, the role of the 350 annotated samples is to serve as task-specific guidance rather than to instill primary knowledge from scratch.
In summary, the selection of philosophical terms is not merely a technical exercise based on frequency or textual coverage; rather, it is grounded in an in-depth understanding of the epistemic structure of philosophical concepts and the principles of semantic evolution in classical Chinese. The result is a refined training dataset that embodies both theoretical richness and semantic complexity, providing a solid foundation for the subsequent stages of model training. By coupling this carefully curated dataset with the representational power of large language models and adopting a training strategy that emphasizes task alignment and semantic generalization, this approach effectively mitigates the limitations posed by small sample sizes. Moreover, this term selection methodology offers a reproducible and theoretically grounded paradigm for term modeling and semantic inference in the digital humanities, particularly in the context of ancient Chinese textual studies.
Pretraining, fine-tuning and evaluation strategies
To tailor large language models to the specific demands of identifying and interpreting Pre-Qin philosophical terms, this study adopts a systematic training paradigm that integrates continued pretraining, instruction-based fine-tuning, and prompt-driven evaluation. The training process is built upon the LLaMA Factory open-source framework, incorporating Low-Rank Adaptation (LoRA) techniques to enable parameter-efficient fine-tuning28. This approach allows the core model parameters to remain fixed while updating only a set of low-rank matrices, thereby reducing computational overhead without compromising the model’s adaptability to domain-specific semantics. In designing the training strategy, we incorporate both continued pretraining on domain corpora and the integration of commentary-enhanced resources, with comparative evaluation reported in Section “Results”.
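LoRA's parameter efficiency comes from freezing the base weight W and training only a low-rank update. The following toy sketch (pure Python, illustrative dimensions, not the actual LLaMA Factory configuration) shows the forward computation y = xW + (alpha/r)(xA)B:

```python
def matmul(A, B):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_forward(x, W, A, B, alpha, r):
    """y = xW + (alpha / r) * (xA)B.

    W is the frozen base weight; only the low-rank factors A (d_in x r)
    and B (r x d_out) are trained, which is what keeps LoRA cheap.
    """
    base = matmul(x, W)
    delta = matmul(matmul(x, A), B)
    scale = alpha / r
    return [[b + scale * d for b, d in zip(brow, drow)]
            for brow, drow in zip(base, delta)]

# Rank-1 update applied over a 2x2 frozen identity weight.
y = lora_forward([[1, 2]], [[1, 0], [0, 1]], [[1], [0]], [[1, 1]], alpha=1, r=1)
```

Because only A and B receive gradients, the number of trainable parameters scales with the rank r rather than with the full weight dimensions.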
In the continued pretraining stage, the objective is to enhance the model’s semantic representation capabilities regarding the stylistic features of Pre-Qin texts, the conceptual framework of terminology, and the structure of philosophical argumentation, through sustained training on domain-specific corpora. The training corpus primarily includes two categories of texts: (1) the original writings of the Pre-Qin philosophical schools, such as The Analects, Mozi, and Zhuangzi; and (2) annotated commentaries from later dynasties, such as Annotations and Commentaries on The Analects and The Correct Meaning of the Book of Documents. All texts are sourced from authoritative platforms for the digitization of ancient Chinese books, and are uniformly cleaned and formatted into paragraph-based txt files. To enhance the model’s ability to distinguish between text sources and styles, each paragraph is preceded by lightweight source indicators such as “[Original]” or “[Commentary]” which function as soft prompts. These indicators serve to guide the model in perceiving textual style and register differences during training.
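The source-indicator tagging described above can be sketched as a simple preprocessing step (the marker strings follow the bracketed forms given in the text; the mapping function itself is our illustration):

```python
def tag_paragraphs(paragraphs, source):
    """Prefix each paragraph with a lightweight source indicator (soft prompt)."""
    markers = {"original": "[Original]", "commentary": "[Commentary]"}
    marker = markers[source]
    return [f"{marker} {p}" for p in paragraphs]

tagged = tag_paragraphs(["子曰：學而時習之，不亦說乎？"], "original")
```

Applied uniformly during corpus preparation, these prefixes give the model a consistent cue for distinguishing primary texts from later exegesis.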
During the instruction-based fine-tuning stage, the model is adapted to domain-specific tasks using 350 manually constructed annotated samples, which cover four major tasks: term identification, philosophical school classification, definition generation, and context translation. The first two tasks fall under natural language understanding (NLU), while the latter two belong to natural language generation (NLG), thereby forming a compact and semantically coherent composite task system. All samples used in the fine-tuning phase are constructed in an “instruction–response” format, and output structures are constrained to ensure that the model can learn mappings among task structure, linguistic style, and pragmatic intent, even under limited supervision. In order to maintain consistency, the fine-tuning stage followed a single protocol across configurations. Comparative outcomes are presented in Section “Results”. The core training parameters for both pretraining and fine-tuning stages are summarized in Table 1 below:
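The "instruction-response" sample format can be sketched as follows. The instruction/input/output field names are an assumption on our part (they match the Alpaca-style datasets commonly consumed by LLaMA Factory), and the example content is illustrative:

```python
import json

def make_sample(instruction, input_text, output):
    """Build one supervised sample in instruction-response form."""
    return {"instruction": instruction, "input": input_text, "output": output}

# Illustrative school-classification sample.
sample = make_sample(
    "Identify the philosophical school of the given term from its context.",
    "Term: 無為; Context: 為無為，則無不治。",
    "Daoism",
)
serialized = json.dumps(sample, ensure_ascii=False)
```

Keeping all four tasks in one shared format lets the model learn the mapping between task structure and expected output style from a small number of samples.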
After completing model fine-tuning, the study proceeds to the prompt-driven evaluation stage, where the model’s generalization ability is tested under zero-shot and few-shot settings29(see Section “Results”, Table 2 for the prompt template). In the zero-shot scenario, the model receives only the standard task prompt without any example for guidance; in the few-shot setting, a typical example of the task is embedded within the input prompt, allowing the model to infer task structure and pragmatic patterns from the contextual cue. During this stage, model parameters are no longer updated—performance depends solely on prompt design—aiming to evaluate the model’s real-time comprehension and generation capabilities when applied to the task of interpreting Pre-Qin philosophical terms.
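The distinction between the two settings reduces to whether a worked example is embedded in the prompt. A minimal prompt builder might look like this (the wording and layout are illustrative, not the exact template in Table 2):

```python
def build_prompt(task_instruction, query, example=None):
    """Zero-shot: instruction + query only. Few-shot: embed one worked example."""
    parts = [task_instruction]
    if example is not None:
        parts.append("Example input: " + example["input"])
        parts.append("Example output: " + example["output"])
    parts.append("Input: " + query)
    parts.append("Output:")
    return "\n".join(parts)

zero_shot = build_prompt("Define the highlighted term in its context.",
                         "Term: 道; Context: 道可道，非常道。")
few_shot = build_prompt("Define the highlighted term in its context.",
                        "Term: 道; Context: 道可道，非常道。",
                        example={"input": "Term: 仁; Context: 克己復禮為仁。",
                                 "output": "The Confucian virtue of benevolence."})
```

Since model parameters are frozen at this stage, any performance gap between the two prompts isolates the effect of in-context guidance alone.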
Dictionary construction and visualization
The final component of the framework focuses on structured dictionary compilation and interactive visualization. After obtaining model-generated outputs from the fine-tuned models, a standardized term entry template was designed to organize the extracted data into five key fields: philosophical term, affiliated school, contextual excerpt, term definition, and context translation. The structured entries were aggregated into a term-level database to ensure persistent storage, semantic consistency, and query efficiency. A prototype system was developed using the Streamlit platform, supporting functionalities such as term lookup, school-based filtering, and bilingual display of entries. The visual interface enables dynamic querying and browsing of dictionary content, integrating automated extraction, structured presentation, and human-centered interaction within a unified technical architecture.
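The lookup and school-based filtering behavior can be sketched over the five-field entries as below; the field names and records are illustrative, and the actual Streamlit implementation is not reproduced here.

```python
def filter_entries(entries, term=None, school=None):
    """Exact-match term lookup with optional school-based filtering."""
    result = entries
    if term is not None:
        result = [e for e in result if e["term"] == term]
    if school is not None:
        result = [e for e in result if e["school"] == school]
    return result

# Toy term-level database; real entries carry all five fields.
database = [
    {"term": "道", "school": "Daoism"},
    {"term": "道", "school": "Confucianism"},
    {"term": "仁", "school": "Confucianism"},
]
```

A Streamlit front end would simply bind these filters to a text input and a school selector and render the matching entries.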
Model selection and evaluation metric design
In this study, we carefully considered the choice of base models for domain-adaptive pretraining and fine-tuning, aiming to ensure both the linguistic relevance and the generalization potential of our methodology. The selected large language models (LLMs) needed to meet three core criteria: (1) strong Chinese language modeling capabilities, (2) robust multi-task generalization and semantic reasoning abilities, and (3) compatibility with parameter-efficient fine-tuning methods.
Based on these criteria, we selected three architecturally advanced and widely adopted open-source models as the foundational backbones for our experiments: DeepSeek-R1-Distill-Qwen, Qwen3, and Llama330,31,32. These models collectively represent the current state of open-source language modeling in Chinese and multilingual contexts, offering a comprehensive basis for evaluating the adaptability of our proposed method across different architectures and training paradigms.
DeepSeek-R1-Distill-Qwen is a distilled variant of the DeepSeek-R1 series, optimized for resource-efficient deployment without compromising performance. It inherits the language comprehension capabilities of the Qwen architecture while significantly compressing model parameters through knowledge distillation. With strong Chinese semantic modeling performance, it serves as an ideal candidate for rapid adaptation in ancient text processing tasks, especially under constrained computational environments.
Qwen3, the latest generation of the Qwen series, introduces architectural innovations such as the “thinking budget” mechanism for enhanced reasoning over semantically complex inputs. Pre-trained on a massive corpus with a substantial portion of high-quality Chinese texts, Qwen3 demonstrates superior performance in both long-context understanding and domain-specific knowledge integration. Its 128K-token context window further supports processing of extended philosophical passages rich in citations and annotations, making it particularly suitable for our study’s goals.
Llama3, developed by Meta AI, is a multilingual foundation model known for its cross-lingual reasoning capability and scalable fine-tuning performance. While not exclusively optimized for Chinese, Llama3 exhibits competitive results in Chinese NLP tasks due to its robust architecture and extensive pretraining. Its inclusion in this study serves a dual purpose: validating the generalizability of our workflow across non-Chinese-specialized models and testing its adaptability to the domain-specific tasks of classical Chinese philosophy definition.
To comprehensively evaluate the performance of the selected models on tasks involving Pre-Qin philosophical terms, this study designs a multi-dimensional evaluation metric system aligned with the characteristics of the four subtasks.
For the philosophical term identification task, which essentially belongs to the domain of sequence labeling, the objective is to detect whether predefined terms are present in the input text and to identify their exact boundaries. Accordingly, standard evaluation metrics including Precision, Recall, and F1-score are adopted. Precision measures the proportion of correctly predicted terms, recall assesses the proportion of true terms that are successfully identified, and the F1-score provides a harmonic mean of the two, representing the overall effectiveness in term identification and localization. This metric system is particularly suitable for high-density text fragments with clear term boundaries and explicit semantic annotations.
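These three metrics can be computed directly over predicted and gold term spans; a minimal sketch assuming exact boundary matches:

```python
def prf1(predicted, gold):
    """Exact-match precision, recall, and F1 over term spans."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Spans as (start, end, surface form): one hit, one false positive, one miss.
p, r, f = prf1({(0, 1, "道"), (5, 6, "德")}, {(0, 1, "道"), (8, 9, "仁")})
```

Representing spans as (start, end, form) tuples makes the boundary requirement explicit: a term found at the wrong offset counts as both a false positive and a false negative.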
For the school classification task, where the model must infer the philosophical school associated with a given term based on its context, this is treated as a multi-class classification problem. The study uses accuracy as the basic performance metric and introduces Macro-F1 to mitigate the effects of class imbalance. This is especially important in cases where the number of terms from minor schools is significantly lower than those from major schools, making Macro-F1 a fairer reflection of performance across all categories.
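Macro-F1 averages per-class F1 without weighting by class size, which is exactly why it rewards balanced performance on minor schools; a minimal sketch:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so minor schools count equally."""
    labels = set(y_true) | set(y_pred)
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

y_true = ["Confucianism", "Confucianism", "Mohism"]
y_pred = ["Confucianism", "Confucianism", "Confucianism"]
# Accuracy is 2/3, but the missed Mohism class drags macro-F1 down to 0.4.
score = macro_f1(y_true, y_pred)
```

A classifier that ignores a rare school entirely can still post high accuracy, but its macro-F1 collapses, which is the imbalance effect the metric is chosen to expose.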
In the contextual term definition generation task, the model is required to produce semantically appropriate explanations for philosophical terms based on their surrounding context. As this constitutes an open-ended text generation task, the study employs automated metrics such as BLEU (to evaluate n-gram overlaps) and ROUGE-L (to assess the longest common subsequence) to measure lexical and structural similarities between the generated and reference definitions. Considering the highly abstract and polysemous nature of philosophical terms, the study also incorporates human expert evaluations as a complementary metric. Three researchers with backgrounds in classical philosophy and linguistics independently score the model outputs on a five-point scale based on the following criteria: semantic appropriateness (whether the definition accurately reflects the meaning in context), linguistic fluency (the naturalness of the expression), and school consistency (whether the style and content align with the term’s philosophical tradition). The final score is computed as a weighted average of these three dimensions.
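The aggregation of expert ratings can be sketched as a weighted average over the three dimensions. The weights below are illustrative placeholders; the text specifies a weighted average but not the exact weighting.

```python
def expert_score(appropriateness, fluency, consistency, weights=(0.5, 0.25, 0.25)):
    """Weighted average of three five-point expert ratings.

    Dimensions: semantic appropriateness, linguistic fluency, school consistency.
    The weights are illustrative, not the paper's actual values.
    """
    dims = (appropriateness, fluency, consistency)
    return sum(w * s for w, s in zip(weights, dims))

score = expert_score(4, 5, 3)
```

In practice each of the three raters would produce such a score independently, and the per-rater scores would then be averaged into the final figure.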
For the context translation task, the model translates ancient Chinese sentences containing philosophical terms into modern Chinese. As a form of constrained language generation, the task places equal emphasis on literal translation accuracy and the semantic and stylistic fidelity of the rendered text. In addition to BLEU and ROUGE-L, the ChrF metric (Character-F Score) is introduced to evaluate fine-grained matches at the morpheme level, thereby capturing nuanced lexical correspondences between the original and translated versions.
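A simplified version of ChrF can be sketched as follows: character n-gram precision and recall are combined with a recall bias beta. The standard metric uses n-grams up to order 6 with beta = 2 and additional normalization; this sketch truncates at bigrams for brevity.

```python
from collections import Counter

def chrf(hypothesis, reference, max_n=2, beta=2.0):
    """Simplified ChrF: character n-gram F-score with recall weighted by beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = Counter(hypothesis[i:i + n] for i in range(len(hypothesis) - n + 1))
        ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

Because it operates on characters rather than words, ChrF needs no segmentation, which suits classical Chinese, where word boundaries are themselves contested.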
Results
Prompt design example
To contextualize the prompting conditions used in the evaluations, we provide the few-shot prompt template employed for the contextual term definition task and summarize the corresponding prompt sensitivity setup. The prompt design clarifies input–output formatting and constrains generation behavior for better task alignment. The exact template is shown in Table 2.
In the results that follow, zero-shot and few-shot settings correspond to the absence or presence of an embedded example of this format, respectively; performance differences between these settings reflect prompt sensitivity under otherwise identical model configurations.
Ablation study on pretraining strategies
To systematically examine the contribution of different pretraining strategies within our framework, we conducted controlled ablation experiments that focused on two critical dimensions. First, we investigated the necessity of domain-adaptive continued pretraining prior to fine-tuning. Second, we examined the impact of incorporating annotated commentarial corpora into the pretraining data mixture.
All ablation experiments were performed using the Qwen3-14B model, as it demonstrated the best overall performance in the initial benchmark evaluation. This selection ensures experimental consistency and avoids potential variability introduced by architectural differences across models. We compared three configurations: direct fine-tuning only, continued pretraining on original texts followed by fine-tuning, and continued pretraining on both original texts and commentaries followed by fine-tuning. The experimental results are summarized in Table 3 below:
The comparison between fine-tuning only and continued pretraining on original texts reveals substantial performance improvements across all four tasks. Term Identification benefits significantly from continued pretraining, with precision, recall, and F1-score all showing notable gains of approximately 6–7 percentage points. School Classification demonstrates even more pronounced improvements, achieving substantial accuracy gains that highlight the importance of domain-specific knowledge for philosophical school recognition. In the Definition Generation task, continued pretraining leads to modest yet consistent improvements across both automatic metrics (BLEU-4 and ROUGE-L) and human evaluation scores, indicating a slight enhancement in the model’s ability to produce coherent and accurate definitions. Context Translation also shows consistent improvements across all evaluation metrics, though the gains are more modest compared to other tasks. These results demonstrate that domain-adaptive continued pretraining provides essential knowledge that cannot be acquired through downstream fine-tuning alone, particularly for tasks requiring deep understanding of philosophical concepts and terminology.
The inclusion of commentarial corpora in the pretraining mixture provides additional but task-dependent benefits. Definition Generation exhibits the most substantial gains from commentary integration, with meaningful improvements across both automatic evaluation metrics and human assessments, indicating that scholarly commentaries enhance the model’s ability to generate nuanced and contextually appropriate definitions. Context Translation also shows consistent improvements across all metrics, suggesting that commentarial texts provide valuable contextual knowledge for cross-lingual understanding. School Classification demonstrates modest but meaningful improvements in both accuracy and balanced performance across different philosophical schools. In contrast, Term Identification shows minimal additional gains from commentary inclusion, with only marginal improvements in precision while maintaining identical performance on other metrics. These patterns suggest that commentarial corpora primarily benefit tasks requiring deeper semantic understanding and contextual interpretation, while providing limited additional value for entity recognition tasks that rely more heavily on surface-level linguistic patterns.
The ablation study results establish a clear hierarchy of pretraining strategy effectiveness, with continued pretraining on original texts providing the most substantial improvements over fine-tuning only, and commentary integration offering additional task-specific benefits. The consistent performance gains across diverse evaluation metrics and tasks demonstrate the robustness of domain-adaptive pretraining for classical Chinese philosophical texts. Notably, improvements are more pronounced in Definition Generation and School Classification, which involve conceptually complex reasoning grounded in historical and intellectual traditions. The varying impact of commentarial integration across tasks further indicates that the utility of additional textual sources is modulated by task-specific cognitive demands. These findings underscore the potential of targeted pretraining pipelines to enhance linguistic and conceptual modeling for domain-specific language understanding.
Multi-model performance benchmark under zero-shot and few-shot conditions
While the ablation study demonstrates the relative efficacy of pretraining strategies—particularly the value of continued pretraining and commentary integration—a critical remaining question is how these models generalize to real-world usage scenarios. To this end, we conducted a comprehensive benchmark evaluating the three final model variants (trained on original texts + commentaries with full fine-tuning) across all four core tasks under both zero-shot and few-shot conditions. Specifically, models were assessed on term identification, philosophical school classification, definition generation, and context translation using eleven metrics, with results averaged over multiple runs for stability (Table 4). This analysis reveals their robustness and adaptability in practical applications.
The comprehensive benchmark evaluation reveals a clear performance hierarchy across the three model families tested. The Qwen3 series consistently demonstrates superior performance across all four tasks and eleven evaluation metrics, with Qwen3-14B achieving the highest scores in most categories under both prompting conditions. Remarkably, even the smaller Qwen3-8B model frequently outperforms larger models from other series, suggesting that architectural innovations and specialized training methodologies significantly impact performance beyond mere parameter scaling. The DeepSeek-R1-Distill-Qwen models occupy a middle tier, showing competitive but consistently lower performance compared to their Qwen3 counterparts across all tasks. The Llama series exhibits the most constrained performance profile, with Llama-3.1-8B-Instruct demonstrating moderate capabilities while Llama-3.2-1B-Instruct shows substantial limitations across all evaluation dimensions. This hierarchy becomes particularly pronounced in generation tasks, where the performance gaps between model families are most evident, highlighting the varying degrees of sophistication in handling classical Chinese philosophical content.
The performance disparities across different tasks reveal specific strengths and limitations of each model family. In term identification, Qwen3 models excel at distinguishing philosophical usage from literal usage, as demonstrated by their ability to correctly handle polysemous terms like “道” (Dao) in different contexts. The DeepSeek models show particular weakness in this disambiguation task, often misclassifying literal usage as philosophical, especially in zero-shot settings where contextual guidance is limited. Llama3 models struggle even more significantly, with performance gaps becoming pronounced when dealing with subtle semantic distinctions. School classification amplifies these differences, as this task demands sophisticated contextual understanding to attribute terms like “無為” (Non-action) to correct philosophical schools based on surrounding discourse patterns. Qwen3 models demonstrate superior contextual awareness, while DeepSeek and Llama models show increasing difficulty with fine-grained semantic discrimination. Definition generation presents the most dramatic performance gaps, where Qwen3-14B produces philosophically appropriate and stylistically authentic definitions, contrasting sharply with the more generic explanations from competing models. Context translation further emphasizes these disparities, with Qwen3 models achieving superior semantic fidelity and stylistic adaptation, while other models produce more literal and less contextually appropriate translations.
The transition from zero-shot to few-shot prompting yields consistently positive but model-dependent improvements across all evaluation scenarios, with qualitative differences that illuminate each model’s capacity for contextual learning. For the Qwen3 series, few-shot prompting provides substantial enhancements, with F1 scores in understanding tasks improving by approximately 0.5–0.6 percentage points and generation metrics showing gains of 0.2–0.4 points. These improvements are particularly notable in complex tasks like definition generation, where Qwen3-14B’s human evaluation score increases from 3.8 to 4.0, representing a meaningful qualitative enhancement. For example, when defining “性” (xing), Qwen3-14B with few-shot prompting produces the philosophically precise explanation “refers to the innate essential attributes of a person,” demonstrating sophisticated comprehension of Mencian philosophy, while its zero-shot counterpart offers less contextually grounded definitions.
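The few-shot condition can be illustrated with a minimal prompt-assembly sketch. The example term–definition pairs and the prompt wording below are hypothetical; the paper's actual prompt templates are not reproduced here.

```python
# Hypothetical in-context demonstrations for definition generation.
FEW_SHOT_EXAMPLES = [
    ("仁", "refers to benevolence, the core Confucian virtue of humaneness"),
    ("道", "refers to the ontological Dao, the source and pattern of all things"),
]

def build_prompt(term, context, examples=FEW_SHOT_EXAMPLES):
    """Assemble a few-shot prompt: instruction, demonstrations, then the query.
    Passing examples=[] recovers the zero-shot condition."""
    lines = ["Define the philosophical term in its classical context."]
    for ex_term, ex_def in examples:  # in-context demonstrations
        lines.append(f"Term: {ex_term}\nDefinition: {ex_def}")
    lines.append(f"Term: {term}\nContext: {context}\nDefinition:")
    return "\n\n".join(lines)
```

Because the only difference between the two conditions is the presence of the demonstration block, any score gap between them isolates the model's capacity for in-context learning rather than differences in task framing.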
The DeepSeek models demonstrate similar responsiveness to few-shot examples, though with improvements of slightly smaller magnitude, suggesting good but not optimal utilization of contextual information. In term identification tasks, DeepSeek models show improved disambiguation of polysemous terms like “道” (Dao) when provided with examples, reducing misclassification rates in complex contexts. However, they still struggle more than Qwen3 models with subtle philosophical distinctions even in few-shot settings. The improvement patterns are most consistent in term identification and school classification tasks, where the provision of examples helps models better understand task structures and philosophical nuances.
The Llama models, particularly Llama-3.2-1B, show more modest gains from few-shot prompting, with improvements typically ranging from 0.2 to 0.4 points across metrics. This limitation becomes evident in context translation tasks, where even with examples, Llama models tend to produce more literal translations like “People with different ideas cannot plan things together” for “道不同, 不相為謀,” lacking the nuanced understanding demonstrated by Qwen3’s rendering of “People with different aspirations cannot make plans together.” These qualitative differences suggest inherent limitations in the Llama3 models’ ability to effectively leverage contextual examples for complex philosophical reasoning. The consistent improvement across all model families confirms that few-shot prompting serves as an effective domain adaptation strategy, though the absolute performance ceiling remains constrained by fundamental model capabilities and architectural sophistication.
Qualitative case study results
To complement the quantitative results, we present three representative qualitative cases arranged from easier to harder disambiguation: a Legalist sentence with one philosophical term (“法不阿貴, 繩不撓曲。”, Han Feizi), a Confucian sentence with three terms (“學而時習之, 不亦說乎?”, The Analects), and a Daoist sentence with three occurrences of the character “道” but mixed roles—two philosophical nouns and one verbal usage (“道可道, 非常道。”, Daodejing). For each case we compare four configurations—Base (no fine-tuning), FT (fine-tuning only), CPT + FT (continued pretraining then fine-tuning), and Full (CPT + FT with few-shot prompting)—over four tasks that mirror our dictionary schema: term identification, school classification, definition generation, and context translation. Unless otherwise stated, all casewise outputs are generated with the same backbone, Qwen3-14B, under the respective configurations. Casewise outputs are summarized in Tables 5–7.
The first case (the Legalist sentence) probes whether the framework can stably recover a single, high-salience concept and propagate it to downstream fields. In Base, the model identifies “法” but often drifts to a generic “method/law” sense, yielding a literal translation that underspecifies the Legalist doctrine of public, codified, and equally applied standards. FT regularizes task formatting and school attribution (Legalism) while still under-articulating equal application. With CPT + FT, the representation of “法” shifts from a generic instrument to governance law that does not defer to rank, and the translation aligns with the simile of the inked cord. Full mainly improves terminological consistency and stylistic polish. Across the four tasks, the Legalist example shows that even in low-complexity settings the framework consistently populates all dictionary fields.
In the second case (the Confucian sentence), the challenge is simultaneous extraction and contextualization of “學/時習/說” (說 read yuè, “joy/delight”). Base can surface “學” and “時習” but routinely misreads “說” as “to speak”, which cascades into a flattened definition and a literal context translation. FT improves segmentation and school labeling (Confucianism), yet definitions remain largely procedural (“learn/review”) rather than value-laden. CPT + FT introduces a Confucian moral-cultivation orientation, rendering “學” as cultivation-oriented study, “時習” as timely practice for internalization, and “說” as joy arising from learning; the context translation correspondingly captures rhetorical stance and scholarly register. Full adds stylistic standardization (e.g., stable rhetorical question form, consistent academic phrasing). This case demonstrates that the framework scales from single-term recovery to multi-term, context-dependent sense differentiation, filling the four dictionary fields with mutually coherent content.
The third case (the Daoist sentence) targets the classic ambiguity of “道”: two philosophical nouns (道¹/常道 = 道³) vs. a verbal “to speak” (可道²). Base tends to tag all three tokens as terms and paraphrases “道” as “principle/rule,” losing the core ontological reading and the contrast encoded by “非常”. FT begins to suppress the verb reading, though inconsistently, and still wavers between “principle” and “norm.” With CPT + FT, the model correctly excludes “道²” from term identification, stabilizes “道¹/常道” as ontological Dao / Constant Dao, and produces a context translation that preserves the doctrinal contrast (“The Dao that can be spoken is not the Constant Dao”). Full strengthens term-role consistency and diction alignment with received exegesis. This case shows that the framework handles fine-grained, token-level disambiguation—a prerequisite for trustworthy lexicographic automation in classics where homographs are pervasive.
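The token-level nature of this disambiguation can be made concrete with a small sketch of the gold annotation for the sentence: each occurrence of “道” is indexed by character position, and only the two nominal uses count as philosophical terms. The data structure and helper function are assumptions for exposition, not the paper's actual annotation schema.

```python
# The Daodejing sentence from the third case (punctuation spacing normalized).
SENTENCE = "道可道,非常道。"

# Character offsets of every occurrence of 道 in the sentence.
dao_positions = [i for i, ch in enumerate(SENTENCE) if ch == "道"]

# Gold labels: True = philosophical noun (term), False = verbal "to speak".
GOLD = {
    dao_positions[0]: True,   # 道¹: ontological Dao
    dao_positions[1]: False,  # 道²: verb "to speak" (可道), excluded
    dao_positions[2]: True,   # 道³: within 常道, Constant Dao
}

def extract_terms(predictions):
    """Keep only the positions predicted as philosophical terms."""
    return sorted(pos for pos, is_term in predictions.items() if is_term)
```

A model output can then be scored by comparing `extract_terms(predicted)` against `extract_terms(GOLD)`: a Base-like model that tags all three tokens over-predicts by exactly the verbal 道², which is the error this case is designed to expose.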
Taken together, the three cases establish a graded validation of the proposed framework. Moving from Base to FT, the system gains task conformance (well-formed outputs, better segmentation and labels) but remains semantically shallow. The transition from FT to CPT + FT delivers the qualitative leap: domain-shaped representations enable school-faithful, context-aware definitions and translations.