Credit: Google DeepMind from Pexels
Seoul National University College of Engineering announced that a research team led by Professor Hyun Oh Song from the Department of Computer Science and Engineering has developed a new AI technology called KVzip that intelligently compresses the conversation memory of large language model (LLM)-based chatbots used in long-context tasks such as extended dialog and document summarization. The study is published on the arXiv preprint server.
The term conversation memory refers to the temporary storage of sentences, questions, and responses that a chatbot maintains during interaction, which it uses to generate contextually coherent replies. Using KVzip, a chatbot can compress this memory by eliminating redundant or unnecessary information that is not essential for reconstructing context. The technique allows the chatbot to retain accuracy while reducing memory size and speeding up response generation—a major step forward in efficient, scalable AI dialog systems.
Modern LLM chatbots perform tasks such as dialog, coding, and question answering using enormous contexts that can span hundreds or even thousands of pages. As conversations grow longer, however, the accumulated conversation memory increases computational cost and slows down response time.
To address this issue, researchers have developed memory compression methods that enable chatbots to retain only essential contextual information, rather than storing every detail of previous exchanges. However, most existing compression techniques are query-dependent, meaning they optimize memory only for the current question. When a new or follow-up question is asked, the chatbot’s performance typically deteriorates significantly.
To overcome this limitation, Professor Song’s team proposed KVzip, a novel method that effectively reduces the size of the conversation memory in long-context dialogs while maintaining the same level of accuracy. KVzip performs compression by retaining only the information necessary for context reconstruction, allowing the chatbot to handle multiple future queries without the need to recompress its memory each time.
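To make the idea concrete, the following is a minimal sketch, not the team's released code, of query-agnostic KV cache pruning in the spirit of KVzip. The scoring rule, variable names, and shapes are illustrative assumptions; the key point it demonstrates is that entries are ranked by their usefulness for reconstructing the original context, so the same compressed cache can then serve any later question.

```python
# Toy sketch of query-agnostic KV cache pruning (illustrative, not KVzip's code).
# Entries are scored by the attention they receive from "reconstruct the context"
# queries, then only the top fraction is kept, in positional order.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prune_kv_cache(keys, values, recon_queries, keep_ratio=0.3):
    """keys, values: (n_ctx, d) cached vectors; recon_queries: (n_q, d) queries
    from a context-reconstruction pass; keep_ratio: fraction of entries to retain."""
    d = keys.shape[1]
    attn = softmax(recon_queries @ keys.T / np.sqrt(d), axis=-1)   # (n_q, n_ctx)
    importance = attn.max(axis=0)            # strongest attention any reconstruction query pays
    n_keep = max(1, int(keep_ratio * len(keys)))
    keep = np.sort(np.argsort(-importance)[:n_keep])   # keep top entries, preserve order
    return keys[keep], values[keep], keep

# Example: a 1,000-entry cache compressed to 300 entries that can be reused
# for any later question without recompressing.
rng = np.random.default_rng(0)
K, V = rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))
Q_recon = rng.normal(size=(32, 64))
K_small, V_small, kept = prune_kv_cache(K, V, Q_recon, keep_ratio=0.3)
print(K_small.shape)  # (300, 64)
```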
In a wide range of tasks—including question answering, retrieval, reasoning, and code understanding—KVzip achieved 3–4× memory reduction and approximately 2× faster response times, all without any loss in accuracy. The technique also demonstrated scalability to extremely long contexts of up to 170,000 tokens using major open-source LLMs such as Llama 3.1, Qwen 2.5, and Gemma 3.
Moreover, KVzip maintained stable response quality across multiple rounds of diverse follow-up questions, overcoming the generalization limits of prior memory compression methods. Notably, the technology has been integrated into NVIDIA’s open-source KV cache compression library, KVPress, making it readily accessible for practical deployment.
In the near future, KVzip is expected to be widely adopted in enterprise-scale LLM systems, including retrieval-augmented generation (RAG) pipelines and personalized chatbot services. By reducing memory usage by 3–4× and shortening response latency by about 2×, the method allows servers to handle more concurrent users and longer conversations while significantly lowering operating costs.
In long conversations, chatbots generate large "conversation memories" (KV caches). KVzip selectively retains only the information useful for any future question, autonomously verifying and compressing its memory for efficient reuse. Credit: Seoul National University College of Engineering / Hyun Oh Song's Lab
Additionally, because the same compressed memory can be reused across different query types, there is no need for recompression at each question, and no risk of accuracy degradation in subsequent exchanges. These properties make KVzip particularly advantageous for mobile and edge environments, where computational and memory resources are limited, enabling stable long-context personalization capabilities even on-device.
Professor Hyun Oh Song, who advised the research, stated, “KVzip is significant in that it enables reusable compressed memory that retains only the most essential information, even in LLM agents requiring long contextual understanding.”
Dr. Jang-Hyun Kim, the lead contributor to the project, stated, "KVzip can be seamlessly applied to real-world LLM applications and on-device systems to ensure consistent quality and improved speed for long-context interactions."
The first author, Dr. Jang-Hyun Kim, will join the AI/ML Foundation Models team at Apple as a machine learning researcher.
The Machine Learning Laboratory led by Professor Song also had two additional papers accepted as poster presentations at NeurIPS 2025 and one paper published in the journal Transactions on Machine Learning Research (TMLR).
In the NeurIPS 2025 paper titled “Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment,” the team presented a theoretical analysis of optimal bitwidth allocation across layers in the quantization of large language models and introduced “Q-Palette,” a set of fractional-bit quantizers that realize this optimal allocation.
The method achieved a 36% improvement in inference speed compared to existing quantization approaches at equivalent performance levels.
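A rough intuition for the bit-allocation problem can be sketched as follows. This is an illustrative greedy allocation under a simple error model, not the paper's analytical solution or its quantizer designs; the palette of bitwidths, the sensitivity values, and the error rate assumed in the comments are all stand-ins.

```python
# Illustrative bit-budget allocation across layers, in the spirit of Q-Palette.
# Assumes (as a stand-in) that a layer's quantization error shrinks roughly as
# sensitivity * 4**(-bits); extra bits are spent where they reduce error most.

def allocate_bits(sensitivities, palette=(2.0, 2.5, 3.0, 3.5, 4.0), avg_budget=3.0):
    n = len(sensitivities)
    bits = [min(palette)] * n                    # start every layer at the cheapest option
    budget = avg_budget * n - sum(bits)          # remaining bits to distribute

    def error(s, b):
        return s * 4.0 ** (-b)

    steps = sorted(set(palette))
    while True:
        best, best_gain = None, 0.0
        for i, (s, b) in enumerate(zip(sensitivities, bits)):
            higher = [p for p in steps if p > b]
            if not higher:
                continue
            nb = higher[0]
            cost = nb - b
            if cost > budget:
                continue
            gain = (error(s, b) - error(s, nb)) / cost   # error reduction per bit spent
            if gain > best_gain:
                best, best_gain, best_nb, best_cost = i, gain, nb, cost
        if best is None:
            break
        bits[best], budget = best_nb, budget - best_cost
    return bits

# Example: more sensitive layers receive more bits while the average stays
# within the 3-bit budget.
print(allocate_bits([5.0, 1.0, 0.2, 3.0, 0.1, 2.0]))
```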
Another NeurIPS 2025 paper, “Learning to Better Search with Language Models via Guided Reinforced Self-Training,” proposed Guided-ReST, a new reinforcement learning algorithm that enables large language models to autonomously learn improved reasoning and search strategies. On the challenging Countdown reasoning benchmark, Guided-ReST improved accuracy by 10% and reasoning efficiency by 50%.
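The general shape of such a reinforced self-training loop can be sketched as below. The task, model interface, and "fine-tuning" step are toy assumptions made so the loop runs end to end; only the outer sample-verify-train cycle reflects the generic ReST recipe the paper builds on, not its specific guidance mechanism.

```python
# Minimal sketch of a reinforced self-training loop (generic ReST-style cycle,
# not the Guided-ReST algorithm itself). Candidate solutions are sampled,
# verified, and the verified ones are used to update the model.
import random

def sample_solutions(model, problem, n=8):
    """Draw n candidate solutions from the current policy."""
    return [model(problem) for _ in range(n)]

def verify(problem, solution):
    """Countdown-style check: do the chosen numbers sum to the target?"""
    numbers, target = problem
    return sum(solution) == target and all(x in numbers for x in solution)

def self_training_round(model, problems, finetune):
    """One round: collect verified solutions, then train on them."""
    dataset = []
    for problem in problems:
        for sol in sample_solutions(model, problem):
            if verify(problem, sol):
                dataset.append((problem, sol))
    return finetune(model, dataset)

# Toy instantiation: the "model" guesses number subsets, and "fine-tuning"
# simply memorizes verified answers for reuse.
memory = {}
def toy_model(problem):
    numbers, target = problem
    if problem in memory:
        return memory[problem]
    k = random.randint(1, len(numbers))
    return tuple(random.sample(numbers, k))

def toy_finetune(model, dataset):
    for problem, sol in dataset:
        memory[problem] = sol
    return model

problems = [((2, 3, 5, 7), 10), ((1, 4, 6), 7)]
for _ in range(5):
    self_training_round(toy_model, problems, toy_finetune)
print(memory)
```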
In addition, the team’s TMLR paper, “Large-Scale Targeted Cause Discovery via Learning from Simulated Data,” introduced a supervised causal inference method for efficiently identifying causal variables of target factors. The proposed method scales linearly with the number of variables, making it suitable for large-scale systems, and achieved state-of-the-art causal discovery performance in gene regulatory network benchmarks.
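The "learning from simulated data" idea can be illustrated with the following sketch. The simulator, the single correlation feature, and the threshold detector are simplified stand-ins for the paper's neural model; what the sketch shows is the workflow of training a cause detector on simulated systems and then applying it, one variable at a time (hence linear cost), to an unseen system.

```python
# Illustrative sketch of supervised causal discovery trained on simulated data
# (a simplified stand-in, not the paper's model).
import numpy as np

rng = np.random.default_rng(0)

def simulate_system(n_vars=10, n_samples=200):
    """Random linear system: some variables causally drive the target."""
    is_cause = rng.random(n_vars) < 0.3
    weights = np.where(is_cause, rng.normal(0, 1.0, n_vars), 0.0)
    X = rng.normal(size=(n_samples, n_vars))
    target = X @ weights + rng.normal(0, 0.5, n_samples)
    return X, target, is_cause

def feature(x, target):
    """Single per-variable feature: absolute correlation with the target."""
    return abs(np.corrcoef(x, target)[0, 1])

# "Training": choose the correlation threshold that best separates causes from
# non-causes across many simulated systems (a one-parameter learned detector).
feats, labels = [], []
for _ in range(200):
    X, t, is_cause = simulate_system()
    feats.extend(feature(X[:, j], t) for j in range(X.shape[1]))
    labels.extend(is_cause)
feats, labels = np.array(feats), np.array(labels)
thresholds = np.linspace(0, 1, 101)
accs = [((feats > th) == labels).mean() for th in thresholds]
best_th = thresholds[int(np.argmax(accs))]

# Apply the learned detector to a new system: each variable is scored
# independently, so the cost grows linearly with the number of variables.
X, t, is_cause = simulate_system()
pred = np.array([feature(X[:, j], t) > best_th for j in range(X.shape[1])])
print("accuracy on new system:", (pred == is_cause).mean())
```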
More information: Jang-Hyun Kim et al, KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction, arXiv (2025). DOI: 10.48550/arxiv.2505.23416
Deokjae Lee et al, Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment, arXiv (2025). DOI: 10.48550/arxiv.2509.20214
Seungyong Moon et al, Learning to Better Search with Language Models via Guided Reinforced Self-Training, arXiv (2024). DOI: 10.48550/arxiv.2410.02992
Jang-Hyun Kim et al, Large-Scale Targeted Cause Discovery via Learning from Simulated Data, arXiv (2024). DOI: 10.48550/arxiv.2408.16218
Journal information: arXiv