In this article, we will add structure to our initial setup and focus on the memory, or state management, level of the app. By the end of this read, our AI Agent will be able to keep track of conversations with the user. We will define the scope of its work, set personality traits, and provide it with the tools to retrieve and save data to a database. We will also create a second agent to help organize these memories.

Memory is a crucial concept when designing AI Agents. Since Large Language Models (LLMs) are stateless — meaning they process each request independently without retaining information from previous interactions — it is up to the developer to design the application layer that provides memory. The comparison to human memory is useful for illustration, but it's important to recognize that LLMs operate fundamentally differently from the human brain.

There are a few main areas we can use to create the memory layer:

- System Prompt: Usually hidden from the user, this carries the bulk of the static context (personality, rules, and instructions). We will cover the basics of System Prompt usage here.
- Message History: This is the equivalent of short-term memory. Later in the article, I will show how to store chat messages and construct the chat API endpoint so it uses previous messages to add context.
- Vector Store: Vector stores act as long-term memory. They store data as numerical vectors (embeddings) and can quickly find information based on semantic similarity. We will implement a vector store to save space in the context window and reduce token usage.
- Function-Specific Memory: This is where Django adds immense value. For example, if we were creating an AI fitness coach, it would need to record and analyze weight changes. This allows the agent to maintain a persistent record of structured user data, enabling personalized responses.

Combined, these elements create a robust memory layer for our agent.

You need a clear idea of how your application should function before building the memory structure. In this article, I will mirror the most common design pattern found in commercial GPT apps, with a small twist. The base of our memory will be the Message records between the Agent and the User. Every User will have separate chat histories with any agents we create. Within the Agent-User exchange, we will introduce Conversation (sessions) for easier management of the context window. Finally, we will use embeddings to perform semantic searches over past conversations to retrieve relevant context.

Let's start by putting some structure into our project. We will focus on the chatbot app. We left off with the roll_a_dice example in api.py, which you can now delete. Now, create a new directory ../src/chatbot/services. Inside, create an empty __init__.py and a file named pydantic_agent_service.py. Add the following code:

```python
# ai-agent/backend/src/chatbot/services/pydantic_agent_service.py
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.ollama import OllamaProvider


class AiService:
    def __init__(self, model_name: str, base_url: str, system_prompt: str = "", max_tokens: int = 4096):
        self.model_name = model_name
        self.base_url = base_url
        self.max_tokens = max_tokens
        self.ollama_model = OpenAIChatModel(
            model_name,
            provider=OllamaProvider(base_url=base_url),
        )
        self.agent = Agent(
            self.ollama_model,
            system_prompt=system_prompt,
        )

    def chat_with_agent(self, message: str):
        print(f"Chatting with agent. Message: {message}")
        response = self.agent.run_sync(message)
        print(f"Timestamp: {response.timestamp()}")
        print(f"Agent usage: {response.usage()}")
        print(f"Message History: {response.all_messages()}")
        return response.output
```

This abstracts agent creation away from the API file, allowing it to be imported and reused anywhere.

Now, let's update api.py to create a /chat API endpoint that accepts and outputs a simple string:

```python
# ai-agent/backend/src/chatbot/api.py
import os

from ninja import Router

from .schemas import ChatbotMessageSchema, ChatbotResponseSchema
from .services.pydantic_agent_service import AiService

router = Router(tags=["chatbot"])


@router.post("/chat", response=ChatbotResponseSchema)
def chat(request, payload: ChatbotMessageSchema):
    ai_service = AiService(
        model_name="qwen3:latest",
        base_url=os.environ.get("OLLAMA_BASE_URL"),
        system_prompt="You are a helpful AI Assistant",
    )
    response_text = ai_service.chat_with_agent(message=payload.message)
    return ChatbotResponseSchema(response=response_text)
```

If you run your server now, you can access the /chat endpoint and introduce yourself. However, if you ask a follow-up question, you will see what "stateless" means in practice: the agent won't remember your name.

The solution is passing previous messages to the new conversation. We can simulate this by manually passing a history list. Open pydantic_agent_service.py and update the chat_with_agent method:

```python
# ai-agent/backend/src/chatbot/services/pydantic_agent_service.py
from pydantic_ai import Agent
from pydantic_ai.messages import ModelRequest, ModelResponse, TextPart, UserPromptPart


    # add hardcoded first-run messages to the second run
    def chat_with_agent(self, message: str, message_history: list = None):
        print(f"Chatting with agent. Message: {message}")
        first_run_history = [
            ModelRequest(parts=[UserPromptPart(content='My name is Tom, what is your name?')]),
            ModelResponse(parts=[TextPart(content='My name is Qwen. Nice to meet you, Tom! How can I assist you today?')]),
        ]
        response = self.agent.run_sync(message, message_history=first_run_history)
        return response.output
```

Great! The Agent was able to recall the name "Tom" from the history we provided. Now, let's structure our database tables to hold this information dynamically. We need to define Agent, Conversation, and Message models. We will use pgvector to store embeddings.
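If pgvector isn't set up in your environment yet, here is a minimal sketch of what that usually involves, assuming the `pgvector` Python package is installed and your PostgreSQL server ships the extension (the official `pgvector/pgvector` Docker images do). The package's Django integration provides a `VectorExtension` migration operation; if your version lacks it, the raw `CREATE EXTENSION vector;` statement mentioned below achieves the same thing.

```python
# Sketch: an empty migration that enables the extension before any
# VectorField migrations run. Adjust the dependencies to your app state.
from django.db import migrations
from pgvector.django import VectorExtension


class Migration(migrations.Migration):
    dependencies = []  # or point at your app's previous migration
    operations = [VectorExtension()]
```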
Open chatbot/models.py and copy the code below:

```python
# ai-agent/backend/src/chatbot/models.py
from django.db import models
from django.contrib.auth.models import User
from django.utils import timezone
from pgvector.django import VectorField


class Agent(models.Model):
    name = models.CharField(max_length=100)
    model_name = models.CharField(max_length=100, default="gemma3-32k:latest")
    system_prompt = models.TextField(blank=True)
    max_context_tokens = models.IntegerField(default=32768)  # Token limit for context window
    created_at = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return self.name


class Conversation(models.Model):
    agent = models.ForeignKey(Agent, on_delete=models.PROTECT, related_name='conversations')
    user = models.ForeignKey(User, on_delete=models.CASCADE, null=True, blank=True)
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)
    summary = models.TextField(blank=True, null=True)
    summary_generated_at = models.DateTimeField(blank=True, null=True)
    summary_needs_regeneration = models.BooleanField(default=False, help_text="Flag indicating new messages since last summary")
    summary_message_count = models.IntegerField(default=0, help_text="Number of messages included in current summary")
    short_term_memory = models.JSONField(default=dict, blank=True)
    embedding = VectorField(dimensions=768, null=True, blank=True, help_text="Embedding of the conversation summary")

    def __str__(self):
        username = self.user.username if self.user else "Anonymous"
        return f"User {username} conversation with {self.agent.name}. Conversation ID: {self.id}"

    def mark_for_summarization(self):
        # Mark conversation as needing summary regeneration
        self.summary_needs_regeneration = True
        self.save(update_fields=['summary_needs_regeneration', 'updated_at'])

    def update_summary(self, new_summary: str, embedding=None):
        # Update summary and reset regeneration flag
        self.summary = new_summary
        self.summary_generated_at = timezone.now()
        self.summary_needs_regeneration = False
        self.summary_message_count = self.messages.count()
        if embedding is not None:
            self.embedding = embedding
        self.save()


class Message(models.Model):
    conversation = models.ForeignKey(Conversation, on_delete=models.CASCADE, related_name='messages')
    user_prompt = models.TextField(help_text="The user's input message")
    ai_response = models.TextField(help_text="The AI's final text response")
    ai_reasoning = models.TextField(blank=True, null=True, help_text="Internal thought process (ThinkingPart)")

    # Metadata
    timestamp = models.DateTimeField(auto_now_add=True)
    embedding = VectorField(dimensions=768, null=True, blank=True)  # For pgvector semantic search

    class Meta:
        ordering = ['timestamp']

    def __str__(self):
        username = self.conversation.user.username if self.conversation.user else "Anonymous"
        message_number = self.conversation.messages.filter(timestamp__lte=self.timestamp).count()
        return f"Message {message_number} in conversation {self.conversation.id} for {username} with {self.conversation.agent.name}"
```

This file defines the database structure for your AI agent platform. Here's a simplified look:

- Agent: Represents a single AI agent. It stores information like the agent's name, the LLM it uses (model_name), a system_prompt to guide its behavior, and its context window size (max_context_tokens).
- Conversation: Represents a single conversation session between a user and an agent. It links to an Agent and a User. Crucially, it includes a short_term_memory JSON field to store the recent message history for that specific session. It also has fields for storing a summary of the conversation, which we will use later.
- Message: Represents a single turn in a conversation.
It stores the user_prompt and the ai_response. It also has an embedding field, which stores a vector representation of the exchange for long-term memory and semantic search.

Now, run the migrations. Note that pgvector requires a database extension. You may need to enable it in your database first (usually CREATE EXTENSION vector;).

```bash
# ai-agent/backend/src
python manage.py makemigrations
python manage.py migrate
```

Let's automate the creation of the first Agent. Run python manage.py makemigrations --empty chatbot. Open the created file and populate it:

```python
from django.db import migrations


def create_default_agents(apps, schema_editor):
    Agent = apps.get_model('chatbot', 'Agent')

    # Agent 1: General Assistant
    Agent.objects.create(
        name="General Assistant",
        model_name="gemma3-32k",
        system_prompt="You are a helpful AI assistant. Answer questions clearly and concisely.",
        max_context_tokens=32768,
    )


def remove_default_agents(apps, schema_editor):
    # Optional: Logic to reverse the migration
    Agent = apps.get_model('chatbot', 'Agent')
    Agent.objects.filter(name__in=["General Assistant", "Code Helper"]).delete()


class Migration(migrations.Migration):
    dependencies = [
        ('chatbot', '0001_initial'),  # Ensure this matches your previous migration file name
    ]
    operations = [
        migrations.RunPython(create_default_agents, remove_default_agents),
    ]
```

This migration file automatically creates a default AI agent when your Django application is set up. The create_default_agents function gets the Agent model and creates a new instance with a name, a specific model_name, a system_prompt to define its behavior, and a max_context_tokens limit. This limit is used by our history processor to keep the conversation's token count below the maximum supported by the model. Run python manage.py migrate to apply this.

Register your models in admin.py for easy debugging:

```python
# ai-agent/backend/src/chatbot/admin.py
from django.contrib import admin

from .models import Agent, Conversation, Message

admin.site.register(Agent)
admin.site.register(Conversation)
admin.site.register(Message)
```

Now, when you visit the admin panel at /admin on your dev server (port 8000), you will be able to view, create, and delete records for these models.

Update schemas.py to handle Agent selection and Conversation IDs:

```python
# ai-agent/backend/src/chatbot/schemas.py
from typing import Optional

from ninja import Schema


class ChatbotMessageSchema(Schema):
    message: str
    agent_id: int
    conversation_id: Optional[int] = None


class ChatbotResponseSchema(Schema):
    response: str
    conversation_id: int


class AgentSchema(Schema):
    id: int
    model_name: str
```

We need a service to generate embeddings and search for similar messages.
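Before wiring the embedding service into Django, it can help to confirm that the embedding model responds at all. A quick sanity check against Ollama's embeddings endpoint, assuming Ollama is running on its default port and the nomic-embed-text:v1.5 model has been pulled:

```python
# Quick sanity check: ask Ollama for an embedding directly.
import requests

resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text:v1.5", "prompt": "hello world"},
    timeout=30,
)
resp.raise_for_status()
vector = resp.json()["embedding"]
print(len(vector))  # nomic-embed-text produces 768 dimensions, matching our VectorField
```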
Create chatbot/services/embedding_service.py:

```python
# ai-agent/backend/src/chatbot/services/embedding_service.py
import os

import requests
from pgvector.django import CosineDistance


class EmbeddingService:
    def __init__(self, base_url: str = None):
        self.base_url = base_url or os.environ.get("OLLAMA_HOST", "http://localhost:11434")

    def generate_embedding(self, text: str, model: str = "nomic-embed-text:v1.5") -> list[float]:
        try:
            response = requests.post(
                f"{self.base_url}/api/embeddings",
                json={
                    "model": model,
                    "prompt": text,
                },
            )
            response.raise_for_status()
            return response.json()["embedding"]
        except requests.exceptions.RequestException as e:
            print(f"Error generating embedding: {e}")
            return None

    def search_similar_messages(self, query_embedding: list[float], user=None, agent=None, limit: int = 5):
        from ..models import Message

        queryset = Message.objects.filter(embedding__isnull=False)
        if user:
            queryset = queryset.filter(conversation__user=user)
        if agent:
            queryset = queryset.filter(conversation__agent=agent)

        # Use pgvector's CosineDistance annotation
        similar_messages = queryset.annotate(
            distance=CosineDistance('embedding', query_embedding)
        ).order_by('distance')[:limit]
        return similar_messages
```

First, update pydantic_agent_service.py to accept a dynamic history list:

```python
# ai-agent/backend/src/chatbot/services/pydantic_agent_service.py
from pydantic import TypeAdapter
from pydantic_ai import Agent, RunContext
from pydantic_ai.messages import ModelMessage
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.ollama import OllamaProvider


# Add AgentDeps class definition
class AgentDeps:
    def __init__(self, max_tokens: int, current_tokens: int):
        self.max_tokens = max_tokens
        self.current_tokens = current_tokens


def pruning_processor(ctx: RunContext[AgentDeps], messages: list[ModelMessage]) -> list[ModelMessage]:
    # Prunes short-term memory if token usage is too high.
    max_tokens = ctx.deps.max_tokens
    current_tokens = ctx.deps.current_tokens
    threshold = max_tokens * 0.5
    percentage = (current_tokens / max_tokens) * 100

    print(f"DEBUG PRUNING: Current tokens: {current_tokens}/{max_tokens} ({percentage:.1f}%)")
    print(f"DEBUG PRUNING: Threshold: {threshold} (50%)")
    print(f"DEBUG PRUNING: Message count: {len(messages)}")

    if current_tokens > threshold:
        keep_count = max(2, int(len(messages) * 0.2))
        print(f"⚠️ PRUNING TRIGGERED! Keeping last {keep_count} of {len(messages)} messages.")
        return messages[-keep_count:]

    print("✓ No pruning needed (under threshold)")
    return messages


class AiService:
    def __init__(self, model_name: str, base_url: str, system_prompt: str = "", max_tokens: int = 32768):
        self.model_name = model_name
        self.max_tokens = max_tokens
        self.ollama_model = OpenAIChatModel(
            model_name=model_name,
            provider=OllamaProvider(base_url=base_url),
        )
        # Create agent with static system prompt only
        self.agent = Agent(
            self.ollama_model,
            system_prompt=system_prompt,
            deps_type=AgentDeps,
            history_processors=[pruning_processor],
        )

    def chat_with_agent(self, message: str, message_history: dict = None, context_text: str = None):
        print(f"Chatting with agent. Message: {message}")
        if message_history is None:
            message_history = {"messages": [], "usage": {}}

        # Extract messages and usage from message_history dict
        messages_data = message_history.get("messages", [])
        usage = message_history.get("usage", {})
        current_tokens = usage.get("total_tokens", 0)

        # Deserialize messages from dict to ModelMessage objects
        if messages_data:
            messages = TypeAdapter(list[ModelMessage]).validate_python(messages_data)
            print(f"DEBUG: Loaded {len(messages)} messages from history")
            print(f"DEBUG: Message types: {[msg.kind for msg in messages]}")
        else:
            messages = []
            print("DEBUG: No message history found, starting fresh conversation")

        # Create deps with current token usage
        deps = AgentDeps(max_tokens=self.max_tokens, current_tokens=current_tokens)

        if context_text:
            message = f"Context:\n{context_text}\n\nUser Message:{message}"

        response = self.agent.run_sync(
            message,
            message_history=messages,
            deps=deps,
        )

        print(f"Agent usage: {response.usage()}")
        print(f"Total messages after run: {len(response.all_messages())}")
        return response
```

This file encapsulates the logic for interacting with the LLM. Here's a breakdown:

- AiService: This class is responsible for configuring and running the pydantic-ai agent.
- __init__: It sets up the connection to our self-hosted LLM via OllamaProvider and initializes the Agent. Crucially, it attaches a history_processors function, pruning_processor.
- pruning_processor: This function is the core of our short-term memory management. Before each LLM call, it checks the current token count of the conversation history. If the count exceeds a threshold (50% of the model's maximum), it prunes the history, keeping only the most recent messages. This prevents the conversation from failing due to an oversized context window.
- chat_with_agent: This method manages the call to the LLM. It takes the user's message, the existing message history, and any long-term memory context.
It bundles the token data into an AgentDeps object for the pruning_processor and then executes the chat, returning the complete response object.

Finally, rewrite api.py to orchestrate memory retrieval and storage.

```python
# ai-agent/backend/src/chatbot/api.py
import os

from django.shortcuts import get_object_or_404
from ninja import Router
from pydantic import TypeAdapter
from pydantic_ai.messages import ModelMessage, ThinkingPart

from .models import Agent, Conversation, Message
from .schemas import ChatbotMessageSchema, ChatbotResponseSchema, AgentSchema
from .services.pydantic_agent_service import AiService
from .services.embedding_service import EmbeddingService

router = Router(tags=["chatbot"])


@router.get("/agents", response=list[AgentSchema])
def list_agents(request):
    return Agent.objects.all()


@router.post("/chat", response=ChatbotResponseSchema)
def chat(request, payload: ChatbotMessageSchema):
    agent = get_object_or_404(Agent, id=payload.agent_id)

    # Get or create conversation
    if payload.conversation_id:
        conversation = get_object_or_404(Conversation, id=payload.conversation_id, agent=agent)
        print(f"DEBUG: Using existing conversation {conversation.id}")
    else:
        conversation = Conversation.objects.create(
            agent=agent,
            user=request.user,
        )
        print(f"DEBUG: Created new conversation {conversation.id}")

    # Initialize embedding service
    embedding_service = EmbeddingService(
        base_url=os.environ.get("OLLAMA_HOST")
    )

    # Generate embedding for the user's message (for searching)
    user_embedding = embedding_service.generate_embedding(
        text=payload.message,
        model="nomic-embed-text:v1.5"
    )

    # Search for relevant context across ALL user's conversations with this agent
    similar_messages = embedding_service.search_similar_messages(
        query_embedding=user_embedding,
        user=request.user,
        limit=3
    )

    # Prepare RAG Context (Long Term Memory)
    context_text = ""
    context_parts = []
    for msg in similar_messages:
        context_parts.append(f"User: {msg.user_prompt}\nAssistant: {msg.ai_response}")
    context_text = "\n---\n".join(context_parts)
    print(f"DEBUG: Found {len(similar_messages)} similar messages for RAG context")

    ai_service = AiService(
        model_name=agent.model_name,
        base_url=os.environ.get("OLLAMA_BASE_URL"),
        system_prompt=agent.system_prompt,
        max_tokens=agent.max_context_tokens,
    )

    # Debug: Check existing message history
    if conversation.short_term_memory:
        existing_messages = conversation.short_term_memory.get("messages", [])
        existing_usage = conversation.short_term_memory.get("usage", {})
        print(f"DEBUG: Existing message count: {len(existing_messages)}")
        print(f"DEBUG: Existing token usage: {existing_usage}")
    else:
        print("DEBUG: No existing message history")

    # Chat using the stored short-term memory + injected RAG context
    response = ai_service.chat_with_agent(
        message=payload.message,
        message_history=conversation.short_term_memory,
        context_text=context_text
    )

    # Debug: Check response usage
    print(f"DEBUG: Response usage - Input: {response.usage().input_tokens}, Output: {response.usage().output_tokens}, Total: {response.usage().input_tokens + response.usage().output_tokens}")
    print(f"DEBUG: Message count after response: {len(response.all_messages())}")

    # Update conversation state (Short Term Memory) - store as dict with messages and usage
    usage_dict = {
        "input_tokens": response.usage().input_tokens,
        "output_tokens": response.usage().output_tokens,
        "requests": response.usage().requests,
        "total_tokens": response.usage().input_tokens + response.usage().output_tokens,
    }
    all_messages = response.all_messages()
    conversation.short_term_memory = {
        "messages": TypeAdapter(list[ModelMessage]).dump_python(all_messages, mode='json'),
        "usage": usage_dict,
    }
    conversation.save()
    print(f"DEBUG: Saved {len(all_messages)} messages to conversation")
    print(f"DEBUG: Total tokens in conversation: {usage_dict['total_tokens']} / {agent.max_context_tokens}")
    print(f"DEBUG: Token usage: {(usage_dict['total_tokens'] / agent.max_context_tokens) * 100:.1f}%")

    # Extract reasoning if present
    reasoning = None
    for part in response.all_messages()[-1].parts:
        if isinstance(part, ThinkingPart):
            reasoning = part.content
            break

    # Generate embedding for the COMPLETE exchange (user + AI response)
    exchange_text = f"User: {payload.message}\nAssistant: {response.output}"
    exchange_embedding = embedding_service.generate_embedding(
        text=exchange_text,
        model="nomic-embed-text:v1.5"
    )

    # Save message to database (Long Term Memory / Logs)
    Message.objects.create(
        conversation=conversation,
        user_prompt=payload.message,
        ai_response=response.output,
        ai_reasoning=reasoning,
        timestamp=response.timestamp(),
        embedding=exchange_embedding,
    )
    conversation.mark_for_summarization()

    return ChatbotResponseSchema(
        response=response.output,
        conversation_id=conversation.id,
    )
```

This api.py file now acts as the central orchestrator for the agent's memory system. Placing the Retrieval-Augmented Generation (RAG) logic directly within the API view is not ideal for a production application. We have done so here for clarity and to demonstrate the end-to-end flow in a single place. In future articles, we will refactor this logic into a dedicated RAG service for better organization and re-usability.

Here is a step-by-step guide to how the chat endpoint works:

- Identify Agent and Conversation: It first identifies the correct Agent and retrieves the ongoing Conversation session. If no session exists, it creates a new one, linking the agent and the user.
- Retrieve Long-Term Memory (RAG): It generates an embedding for the user's new message and uses the EmbeddingService to search the database for the most semantically similar past messages. This retrieved information forms the context_text, which serves as the agent's long-term memory for the current turn.
- Call the Agent: It initializes the AiService with the specific agent's configuration (model, system prompt, token limit). It then calls chat_with_agent, passing the user's message, the short_term_memory from the current Conversation object, and the long-term context_text retrieved in the previous step.
- Update Short-Term Memory: After the LLM responds, the endpoint takes the complete, updated message history from the response object. This history, along with the new token usage statistics, is saved back into the conversation's short_term_memory field, preparing it for the next turn.
- Store in Long-Term Memory: Finally, it creates a new Message record to log the exchange. Crucially, it generates a new embedding for the complete exchange (user prompt + AI response). Storing the embedding of the complete exchange, rather than just the prompt, ensures that future searches can find more contextually complete and relevant examples.

With this structure, the agent can now maintain both a short-term conversational flow and recall relevant information from its long-term memory across all past conversations.
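If you want to poke at the endpoint by hand before building the test agent in the next section, here is a short sketch using requests. The host, port, and credentials are placeholders, and the token/pair endpoint is assumed to come from the ninja-jwt setup in the previous article:

```python
# Sketch of a manual round-trip against the new /chat endpoint.
import requests

BASE = "http://localhost:8000/api"  # adjust to your dev server

tokens = requests.post(
    f"{BASE}/token/pair",
    json={"username": "TestUser", "password": "Strong#Pass1"},
).json()
headers = {"Authorization": f"Bearer {tokens['access']}"}

first = requests.post(
    f"{BASE}/chatbot/chat", headers=headers,
    json={"message": "Hi, my name is Tom.", "agent_id": 1},
).json()
print(first["response"], first["conversation_id"])

# Reuse the returned conversation_id so short-term memory kicks in.
followup = requests.post(
    f"{BASE}/chatbot/chat", headers=headers,
    json={"message": "What is my name?", "agent_id": 1,
          "conversation_id": first["conversation_id"]},
).json()
print(followup["response"])
```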
## Testing and Summarization Agents

Now that our core memory system is in place, let's build some specialized tools to test its effectiveness and introduce a more advanced memory management technique: conversation summarization. We will create these tools as Django management commands.

A Django management command is a script that can be run from the command line using python manage.py <command_name>. They are perfect for automating tasks like testing, database maintenance, or, in our case, running specialized AI agents. To create a command, you must follow a specific directory structure within your app:

```
chatbot/
├── management/
│   └── commands/
│       └── your_command.py
```

The management and commands directories must each contain an empty __init__.py file. This tells Python to treat them as packages, which is essential for Django to discover and register your command.

To see how our short-term and long-term memory work together, we need to generate conversational data. Instead of manually typing, we can create an AI agent whose sole purpose is to act as a test user.

Create a new file: test_agent.py.

```python
# ai-agent/backend/src/chatbot/management/commands/test_agent.py
import os
import time

import requests
from django.core.management.base import BaseCommand
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.ollama import OllamaProvider


class Command(BaseCommand):
    help = 'Run test agent to interact with chatbot API'

    def add_arguments(self, parser):
        parser.add_argument('--turns', type=int, default=15, help='Number of conversation turns')
        parser.add_argument('--username', type=str, default='TestUser', help='Username for authentication')
        parser.add_argument('--password', type=str, default='Strong#Pass1', help='Password for authentication')
        parser.add_argument('--agent-id', type=int, default=1, help='Agent ID to chat with')

    def handle(self, *args, **options):
        API_BASE_URL = "http://192.168.1.229:8080/api"
        USERNAME = options['username']
        PASSWORD = options['password']
        AGENT_ID = options['agent_id']
        NUM_TURNS = options['turns']

        # Create Test User AI Agent
        ollama_model = OpenAIChatModel(
            model_name="gemma3:latest",
            provider=OllamaProvider(base_url=os.environ.get("OLLAMA_BASE_URL")),
        )
        test_user_agent = Agent(
            ollama_model,
            system_prompt="""You are a realistic user testing a chatbot. Your goal is to have natural, coherent conversations.
- Start with greetings or simple questions
- Ask follow-up questions based on the chatbot's responses
- Reference information from earlier in the conversation
- Occasionally introduce new topics naturally
- Show curiosity and engagement
- Keep messages conversational and brief (1-3 sentences)
- Act like a real human user would""",
        )

        test_user_messages = []

        def get_access_token():
            response = requests.post(
                f"{API_BASE_URL}/token/pair",
                json={"username": USERNAME, "password": PASSWORD}
            )
            response.raise_for_status()
            data = response.json()
            self.stdout.write(self.style.SUCCESS(f"✓ Authenticated as {data['username']}"))
            return data["access"]

        def chat_with_api(message, access_token, conversation_id=None):
            headers = {"Authorization": f"Bearer {access_token}"}
            payload = {"message": message, "agent_id": AGENT_ID}
            if conversation_id:
                payload["conversation_id"] = conversation_id
            response = requests.post(f"{API_BASE_URL}/chatbot/chat", json=payload, headers=headers)
            response.raise_for_status()
            return response.json()

        def generate_next_message(chatbot_response):
            nonlocal test_user_messages
            prompt = f"The chatbot just said: '{chatbot_response}'\n\nWhat do you reply?"
            result = test_user_agent.run_sync(prompt, message_history=test_user_messages)
            test_user_messages = list(result.all_messages())
            return result.output

        self.stdout.write("\nTest User Agent starting…")
        self.stdout.write(f"Username: {USERNAME}")
        self.stdout.write(f"Agent ID: {AGENT_ID}")
        self.stdout.write(f"Turns: {NUM_TURNS}\n")

        access_token = get_access_token()
        conversation_id = None
        current_message = "Hi! How are you today?"

        for turn in range(NUM_TURNS):
            self.stdout.write(f"\n{'='*60}")
            self.stdout.write(self.style.WARNING(f"[Turn {turn + 1}]"))
            self.stdout.write(f"{'='*60}")
            self.stdout.write(f"Test User: {current_message}")

            result = chat_with_api(current_message, access_token, conversation_id)
            chatbot_response = result["response"]
            conversation_id = result["conversation_id"]

            self.stdout.write(f"Chatbot: {chatbot_response}")
            self.stdout.write(self.style.SUCCESS(f"Conversation ID: {conversation_id}"))

            time.sleep(1.0)
            current_message = generate_next_message(chatbot_response)

        self.stdout.write(f"\n{'='*60}")
        self.stdout.write(self.style.SUCCESS(f"Test completed! Conversation ID: {conversation_id}"))
        self.stdout.write(f"{'='*60}")
```

This command simulates a full conversation, using its own AI agent to generate realistic user replies. Run it to generate some data:

```bash
python manage.py test_agent --turns 10 --username <username> --password <password>
```

I like to change the behaviour of the default agent for testing so it generates more tokens quickly, which makes it easier to see how memory trimming performs. You can do so by changing the System Prompt property in the admin panel.

As it runs, you will see the debug messages from api.py and pydantic_agent_service.py in your Django server console, showing the RAG context being retrieved and the short-term memory being pruned. The test agent's own output will show the conversation flow turn by turn.

The system works! The agent successfully uses both short-term memory (from the Conversation object) and long-term memory (via RAG from the Message embeddings). However, this granular, message-level vector search isn't always ideal. It can sometimes pull in disconnected snippets of conversation that are semantically similar but contextually irrelevant, potentially confusing the agent more than helping it. A better approach would be to retrieve context from a more coherent, high-level source.

This is where conversation summarization comes in. Instead of searching through individual messages, we can summarize entire past conversations and search through those summaries. This provides the agent with a condensed, high-level overview of a past interaction, which is often more useful for recalling broad context.

Let's create a command for this. In the commands directory, create the file summarize_conversations.py:

```python
# ai-agent/backend/src/chatbot/management/commands/summarize_conversations.py
import os

from django.core.management.base import BaseCommand
from django.db.models import Q
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.ollama import OllamaProvider

from chatbot.models import Conversation
from chatbot.services.embedding_service import EmbeddingService


class Command(BaseCommand):
    help = 'Summarize all conversations that need it'

    def handle(self, *args, **options):
        # Create summarization agent
        ollama_model = OpenAIChatModel(
            model_name="gemma3-128k",
            provider=OllamaProvider(base_url=os.environ.get("OLLAMA_BASE_URL")),
        )
        summarization_agent = Agent(
            ollama_model,
            system_prompt="""You are a professional conversation summarizer. Create clear, concise summaries of conversations.
- Identify and highlight key topics discussed
- Capture important facts, decisions, or information shared
- Note any questions asked or problems solved
- Preserve context and relationships between topics
- Keep summaries focused and structured
- Use bullet points for clarity when appropriate
- Write in third person perspective

The conversation covered [main topics]. The user discussed [key points]. Important information shared includes [facts/decisions]. [Any notable context or outcomes].""",
        )

        embedding_service = EmbeddingService(base_url=os.environ.get("OLLAMA_HOST"))

        # Find conversations that need summarization
        conversations = Conversation.objects.filter(
            Q(summary__isnull=True) | Q(summary_needs_regeneration=True)
        ).distinct()

        total = conversations.count()
        self.stdout.write(f'\n📊 Found {total} conversation(s) to summarize\n')
        if total == 0:
            self.stdout.write(self.style.WARNING('✓ All conversations are up to date'))
            return

        summarized = 0
        for conversation in conversations:
            messages = conversation.messages.all()
            if messages.count() == 0:
                continue

            self.stdout.write(f'\n📝 Conversation {conversation.id} ({messages.count()} messages)')

            try:
                # Build conversation text
                conversation_text = "\n\n".join([
                    f"User: {msg.user_prompt}\nAI: {msg.ai_response}"
                    for msg in messages
                ])

                prompt = f"Summarize this conversation:\n\n{conversation_text}"
                response = summarization_agent.run_sync(prompt)
                summary = response.output

                embedding = embedding_service.generate_embedding(
                    text=summary,
                    model="nomic-embed-text:v1.5"
                )

                conversation.update_summary(summary, embedding)
                summarized += 1
                self.stdout.write(self.style.SUCCESS('✅ Summarized'))
                self.stdout.write(f'   {summary[:100]}…')
            except Exception as e:
                self.stdout.write(self.style.ERROR(f'❌ Error: {e}'))

        self.stdout.write(f'\n✨ Done! Summarized {summarized} conversations\n')
```

This command defines a specialized summarization_agent, finds conversations that need updating, generates a summary, creates an embedding of that summary, and saves both to the Conversation model. Please note that I'm using a 128k context window for this task because of the potentially large amount of data to summarize. Run it to process the conversations from the test agent:

```bash
python manage.py summarize_conversations
```

Summarization is a great example of a service that would benefit from running at scheduled intervals. In future articles I will show you how to achieve this with Django Q.
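In the meantime, the command can also be invoked from Python via Django's call_command, which is handy once any scheduler (Django Q, Celery beat, or plain cron calling a small script) is in place. A minimal sketch:

```python
# Sketch: programmatic equivalent of `python manage.py summarize_conversations`,
# suitable for wiring into whatever scheduler you end up using.
from django.core.management import call_command

def regenerate_summaries():
    call_command("summarize_conversations")
```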
## Implementing the New RAG Strategy

Now we can replace our message-based RAG with a more powerful summary-based RAG. First, let's add a new method to our EmbeddingService to search for conversations.

```python
# ai-agent/backend/src/chatbot/services/embedding_service.py
import os

import requests
from pgvector.django import CosineDistance


class EmbeddingService:
    def __init__(self, base_url: str = None):
        self.base_url = base_url or os.environ.get("OLLAMA_HOST", "http://localhost:11434")

    def generate_embedding(self, text: str, model: str = "nomic-embed-text:v1.5") -> list[float]:
        try:
            response = requests.post(
                f"{self.base_url}/api/embeddings",
                json={
                    "model": model,
                    "prompt": text,
                },
            )
            response.raise_for_status()
            return response.json()["embedding"]
        except requests.exceptions.RequestException as e:
            print(f"Error generating embedding: {e}")
            return None

    def search_similar_messages(self, query_embedding: list[float], user=None, agent=None, limit: int = 5):
        from ..models import Message

        queryset = Message.objects.filter(embedding__isnull=False)
        if user:
            queryset = queryset.filter(conversation__user=user)
        if agent:
            queryset = queryset.filter(conversation__agent=agent)

        # Use pgvector's CosineDistance annotation
        similar_messages = queryset.annotate(
            distance=CosineDistance('embedding', query_embedding)
        ).order_by('distance')[:limit]
        return similar_messages

    def search_similar_conversations(self, query_embedding: list[float], user=None, agent=None, limit: int = 1):
        from ..models import Conversation

        queryset = Conversation.objects.filter(embedding__isnull=False)
        if user:
            queryset = queryset.filter(user=user)
        if agent:
            queryset = queryset.filter(agent=agent)

        similar_conversations = queryset.annotate(
            distance=CosineDistance('embedding', query_embedding)
        ).order_by('distance')[:limit]
        return similar_conversations
```

With this in place, we can update api.py to use this new method. Instead of injecting three separate messages, we will now find the single most relevant past conversation and inject its summary as context.

```python
# ai-agent/backend/src/chatbot/api.py

    # Remove the old message-level search and its context_parts loop, and
    # replace that block with the summary-based lookup below.

    # --- RAG Strategy Update ---
    # Search for the most relevant conversation summary, excluding the current one
    similar_conversations = embedding_service.search_similar_conversations(
        query_embedding=user_embedding,
        agent=agent,
    )

    context_text = ""
    if similar_conversations:
        # Get the most similar conversation that is not the current one
        most_similar_convo = next((c for c in similar_conversations if c.id != conversation.id), None)
        if most_similar_convo and most_similar_convo.summary:
            context_text = f"Here is a summary of a relevant past conversation:\n{most_similar_convo.summary}"
            print(f"DEBUG: Found relevant conversation {most_similar_convo.id} for RAG context")
```

This new approach provides a much more coherent and condensed block of long-term memory for the agent to work with. By feeding it a summary of a relevant past conversation, we give it the "gist" of a previous interaction, leading to more contextually aware and intelligent responses without the noise of individual, out-of-context messages.
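One refinement worth considering: if no past conversation is genuinely related, even the closest summary can be noise. A sketch of a helper that caps the cosine distance so weak matches are ignored entirely; the function name and the 0.5 cutoff are illustrative, not part of the article's code, and the threshold should be tuned against your own data:

```python
# Sketch: only accept summaries that are reasonably close to the query embedding.
from pgvector.django import CosineDistance
from chatbot.models import Conversation

def best_matching_summary(query_embedding, user, agent, max_distance=0.5):
    match = (
        Conversation.objects
        .filter(embedding__isnull=False, user=user, agent=agent)
        .annotate(distance=CosineDistance("embedding", query_embedding))
        .filter(distance__lt=max_distance)   # drop weak matches entirely
        .order_by("distance")
        .first()
    )
    return match.summary if match else ""
```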
## Function-Specific Memory: Structured Data Storage

We've covered short-term memory (conversation history), long-term memory (vector search over past conversations), and even how to summarize conversations for better context retrieval. But there's one more crucial type of memory that makes AI agents truly powerful: structured, function-specific data storage.

Let's assume you want to create an AI agent that will track your weight and provide trend analysis. In theory, our chat history and vector store contain all the data required for this task. However, if you rely solely on this type of memory, it gets harder and harder for the Agent to retrieve accurate data as the number of measurements grows. Vector search might recall that "the user mentioned 75kg on Tuesday," but it can't efficiently answer "show me all measurements from the last 30 days" or "calculate my average weight this month."

This is where Django truly shines. By creating dedicated models and APIs for specific domains, we can give our AI agent the ability to not just remember conversations, but to actively track, analyze, and act upon structured user data with precision and reliability. Let's build a practical example: a weight tracking feature that allows our AI agent to record and retrieve weight measurements.

## Two Approaches to Agent Tools

Before we dive in, it's important to understand that there are two ways to give your AI agent access to structured data:

Approach 1: Direct Database Access (what we'll use)
- Agent tools directly query the Django ORM
- Faster (no HTTP overhead)
- Perfect when the agent and Django run in the same process
- Best for getting started and for most simple use cases

Approach 2: API-Based Tools
- Agent tools make HTTP requests to your API endpoints
- More modular and reusable
- Can be used by external agents or services
- Better for microservices architecture
- Adds complexity (auth, error handling, network issues)

For this article, we'll use Approach 1 because simplicity is key. We'll explore API-based tools in future articles when we cover multi-agent systems and external integrations.

First, create a new Django app called fitness:

```bash
cd backend/src
python manage.py startapp fitness
```

Don't forget to add 'fitness' to your INSTALLED_APPS in settings.py.

Now, let's define a simple model to track weight measurements. Open fitness/models.py:

```python
# ai-agent/backend/src/fitness/models.py
from django.db import models
from django.contrib.auth.models import User


class WeightMeasurement(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE, related_name='weight_measurements')
    weight = models.DecimalField(max_digits=5, decimal_places=2, help_text="Weight in kg")
    measured_at = models.DateField(help_text="Date of measurement")
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        ordering = ['-measured_at']
        unique_together = ['user', 'measured_at']

    def __str__(self):
        return f"{self.user.username} - {self.weight}kg on {self.measured_at}"
```

Run the migrations:

```bash
python manage.py makemigrations
python manage.py migrate
```

Register the model in fitness/admin.py for easy inspection:

```python
# ai-agent/backend/src/fitness/admin.py
from django.contrib import admin

from .models import WeightMeasurement

admin.site.register(WeightMeasurement)
```

That's all we need! No API endpoints, no schemas — just a clean Django model. The agent will interact with this data directly through its tools.
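To see why structured storage pays off, here is the kind of precise query the agent's tools can now run trivially, the very question vector search struggles with. A short sketch using Django's ORM aggregates over the model defined above (the helper name is just for illustration):

```python
# Sketch: thirty-day average weight for a user, straight from the ORM.
from datetime import date, timedelta
from django.db.models import Avg
from fitness.models import WeightMeasurement

def average_weight_last_30_days(user):
    cutoff = date.today() - timedelta(days=30)
    return (
        WeightMeasurement.objects
        .filter(user=user, measured_at__gte=cutoff)
        .aggregate(avg=Avg("weight"))["avg"]
    )
```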
## Giving the Agent Access to Fitness Data

Now comes the exciting part: teaching our AI agent to use this model. We do this by defining tools in Pydantic AI. A tool is a Python function that the agent can call when it needs to perform a specific action.

Update chatbot/services/pydantic_agent_service.py:

```python
# ai-agent/backend/src/chatbot/services/pydantic_agent_service.py
from datetime import datetime, date
from typing import Optional

from pydantic import TypeAdapter
from pydantic_ai import Agent, RunContext
from pydantic_ai.messages import ModelMessage
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.ollama import OllamaProvider


# Add AgentDeps class definition
class AgentDeps:
    def __init__(self, max_tokens: int, current_tokens: int, user=None):
        self.max_tokens = max_tokens
        self.current_tokens = current_tokens
        self.user = user


def pruning_processor(ctx: RunContext[AgentDeps], messages: list[ModelMessage]) -> list[ModelMessage]:
    # Prunes short-term memory if token usage is too high.
    max_tokens = ctx.deps.max_tokens
    current_tokens = ctx.deps.current_tokens
    threshold = max_tokens * 0.5
    percentage = (current_tokens / max_tokens) * 100

    print(f"DEBUG PRUNING: Current tokens: {current_tokens}/{max_tokens} ({percentage:.1f}%)")
    print(f"DEBUG PRUNING: Threshold: {threshold} (50%)")
    print(f"DEBUG PRUNING: Message count: {len(messages)}")

    if current_tokens > threshold:
        keep_count = max(2, int(len(messages) * 0.2))
        print(f"⚠️ PRUNING TRIGGERED! Keeping last {keep_count} of {len(messages)} messages.")
        return messages[-keep_count:]

    print("✓ No pruning needed (under threshold)")
    return messages


class AiService:
    def __init__(self, model_name: str, base_url: str, system_prompt: str = "", max_tokens: int = 32768):
        self.model_name = model_name
        self.max_tokens = max_tokens
        self.ollama_model = OpenAIChatModel(
            model_name=model_name,
            provider=OllamaProvider(base_url=base_url),
        )
        self.agent = Agent(
            self.ollama_model,
            system_prompt=system_prompt,
            deps_type=AgentDeps,
            history_processors=[pruning_processor],
        )
        self._register_fitness_tools()

    def _register_fitness_tools(self):
        @self.agent.tool
        def record_weight(ctx: RunContext[AgentDeps], weight: float, measured_at: str) -> str:
            """
            Record a weight measurement for the user.

            weight: Weight in kilograms (e.g., 75.5)
            measured_at: Date of measurement in YYYY-MM-DD format (e.g., "2025-12-12")
            """
            from fitness.models import WeightMeasurement

            try:
                measurement_date = datetime.strptime(measured_at, "%Y-%m-%d").date()
                WeightMeasurement.objects.create(
                    user=ctx.deps.user,
                    weight=weight,
                    measured_at=measurement_date,
                )
                return f"Successfully recorded weight of {weight}kg on {measured_at}"
            except Exception as e:
                return f"Error recording weight: {str(e)}"

        @self.agent.tool
        def get_weight_history(
            ctx: RunContext[AgentDeps],
            from_date: Optional[str] = None,
            to_date: Optional[str] = None,
            last_n: Optional[int] = None,
        ) -> str:
            """
            Retrieve weight measurement history for the user.

            from_date: Start date in YYYY-MM-DD format (optional)
            to_date: End date in YYYY-MM-DD format (optional)
            last_n: Number of most recent measurements to retrieve (optional)
            """
            from fitness.models import WeightMeasurement

            try:
                queryset = WeightMeasurement.objects.filter(user=ctx.deps.user)
                if from_date:
                    queryset = queryset.filter(measured_at__gte=datetime.strptime(from_date, "%Y-%m-%d").date())
                if to_date:
                    queryset = queryset.filter(measured_at__lte=datetime.strptime(to_date, "%Y-%m-%d").date())
                if last_n:
                    queryset = queryset[:last_n]

                measurements = list(queryset)
                if not measurements:
                    return "No weight measurements found for the specified criteria."

                result = "Weight measurements:\n"
                for m in measurements:
                    result += f"- {m.measured_at}: {m.weight}kg\n"
                return result
            except Exception as e:
                return f"Error retrieving weight history: {str(e)}"

    def chat_with_agent(self, message: str, message_history: dict = None, context_text: str = None, user=None):
        print(f"Chatting with agent. Message: {message}")
        if message_history is None:
            message_history = {"messages": [], "usage": {}}

        # Extract messages and usage from message_history dict
        messages_data = message_history.get("messages", [])
        usage = message_history.get("usage", {})
        current_tokens = usage.get("total_tokens", 0)

        # Deserialize messages from dict to ModelMessage objects
        if messages_data:
            messages = TypeAdapter(list[ModelMessage]).validate_python(messages_data)
            print(f"DEBUG: Loaded {len(messages)} messages from history")
            print(f"DEBUG: Message types: {[msg.kind for msg in messages]}")
        else:
            messages = []
            print("DEBUG: No message history found, starting fresh conversation")

        # Create deps with current token usage and user
        deps = AgentDeps(max_tokens=self.max_tokens, current_tokens=current_tokens, user=user)

        if context_text:
            message = f"Context:\n{context_text}\n\nUser Message:{message}"

        response = self.agent.run_sync(
            message,
            message_history=messages,
            deps=deps,
        )

        print(f"Agent usage: {response.usage()}")
        print(f"Total messages after run: {len(response.all_messages())}")
        return response
```

What changed:

- We updated AgentDeps to include a user object, which allows our tools to access the current user's data.
- We created two tools decorated with @self.agent.tool: record_weight saves a weight measurement directly to the database, and get_weight_history retrieves weight records with flexible filtering.
- Both tools use Django's ORM (WeightMeasurement.objects) to interact with the database.
- The tools return formatted strings that the agent can use in its responses.

Finally, update the chat endpoint in chatbot/api.py to pass the user to the agent:

```python
# ai-agent/backend/src/chatbot/api.py
    # ...existing code...
    response = ai_service.chat_with_agent(
        message=payload.message,
        message_history=conversation.short_term_memory,
        context_text=context_text,
        user=request.user  # Add this line
    )
    # ...existing code...
```

## Crafting an Effective System Prompt

Now that our agent has access to fitness tracking tools, we need to update its system prompt to guide its behavior. The system prompt is arguably the most important aspect of an AI agent — it defines personality, capabilities, and behavioral rules.

A good system prompt for a fitness-focused agent should:

- Define the agent's role and personality
- List available capabilities explicitly
- Set behavioral guidelines and constraints
- Provide examples of good interactions
- Establish data handling and privacy rules

Here's an example of a well-structured system prompt. Update your Agent in the database (via Django admin) with this prompt:

```
You are FitCoach, a supportive and knowledgeable AI fitness assistant.
Your primary role is to help users track their weight and understand their progress.

- Record weight measurements with dates
- Retrieve weight history with flexible date ranges
- Analyze trends and provide insights
- Offer encouragement and celebrate progress

1. Always be supportive and non-judgmental about weight
2. When a user mentions their weight, ask if they'd like you to record it
3. Use the tools available to you - don't just remember data, actually save it
4. When showing weight history, highlight trends (gaining, losing, maintaining)
5. Respect privacy - only access this user's data, never mention other users
6. If a user asks about weight from a specific date, use get_weight_history with appropriate filters
7. Keep responses concise but encouraging

Example Interactions:

User: "I weighed myself today, I'm 75kg"
You: [Call record_weight(75.0, "2025-12-12")] "Perfect! I've recorded 75kg for today. Keep up the good work!"

User: "What's my weight been like this month?"
You: [Call get_weight_history(from_date="2025-12-01")] [Analyze results] "Here's your weight trend for December…"

- Use the actual date mentioned by the user, or today's date if not specified
- When retrieving history, use appropriate filters (from_date, to_date, or last_n)
- Provide context with the data - don't just list numbers
```

In the admin dashboard, I created a new AI Agent with this prompt. I like to use Qwen3 in development thanks to its thinking capabilities, which greatly improve debugging. After saving the changes, you can see all available Agents by sending a GET request to the chatbot/agents endpoint.

Now I can ask the agent to record my weight, and the record is kept in the DB. We can also inspect the message exchange, including the thinking part. After adding a few more records, we can ask the agent to retrieve them and tell us about them; the stored data and the agent's reply line up.

Congratulations! You've just built an AI agent system with a multi-layered memory architecture. Let's recap what we've accomplished in this article.

Memory Layers We Implemented:

- Short-Term Memory: Conversation history stored in the Conversation model's short_term_memory JSON field, with automatic token-based pruning to prevent context window overflow
- Long-Term Memory: Vector embeddings stored in pgvector, enabling semantic search across all past conversations
- Conversation Summaries: Automated conversation summaries that provide coherent, high-level context for RAG retrieval
- Function-Specific Memory: Structured data storage using Django models, allowing the agent to track domain-specific information (like weight measurements) with precision

Key Components:

- Models: Agent, Conversation, Message, and WeightMeasurement models that form the backbone of our memory system
- AiService: A clean abstraction that handles agent initialization, tool registration, and conversation management
- EmbeddingService: Semantic search powered by Ollama's embedding models and PostgreSQL's pgvector extension
- Management Commands: Automated testing and summarization tools to maintain and validate the system
- API Layer: RESTful endpoints for interacting with agents and conversations

This foundation gives you a production-ready starting point for building intelligent, context-aware AI agents that can truly assist users over extended periods of time.

While our system is functional, there's significant room for improvement in terms of organization, scalability, and advanced features. In the next article, we'll tackle:

1. Better Project Structure
- Refactoring RAG logic from api.py into a dedicated RAGService
- Organizing services into logical modules
- Implementing proper separation of concerns
- Creating reusable components for agent orchestration
2. AI Agent Workflows & Orchestration
- Building multi-step agent workflows (research → plan → execute)
- Coordinating multiple specialized agents working together
- Implementing agent handoffs and delegation patterns
- Creating supervisor agents that manage other agents

3. Tool Calling Without Native Support
- Making agents work with models that don't support function calling
- Implementing tool calling via structured output parsing
- Creating fallback strategies for tool execution
- Building prompt-based tool selection mechanisms

4. Performance & Production Considerations
- Caching strategies for embeddings and responses
- Async processing for long-running agent tasks
- Rate limiting and cost management
- Monitoring and observability

Thank you for reading and see you next time!