How we built DeepSeek Copilot for Production at Scale
Source: Diagram by the author.
Introduction
The past year has seen explosive interest in generative AI, but enterprises have learned that true value comes from pairing large language models with their own proprietary data. Retrieval-Augmented Generation (RAG) has emerged as a powerful technique to give AI an “open-book test” — allowing models to consult internal knowledge bases for up-to-date, factual information. The promise is enticing: employees and customers can get instant, accurate answers from a chatbot or assistant (a “copilot”) that knows your business inside and out, from policy manuals to product documentation. However, building such an enterprise RAG system without compromise is no trivial task. Many early solutions cut corners on data quality, scalability, or security, leading to brittle systems that hallucinate answers or expose sensitive data.
DeepSeek Copilot is our answer to this challenge — a production-grade RAG platform designed with no compromises on accuracy, performance, or governance. In this article, we share a comprehensive journey of architecting DeepSeek Copilot for use at scale in a large organization. We’ll begin with a strategic overview for technology leaders on why this approach matters, then dive deep into the technical architecture and code. Along the way, we’ll address each major challenge in building a robust RAG pipeline: ingestion of enterprise data, efficient retrieval and ranking, mitigation of AI hallucinations, observability for monitoring and tracing, horizontal scalability, and strict security/compliance controls. We’ll also weigh the Build vs. Buy decision that many organizations face when implementing an enterprise RAG solution, examining the costs and trade-offs of rolling your own system like DeepSeek Copilot versus purchasing a managed platform.
By the end of this deep dive, you should appreciate not only what a best-in-class RAG system looks like, but why each component is essential for an enterprise-ready “copilot.” We’ll illustrate key points with code snippets, from data pipelines and hybrid search queries to Kubernetes deployment manifests and tracing hooks, highlighting practical implementation details. DeepSeek Copilot demonstrates that it’s possible to build a highly customized, scalable RAG system that delivers accurate, grounded knowledge to users without compromising on the qualities that enterprises care about: factual correctness, responsiveness, uptime, and data security.
(To keep this discussion concrete, we’ll reference the DeepSeek Copilot architecture and code. However, the principles apply generally to enterprise RAG systems. All significant code and configuration for DeepSeek Copilot will be available in the open-source GitHub repository.)
Source: Diagram by the author.
Data Ingestion and Indexing Pipeline
Building an enterprise RAG system starts with assembling a comprehensive, high-quality knowledge base from the organization’s data. In DeepSeek Copilot, the ingestion pipeline is responsible for loading raw content from diverse sources, preprocessing it into a suitable format, and indexing it into a vector database for efficient retrieval. This stage is foundational — the old adage “garbage in, garbage out” applies strongly to RAG. If your document corpus is incomplete or poorly processed, even the best language model will give unsatisfactory answers. We set several goals for the ingestion pipeline: maximize coverage of relevant data sources, preserve the context and structure of documents, enrich the data with metadata, and ensure the index is updated as information changes.
Connecting to enterprise data sources: In a real deployment, relevant data may span file repositories (PDFs, Word documents), wikis or SharePoint sites, databases, email archives, and more. DeepSeek Copilot uses connectors and ETL jobs for each source type. For example, we built connectors to SharePoint and Confluence to pull documents periodically, as well as jobs to export records from certain SQL databases. All content flows into a unified processing pipeline. We also implemented scheduling and change detection — for instance, monitoring file modification timestamps — so that new or updated documents are picked up and re-indexed without manual intervention. This avoids the RAG system answering with stale information. In production, an ingestion pipeline might be orchestrated with a workflow engine or message queue to handle continuous updates at scale.
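As a minimal illustration of the change-detection idea (the directory layout and re-ingestion hook here are hypothetical placeholders, not the actual connector code):

import os

def find_changed_files(root_dir, last_indexed_at):
    """Return paths of files modified since the last indexing run (a Unix timestamp)."""
    changed = []
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > last_indexed_at:
                changed.append(path)
    return changed

# A scheduler (cron, Airflow, or a message-queue consumer) calls this periodically
# and re-ingests only the files that changed since the previous run.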
Preprocessing and cleaning: Once raw data is collected, it passes through cleaning and normalization. This includes steps like:
- Segmentation (Chunking): Split documents into semantically coherent chunks suitable for embedding. A common approach is splitting by a maximum character length (e.g. 500 tokens), but we found that respecting natural document structure yields better results. Using headings, paragraphs, or semantic units avoids breaking context mid-sentence. For example, DeepSeek’s pipeline applies element-based chunking, preserving section headings and combining small subsections so that each chunk is a self-contained idea. We also use a slight overlap between chunks to avoid losing context at boundaries (e.g. repeating a sentence or two between adjacent chunks).
- Metadata enrichment: As we process each document, we attach metadata that will be useful later. This can include the source (e.g. file path or URL), document title, creation date, author, content type, and access permissions. Metadata is invaluable for filtering and ranking results at query time. For instance, if a user asks “Show me HR policies,” we can boost or filter results that have category: HR Policy in their metadata.
To illustrate, here is a simplified example of DeepSeek’s document parsing and chunking logic, using a PDF loader and a text splitter that preserves structure:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from datetime import datetime

def ingest_document(file_path):
    # Load the document (e.g., PDF) into text
    loader = PyPDFLoader(file_path)
    documents = loader.load()  # returns a list of Document objects with metadata
    # Split into chunks with a size and overlap, using multiple separators to respect structure
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        separators=["\n\n", "\n", " ", ""]
    )
    chunks = text_splitter.split_documents(documents)
    # Example: add custom metadata
    for chunk in chunks:
        chunk.metadata["source_file"] = file_path
        chunk.metadata["ingested_at"] = datetime.utcnow().isoformat()
    return chunks

# Ingest a batch of files (list_enterprise_files is a helper that enumerates files in the source location)
all_chunks = []
for file in list_enterprise_files("/data/policies"):
    all_chunks.extend(ingest_document(file))
print(f"Ingested {len(all_chunks)} chunks from policy documents.")
In this snippet, we load a PDF, split it into roughly 500-character overlapping chunks, and tag each chunk with metadata. In a real pipeline, similar logic would apply to other content types (HTML pages, emails, etc.), possibly using specialized loaders for each. The result of ingestion is a collection of thousands (or millions) of text chunks, each with associated metadata, ready to be embedded and indexed.
Vectorization (Embedding) and storage: After chunking, each piece of text needs to be transformed into a numerical vector for similarity search. DeepSeek Copilot employs a high-dimensional embedding model to generate vectors that capture semantic meaning. Initially, we used OpenAI’s text-embedding-ada-002 model for convenience, which produced 1536-dimensional vectors. This ensured strong semantic search out-of-the-box, but sending sensitive text to an external API was not ideal for our compliance needs. We later migrated to an in-house embedding model based on SentenceTransformers (fine-tuned on our domain data), so all embeddings are generated internally. Domain-specific fine-tuning of embeddings can noticeably boost accuracy for specialized jargon and phrasing.
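As a sketch of the in-house embedding step (the public MiniLM checkpoint below is only an illustrative stand-in for our fine-tuned model, and all_chunks comes from the ingestion snippet above):

from sentence_transformers import SentenceTransformer

# Illustrative public checkpoint; in production this would be a domain fine-tuned model
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

texts = [chunk.page_content for chunk in all_chunks]
vectors = embedder.encode(texts, batch_size=64, show_progress_bar=True)
print(f"Computed {len(vectors)} embeddings of dimension {len(vectors[0])}")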
Once we have vector representations, they are upserted into a vector database. The choice of vector store is crucial for scale. For moderate volumes or prototypes, using an in-memory library like FAISS can be sufficient (and indeed FAISS is used under the hood by many vector DBs). But at enterprise scale — where we might index millions of chunks and need sub-second query latency — a dedicated vector database is recommended. We selected Weaviate as our vector DB for DeepSeek Copilot due to its scalability and hybrid search support (more on that shortly), but other excellent options include Pinecone, Milvus, and OpenSearch’s vector engine. The vector DB enables fast nearest-neighbor search in the embedding space, typically using approximate algorithms (e.g., HNSW indexing) to handle large data volumes efficiently.
The ingestion pipeline, therefore, culminates in populating the vector store: each chunk’s vector along with its metadata is stored and indexed for future retrieval. This entire process is orchestrated as a batch job that can run nightly or continuously. In production, you’d containerize this pipeline and perhaps use Kubernetes CronJobs or workflow schedulers to manage regular re-indexing. DeepSeek Copilot’s indexing jobs also produce logs and metrics (e.g., number of documents processed, any errors) so that data engineers can monitor the freshness and health of the knowledge base.
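A minimal sketch of that final upsert, assuming the v3 Weaviate Python client and an illustrative DocumentChunk class (the endpoint and property names are placeholders, not the actual schema):

import weaviate

client = weaviate.Client("http://weaviate:8080")  # cluster-internal endpoint (placeholder)
client.batch.configure(batch_size=100)

# Store each chunk's text, metadata, and precomputed vector
with client.batch as batch:
    for chunk, vec in zip(all_chunks, vectors):
        batch.add_data_object(
            data_object={
                "text": chunk.page_content,
                "source_file": chunk.metadata.get("source_file"),
                "ingested_at": chunk.metadata.get("ingested_at"),
            },
            class_name="DocumentChunk",
            vector=list(vec),
        )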
Key best practices in ingestion:
- Ensure data quality and structure: Maintain document hierarchies and clean text to improve downstream retrieval.
- Right-size the chunks: Aim for chunks that are neither too large (risking dilution of relevance) nor too small (losing context). 300–500 tokens with overlap is a common sweet spot, but adjust for your domain.
- Enrich with metadata: Tag chunks with useful metadata (source, date, access level) at ingestion time to enable filtered and context-aware search later.
- Automate continuous ingestion: Business data is not static — build pipelines that can update the index incrementally as new information arrives, keeping the RAG system’s knowledge up-to-date.
With our enterprise knowledge now indexed in a vector database, we can turn to the online query path: how DeepSeek Copilot retrieves relevant information for a user’s question in real time.
Retrieval and Relevance: Finding the Right Information Fast
When a user poses a question to DeepSeek Copilot (for example, “What is our parental leave policy?”), the system must rapidly sift through the indexed vectors to find the most relevant pieces of information to help answer that question. This retrieval step is the backbone of RAG — it ensures the language model has the right grounding context to generate an accurate answer. In designing the retrieval pipeline, we focused on precision, recall, and speed:
- Precision: retrieved chunks should be highly relevant to the query, to maximize the chances the answer will be correct.
- Recall: we don’t want to miss important information; the system should surface relevant content even if phrased differently or in unexpected documents.
- Speed: the search has to be efficient, as it lies in the critical path of user query latency.
Achieving all three can be challenging, so DeepSeek Copilot employs multiple strategies to balance them. At a high level, our retrieval pipeline involves embedding the user query, performing vector similarity search in the vector database, optionally blending in lexical search, and then ranking/filtering the results.
1. Query embedding and vector search: Upon receiving a query, the first step is to generate its embedding vector using the same encoder model used for the documents. For example, with a SentenceTransformer model:
query = "What is our parental leave policy?"
query_vec = embedder.encode([query])[0]  # produces a 768-dim or 1536-dim vector
This vector representation captures the semantic gist of the question. We then issue a similarity search against the vector DB, asking for the top K nearest neighbor vectors to query_vec. The vector database returns, say, the top 10 chunks most similar to the query. Each chunk comes with its original text and metadata (like which document it came from). In Weaviate or Pinecone, this is a single API call (e.g., index.query(vector=query_vec, top_k=10)).
Under the hood, the vector DB uses an approximate nearest neighbor index (like HNSW) to find results in milliseconds even if we have millions of vectors. This pure vector search works great for many queries, especially those that are conceptual or phrased differently than the source text. For example, a query “details on maternity leave duration” can still match a chunk containing “Parental leave is up to 12 weeks…” because semantically they are close, even if exact words differ.
2. Hybrid retrieval (vector + keyword): While embeddings are powerful, they are not infallible. Sometimes a user’s query contains specific keywords, codes, or phrases that we want to match exactly. Or a query might be very broad, where purely semantic search can return tangential results. To address this, DeepSeek Copilot adopts a hybrid retrieval approach: combining vector similarity with traditional lexical search (BM25 keyword matching). We integrated this using Weaviate’s hybrid search feature, which allows blending the two methods. In essence, the query is run both through vector similarity and through an inverted index, and results are merged. We can weight the contribution of semantic vs. lexical scores via a parameter (often called alpha). An example call using a hybrid search might look like:
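(The call below is a sketch, assuming the v3 Weaviate Python client and an illustrative DocumentChunk class rather than the exact DeepSeek schema.)

results = (
    client.query
    .get("DocumentChunk", ["text", "source_file"])
    .with_hybrid(query=user_query, alpha=0.5)  # blend BM25 and vector scores equally
    .with_limit(10)
    .do()
)
hits = results["data"]["Get"]["DocumentChunk"]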
In a call like this, alpha=0.5 gives equal weight to keyword and vector similarities. A higher alpha leans more on semantic similarity, while a lower alpha relies more on exact keyword matching. Hybrid retrieval proved useful in our testing. For instance, if someone asks about a policy by an older name or acronym, the lexical component can catch that, whereas pure embedding search might miss it. Conversely, if the query is vague, the semantic component ensures we still retrieve relevant context. If you’re not using a vector DB that natively supports hybrid search, you can implement it manually: e.g., query an Elasticsearch index for keywords and merge those results with the vector results by re-scoring or interleaving.
3. Preliminary ranking and filtering: The vector DB returns a set of candidate chunks. We often get more than we ultimately want to pass to the language model (which has context length limitations). For example, we might retrieve 10–20 chunks, but only include the top 3–5 in the final prompt to the LLM to avoid overloading it. Thus, we need a ranking step to sort the candidates by relevance. By default, the similarity score from the vector search provides a ranking, but we can refine it:
- We apply a simple heuristic boost if multiple chunks come from the same document (assuming that if one chunk was relevant, others from the same source might be too — though we also want diverse coverage, so we handle this carefully).
- We might downrank chunks that are very similar to each other to avoid redundancy.
- If access control is a concern, this is where we filter out any chunks the user shouldn’t see (using the metadata tags for permissions).
For high-recall applications, an advanced approach is to use a cross-encoder re-ranker — a secondary model (like a fine-tuned BERT) that takes the query and each candidate chunk and outputs a relevance score. This can significantly improve precision for nuanced queries, at the cost of some extra latency. In DeepSeek Copilot, we experimented with a cross-encoder for very important queries (it improved quality on long, complex questions), but in many cases the vector similarity alone, given good embeddings, was sufficient. Organizations with extremely high accuracy requirements might incorporate such a re-ranker into the pipeline.
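As a sketch of how such a re-ranker could slot in (the public MS MARCO checkpoint is an illustrative stand-in for a fine-tuned model, and candidates are assumed to expose a .text attribute):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_k=5):
    # Score each (query, chunk) pair jointly, then keep the highest-scoring chunks
    pairs = [(query, doc.text) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]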
Example — retrieving and ranking: Putting it together, a simplified retrieval function in DeepSeek Copilot looks like this:

def retrieve_relevant_chunks(user_query, top_k=10):
    # 1. Embed the query
    q_vec = embedder.encode([user_query])[0]
    # 2. Vector search in the index
    raw_results = vector_index.query(q_vec, top_k=top_k*2)  # get extra for hybrid/rerank
    # (Optional) Hybrid augmentation: also get BM25 hits and merge
    bm25_results = text_index.search(user_query, top_k=top_k*2)
    candidates = merge_results(raw_results, bm25_results)
    # 3. Filter by access permissions (example: ensure user has access to doc)
    user_allowed = [doc for doc in candidates if authorize(user, doc.metadata)]
    # 4. Rank candidates by combined score or via cross-encoder (simplified here)
    ranked = sorted(user_allowed, key=lambda doc: doc.score, reverse=True)
    return ranked[:top_k]

This pseudo-code demonstrates a multi-step retrieval: we get vector-based results, optionally mix in lexical results, enforce a security filter (authorize), then sort by score. The output is the top K chunks that will go into the prompt for answer generation.
Optimizing for speed: For interactive systems, retrieval must be very fast (tens of milliseconds ideally). Some optimizations we applied include:
- Pre-normalizing queries and documents (e.g., lowercasing, removing stopwords for BM25) so that search is efficient.
- Ensuring the vector index is properly indexed in memory — Weaviate and similar systems typically keep the vector graph in RAM for speed. We monitor memory usage to size the hardware appropriately.
- Using parallelism: if we do both vector and keyword searches, we execute them in parallel threads and then combine results (see the sketch after this list).
- Tuning the top_k: we don’t want to fetch too many results unnecessarily. Fetch a bit more than you need for final context (to allow some room for reranking), but avoid a very large K, which would slow down the query.
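A minimal sketch of running the two searches concurrently; vector_search and keyword_search are stand-ins for the actual index clients, and merge_results is the same combiner used in the retrieval function above:

from concurrent.futures import ThreadPoolExecutor

def parallel_hybrid_search(user_query, q_vec, top_k=10):
    # Launch the vector and BM25 searches at the same time, then merge candidates
    with ThreadPoolExecutor(max_workers=2) as pool:
        vec_future = pool.submit(vector_search, q_vec, top_k * 2)
        kw_future = pool.submit(keyword_search, user_query, top_k * 2)
        return merge_results(vec_future.result(), kw_future.result())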
With relevant context chunks retrieved, DeepSeek Copilot is ready to feed them into the generative model. But before we move on, it’s worth noting that effective retrieval is one of the best defenses against hallucination. If the system finds the correct answer in the documents, that’s half the battle — the model then just needs to present it. Conversely, if the answer isn’t in the index, no amount of prompt engineering will fully save you. Next, we discuss how we guide the model to use this retrieved information and nothing else.
Mitigating Hallucinations and Ensuring Factual Accuracy
One of the greatest concerns for enterprises adopting generative AI is the phenomenon of hallucination — the AI model fabricating an answer that sounds plausible but is actually incorrect or not grounded in the data. In an enterprise context, hallucinations can be more than just embarrassing; they can lead to misinformation given to customers or bad internal decisions. DeepSeek Copilot’s entire design is oriented toward minimizing hallucinations: the RAG approach itself is a mitigation, and we add extra guardrails to further reduce the chance of an ungrounded response.
Retrieval as the primary grounding: By supplying the language model with relevant excerpts from our knowledge base, we dramatically improve factual accuracy. The model is no longer forced to rely on parametric memory alone; instead, it has the company’s actual documents at its fingertips. Empirically, we found that when the retrieved chunks indeed contain the answer, the model will usually incorporate that text and get the answer right (often paraphrasing or summarizing from the sources). This aligns with the “open-book test” analogy — the model performs much better when allowed to reference real data. However, if the retrieval step fails (either nothing relevant was found, or the query is outside the scope of the knowledge base), the model might revert to its training data or guess, which is dangerous. Our strategy to handle this has several facets:
- Prompt grounding and instructions: We use carefully designed prompt templates when calling the LLM. The prompt always includes a system message that explicitly instructs the model to use the provided context and not to invent information. For example, our system message says “You are an enterprise assistant with access to internal knowledge. Answer the user’s question truthfully using the provided documents. If you are not confident, say you cannot find the answer.” Then we insert the retrieved chunks (with clear separators or labels) followed by the user’s question. By reminding the model of its role and limitations (only use the given data), we reduce creative embellishments. Retrieval-augmented generation literature often calls this grounded prompting, and it’s a key technique to keep the model on the rails.
- Limiting the temperature (creativity): We typically call the LLM with a relatively low temperature setting (e.g., 0 or 0.2) for these factual Q&A prompts. A high temperature can induce more diverse or creative outputs - great for brainstorming, but not for factual recall. With temperature zero, the model behaves more deterministically and sticks closer to the data it was given.
- Answer verification and refusal: DeepSeek Copilot is programmed to not answer when it doesn’t have high-confidence support from documents. If the retrieval pipeline returns nothing useful (for instance, the similarity scores were all below a threshold, or after reading the top passages the model still isn’t sure), we prefer the assistant to admit it doesn’t know. This is crucial: “I’m sorry, I don’t have that information.” is a far better outcome than confidently giving a wrong answer. Implementing this involves two pieces (a minimal sketch follows this list):
1. We set a similarity score cutoff: if the top result’s relevance score is very low, the system treats the query as unsupported. We also consider metadata — e.g., if a user asks about a very recent event and our data is outdated.
2. We include a conditional in the prompt or in post-processing: if the model’s draft answer does not appear to use the provided sources or seems to contradict them, we can trigger a second pass or a safe completion. In practice, detecting this automatically is hard, but we do parse the answer for any reference to unknown facts. A simpler approach is to encourage the model (through the prompt) to output a brief citation or reference to the sources for each answer. If it’s unable to do so, that’s a red flag.
- Multi-step interaction for clarification: Sometimes hallucinations occur from misunderstanding the query. If the question is ambiguous or too broad, instead of guessing, the copilot can ask a clarifying question. This turns one prompt-response into a short dialog, but it’s worthwhile if it avoids a wrong answer. We allow the system to respond with a question like, “I have information on several topics related to that. Could you clarify what specifically you’re looking for?” rather than forcing an answer. This design aligns with good conversational AI practice and safety.
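To make the grounding and refusal behavior concrete, here is a minimal sketch of a build_prompt-style helper with a similarity cutoff. The threshold value, the attribute names on retrieved chunks, and the exact wording of the system message are illustrative, not the production prompt.

MIN_RELEVANCE = 0.75  # illustrative cutoff, tuned empirically in practice

SYSTEM_MESSAGE = (
    "You are an enterprise assistant with access to internal knowledge. "
    "Answer the user's question truthfully using the provided documents. "
    "If you are not confident, say you cannot find the answer."
)

def build_grounded_prompt(chunks, question):
    # Refuse up front if nothing retrieved clears the relevance bar
    if not chunks or chunks[0].score < MIN_RELEVANCE:
        return None  # caller answers: "I'm sorry, I don't have that information."
    context = "\n\n---\n\n".join(
        f"[Source: {c.metadata.get('source_file', 'unknown')}]\n{c.text}" for c in chunks
    )
    return f"{SYSTEM_MESSAGE}\n\nDocuments:\n{context}\n\nQuestion: {question}\nAnswer:"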
Despite all precautions, no system is 100% free of mistakes. But our experience has shown that a well-built RAG system can significantly reduce hallucinations compared to a standalone LLM. In internal tests, DeepSeek Copilot answered over 85% of factual queries with perfect accuracy, and when it didn’t know something, it correctly deferred or asked for clarification in the majority of cases. Compare this to a vanilla LLM on the same queries, which might confidently answer incorrectly 30% of the time. RAG isn’t magic — if your documents themselves have errors, the model might reflect those — but it anchors the model’s output in reality.
As a final note, an often overlooked aspect of factuality is maintaining up-to-date data. A hallucination can occur simply because the model is unaware of a recent change (e.g., a policy updated last week). Our ingestion pipeline’s continuous updates help address this, but it requires organizational discipline to feed new data promptly. In essence, the RAG system is only as current as the data you’ve given it. DeepSeek Copilot’s design, with automated ingestion and clear “I don’t know” behavior for out-of-index queries, ensures that if something isn’t in the knowledge base, the user is not misled — instead, the system can escalate the question or suggest an action (like contacting a human expert or adding new content for next time).
Having covered data and logic, we now turn to the equally important aspect of operationalizing this system. How do we monitor it in production, scale it to many users, and protect the data it’s built on?
Observability and Monitoring in Production
A production-grade AI assistant must be treated like any mission-critical service: we need to monitor its health, performance, and usage continuously. In DeepSeek Copilot, we bake in observability from day one. This allows us to answer questions such as: Are queries being answered within our latency targets? What’s the failure rate of the retrieval calls? How often does the LLM refuse to answer due to low confidence? Which parts of the pipeline are bottlenecks? Moreover, for an AI system, observability isn’t just about infrastructure metrics — it’s also about tracing the decision process for each query, which is key for debugging and auditing the AI’s behavior.
Logging and tracing the pipeline: DeepSeek’s architecture has multiple components (ingestion, embedding service, vector DB, LLM service, etc.) that form a pipeline for each user query. We implement distributed tracing so that a single user query can be tracked across services. Each query gets a unique trace ID. As it flows through the system — vector search, model generation, etc. — each step logs an event with that trace ID. We leverage an open-source LLM tracing platform called Langfuse to record these events and visualize traces. For instance, when the LLM generates a response, we capture metadata like the prompt tokens, output tokens, model name, and even the embedding IDs of context used. Langfuse’s Python SDK provides a convenient decorator @observe() that we wrap around functions in the pipeline to automatically log their execution as a span in the trace. Below is an illustrative example of how we instrument the generation step:
import time
from langfuse import Langfuse, observe

langfuse = Langfuse(api_key=API_KEY, host_url="https://cloud.langfuse.com")

@observe()  # this will create a span for the model generation
def generate_answer(prompt, model="gpt-4"):
    start = time.time()
    # Call the LLM (could be OpenAI, Azure, or local model)
    completion = llm_client.generate(model=model, prompt=prompt)
    answer = completion.text
    # Log token usage and latency
    usage = {
        "prompt_tokens": completion.prompt_tokens,
        "response_tokens": completion.response_tokens,
        "total_cost": completion.cost  # if available, cost in dollars
    }
    # Attach usage stats to the current trace span
    langfuse.context.update_current_observation(usage=usage, model=model)
    duration = time.time() - start
    print(f"Generated answer in {duration:.2f}s, model={model}")
    return answer

# Example of how a full query handling function might tie together observed steps
@observe()
def answer_query(user_id, query):
    relevant = retrieve_relevant_chunks(query)
    prompt = build_prompt(relevant, query)
    answer = generate_answer(prompt)
    return answer
In this snippet, any call to answer_query will produce a structured trace consisting of sub-spans for retrieval (retrieve_relevant_chunks), prompt building, and the LLM generation. The Langfuse observer automatically times these and logs them to a central dashboard. We explicitly update the generation span with token usage and model info, which Langfuse will record. Having such detailed traces means that if a user reports an incorrect answer, we can inspect exactly what documents were retrieved and what prompt was given to the model - invaluable for debugging. It also helps identify issues like the retrieval taking 5 seconds (maybe our vector DB is overloaded) or the LLM call hitting a timeout or rate limit error.
Metrics collection: In addition to tracing individual queries, we collect aggregate metrics:
- Latency metrics: p95 and p99 latency for the end-to-end answer, as well as breakdown by stages (e.g., median vector search time, LLM generation time). This is instrumented via Prometheus metrics in each service (e.g., a histogram for query latency; see the sketch after this list). We set SLOs, such as “95% of questions answered in <2 seconds”.
- Throughput and usage: number of queries per minute, number of active users, etc. We also track token consumption over time (useful to forecast API costs when using external models).
- Error rates: any errors or exceptions in the pipeline are logged and counted. For example, vector DB query failures, timeouts from the LLM API, or parsing errors in ingestion are all recorded. We want to catch issues early (e.g., if the embedding service goes down and all queries start failing at the retrieval step).
- Answer quality signals: this one is harder to measure automatically, but we approximate by capturing user feedback when available. If the application UI has a thumbs-up/down or if users rephrase the question (could indicate the first answer wasn’t satisfying), we log those events too.
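A minimal sketch of the latency and error instrumentation with the prometheus_client library (metric names and buckets are illustrative):

from prometheus_client import Counter, Histogram

QUERY_LATENCY = Histogram(
    "copilot_query_latency_seconds",
    "End-to-end answer latency in seconds",
    buckets=(0.25, 0.5, 1, 2, 5, 10),
)
QUERY_ERRORS = Counter("copilot_query_errors_total", "Failed queries", ["stage"])

def handle_query(user_id, query):
    with QUERY_LATENCY.time():  # records elapsed time into the histogram
        try:
            return answer_query(user_id, query)
        except Exception:
            QUERY_ERRORS.labels(stage="pipeline").inc()
            raise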
We use a combination of Grafana dashboards (for numeric metrics like latency, QPS) and the Langfuse UI (for trace-level inspection) to monitor DeepSeek Copilot. The observability stack is just as critical as the AI model — it gives us confidence to deploy and maintain the system. As one practitioner noted, advanced observability provides a centralized view of performance metrics (response times, error rates, usage) so issues can be identified and addressed quickly. In our case, early on we noticed a spike in errors via our tracing system — the cause was an upstream data source that changed format and was breaking the ingestion parser. Because we had good logs, we pinpointed the issue and fixed the parsing logic before it impacted many users.
Another aspect of observability is auditing. In certain regulated industries, it’s important to have a record of what information was provided by the AI, especially if it could influence decisions. Our logging ensures we have a history of Q&A pairs (and their source context) that can be reviewed if needed. We also take care to scrub or protect any personally identifiable information in logs, since queries might contain such data.
In summary, by instrumenting DeepSeek Copilot with thorough logging, tracing, and metrics, we gain real-time insights and the ability to debug and improve the system continuously. This reduces the operational risk of deploying a complex AI service.
Scalability and Performance Engineering
A successful enterprise copilot might start with a pilot for one team, but if it proves useful, usage can skyrocket as it’s rolled out company-wide or extended to customer channels. DeepSeek Copilot was built with scalability in mind — both in terms of handling a high volume of queries and accommodating growth in the underlying knowledge base. We leveraged modern cloud architecture practices to ensure the system can scale horizontally and remain reliable under load.
Microservices and containerization: We split the RAG system into decoupled components that can be scaled independently. The major services in DeepSeek Copilot include:
- Ingestion Service — handles document processing and indexing (offline, scalable as needed for throughput).
- Vector Index Service — encapsulates the vector database (Weaviate instance). This can be a clustered deployment for sharding data if needed.
- Embedding Service — an API that generates embeddings for new queries (and potentially for documents, though those are usually done offline). This could wrap a GPU-based model or call an external embedding API.
- Retrieval API — a service that accepts user queries, calls the vector DB (and maybe a keyword search index), and returns retrieved texts.
- LLM Generation Service — a service (could be just a wrapper around an external API, or a hosted model server) that takes the final prompt and returns the LLM’s answer.
- Coordinator or API Gateway — in our case, the main application server that orchestrates the above for each request and implements any business logic, authentication, etc. For instance, a FastAPI or Flask app that your chatbot frontend calls, which internally calls the Retrieval API then the LLM service and so forth.
By containerizing each of these (using Docker images) and deploying them to Kubernetes, we gain a lot of flexibility. We can scale out the components that are bottlenecks — for example, if the LLM service is CPU/GPU intensive, we might run multiple replicas behind a load balancer. Kubernetes makes it straightforward to define deployments for each service, and we configured Horizontal Pod Autoscalers (HPA) to adjust replica counts based on metrics. For instance, the HPA might keep the LLM service at 2 pods normally but scale up to 6 pods if CPU or GPU utilization stays high for a few minutes during peak load.
Kubernetes deployment example: Here is a fragment of a Kubernetes deployment (in YAML) for the DeepSeek LLM Service using a hypothetical local model:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-llm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: deepseek-llm
  template:
    metadata:
      labels:
        app: deepseek-llm
    spec:
      containers:
        - name: llm-server
          image: myregistry/deepseek-llm:latest
          resources:
            limits:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: 1  # request a GPU if using a GPU model
          env:
            - name: MODEL_NAME
              value: "vicuna-13b-v1.5"
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 15
In this deployment spec, we set up 2 replicas of an LLM server (which could be running a prototype model like Vicuna 13B or invoking an external API). We include resource limits (if using GPUs, we specify the GPU resource), and we define a readinessProbe so Kubernetes knows when a pod is ready to receive traffic. Similar manifests exist for the other components like the Retrieval API and the vector DB (e.g., Weaviate can run in a Deployment with persistent volume claims for its data). We also use Kubernetes Secrets to store sensitive config like API keys (for calling OpenAI or others) and mount them as environment variables, rather than hardcoding credentials. This aligns with best practices to secure configuration.
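To pair with this Deployment, a Horizontal Pod Autoscaler along these lines keeps the LLM service at 2 replicas and scales toward 6 under sustained CPU pressure (the utilization target is illustrative; GPU-based scaling would need a custom metric):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: deepseek-llm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepseek-llm
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70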
Infrastructure as Code and CI/CD: We treat the deployment configuration as code (using Helm charts and Terraform for cloud resources). The entire DeepSeek Copilot stack can be deployed reproducibly on a new cluster with a Helm release — our charts define all the Kubernetes objects (deployments, services, HPAs, etc.) for each microservice. Terraform scripts provision any required cloud services, such as a managed database or blob storage for document files, and set up networking (ingress, DNS entries for the API endpoint, etc.). This approach is crucial for consistency between environments (dev, staging, prod) and for quickly setting up new installations of the system. Whenever we update the code (say we improve the retrieval logic), our CI pipeline runs tests (including integration tests where possible), builds new Docker images, and updates the Helm chart. Through GitOps or a continuous deploy pipeline, these changes roll out to the Kubernetes cluster. By investing in these deployment artifacts, we ensure that improvements to the RAG system can be delivered to production reliably and frequently — not stuck in “lab” because operations is too hard. In enterprise scenarios, having Terraform and Helm artifacts ready can also ease the path for internal approval, since infrastructure teams can review exactly what will be deployed.
Scalability of the vector store: As the document corpus grows, the vector database needs to scale as well. We chose a horizontally scalable vector DB (Weaviate offers sharding and replication), so we can add nodes to the cluster if our vector count goes from, say, 1 million to 100 million. Alternatively, one could partition data by topic or team into multiple indexes if that makes semantic sense, though this complicates retrieval slightly. It’s important to monitor vector search performance as data grows — we track the query latency and index refresh times. If we started with a smaller setup (one instance on modest hardware), we remain ready to migrate to a more powerful configuration or a managed service to keep latency low. Techniques like clustering or IVF (inverted file index for vectors) can also be applied to keep ANN search efficient at scale, if the chosen DB supports them. In our case, scaling up to a cluster of 3 nodes with HNSW index was sufficient for our needs, maintaining sub-100ms retrieval times even as the index size doubled.
Concurrency and throughput considerations: Scaling horizontally helps handle more concurrent queries. Each LLM generation can be time-consuming (hundreds of milliseconds to a few seconds), and if using a large model, it may not handle many parallel requests on one instance. By load-balancing across multiple instances or using a multi-threaded model server that can handle concurrent requests, we increased throughput. We also utilize request queues and timeouts — if a surge of queries comes in, we queue them with a capped waiting time to avoid overwhelming the system or incurring huge tail latencies. In some cases, we degrade gracefully: for example, if the vector search is slow, we might return a message like “System is busy, please retry” rather than timing out after a long wait.
Geographical scaling: For global organizations, one might deploy instances of the RAG system in multiple regions (to serve local offices with lower latency and to meet data residency requirements). DeepSeek Copilot’s design, being containerized and coded as infrastructure, would allow deploying separate stacks in different regions that each index the relevant data (or a global data store if legally permissible). This ensures users everywhere get a snappy experience.
Caching opportunities: As usage patterns emerge, we can introduce caching for frequent queries. For instance, if many users ask “What’s the holiday schedule?”, the answer (once generated) can be cached for a short period (say 1 hour) and reused, since the policy doesn’t change often. We built a simple in-memory cache at the API layer for exact query matches, which helped reduce load on the LLM for extremely common queries. Of course, cache invalidation must be tied to content updates (if the holiday schedule changes, flush that cache entry). This is a minor optimization but worth mentioning for completeness.
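A sketch of that exact-match cache using cachetools (the key scheme and TTL are illustrative; in a multi-user setting the key should also reflect the user’s entitlements so cached answers never cross permission boundaries):

from cachetools import TTLCache

answer_cache = TTLCache(maxsize=1024, ttl=3600)  # entries expire after ~1 hour

def cached_answer(user_id, query):
    key = query.strip().lower()  # exact-match on the normalized query text
    if key in answer_cache:
        return answer_cache[key]
    answer = answer_query(user_id, query)
    answer_cache[key] = answer
    return answer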
In practice, our production deployment of DeepSeek Copilot runs on a Kubernetes cluster with autoscaling, and we’ve observed it handle increases in load smoothly by adding pods. We also simulate peak loads (load testing) to verify the system can scale to the expected user base. The result is a system that can grow from handling dozens of queries per day to thousands per hour without major re-architecture — we just allocate more resources and let the orchestrator manage them.
Security and Compliance Considerations
Enterprise applications live in a world of strict security, privacy, and compliance requirements. An AI copilot is no exception — in fact, it raises unique concerns since it deals with potentially sensitive internal knowledge and user queries that might contain confidential information. When we say “without compromise,” we also mean that DeepSeek Copilot does not compromise on the organization’s security policies. From data ingestion to model serving, every layer is designed with safeguards.
Data privacy and access control: One key advantage of building a RAG system in-house is that all proprietary data stays within the company’s control. DeepSeek Copilot’s pipeline runs in our private cloud environment; documents are stored in secured storage and the vector database, and we chose to use a self-hosted LLM model for generation (after initial prototyping) so that we are not sending sensitive prompts to an external API. This approach ensures compliance with data privacy regulations and internal policies. For example, if there are documents that contain personal employee data or trade secrets, they never leave our isolated network. In cases where we do use third-party APIs (like OpenAI for embeddings in early stages), we vet the provider’s data handling policies and often use data anonymization or encryption for the content being sent if possible.
At query time, access control is enforced. Not every user should be able to retrieve every document indexed. We include user identity (from the authentication token or session) and check it against document permissions metadata. The retrieval API filters results by user entitlements — if a chunk is from a document the querying user has no rights to (e.g., a finance report restricted to CFO office), it will be skipped. This is critical in a multi-user enterprise scenario; otherwise the RAG system could become a loophole to access unauthorized information. Our vector database supports metadata filtering queries, so we simply add a filter like where "access_level" IN user.roles in the vector search. Additionally, we isolate certain highly sensitive datasets into separate indexes entirely, which only a specific service account can query when needed. These measures align with the principle of least privilege.
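As a sketch of that entitlement filter with the v3 Weaviate Python client (the access_level property, the class name, and the Equal operator on a single value are simplifications of the IN-style role check described above):

acl_filter = {
    "path": ["access_level"],
    "operator": "Equal",
    "valueText": "employee",  # in practice derived from the authenticated user's roles
}

results = (
    client.query
    .get("DocumentChunk", ["text", "source_file"])
    .with_near_vector({"vector": list(q_vec)})
    .with_where(acl_filter)
    .with_limit(10)
    .do()
)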
Authentication and API security: The front-end interface (whether it’s a chat UI or an API endpoint that integrators call) is secured via our single sign-on (SSO). Users must be authenticated to interact with the copilot, ensuring that we can enforce the access controls as mentioned. We log all query access with user IDs for audit purposes. If the copilot is exposed via an API, we use secure tokens and rate limiting to prevent abuse or DDoS-like scenarios. We also ensure all network communication is encrypted (HTTPS for all service calls). Within Kubernetes, services communicate over the cluster network; for additional security, one could use service mesh encryption or at least network policies to restrict which pods can talk to the vector DB, etc., reducing the blast radius if any component were compromised.
Prompt injection and output filtering: A newer security issue particular to AI systems is prompt injection, where a user might try to manipulate the model with crafted inputs to reveal confidential info or bypass safety instructions. For DeepSeek Copilot, our use of RAG actually helps here — because we largely constrain the model to use provided context, it’s less likely to go off-script due to a malicious prompt. Nonetheless, we implemented a few safeguards:
- We sanitize user input to remove any hidden instructions that could interfere with our system prompt. Essentially, our system prompt is always prepended, and user input is treated as user content only.
- The model we use (or the OpenAI API) has its own content filters. We ensure they remain enabled to block certain categories of output (hate speech, etc.) should that ever be triggered.
- We added a post-processing step that examines the answer before returning it to the user. If the answer contains certain sensitive patterns (for instance, it tries to dump a large chunk of raw internal data or code that wasn’t asked for, possibly indicating an injection attempt), we either redact that or refuse the answer. Such cases are fortunately very rare in a closed-domain assistant, but we remain vigilant.
Compliance logging and policies: For industries under compliance regimes (HIPAA, GDPR, etc.), it’s important to know what data is being used and how. Our ingestion process can tag data with classifications (public, confidential, secret) and the system can be configured to exclude certain classes from being indexed at all if needed. If a user query asks for personal data, the system can either refuse or mask the answer according to privacy rules. For GDPR compliance, if an EU user’s data is in the index, we have procedures to delete those vectors upon request (and re-index). All these tie into general data governance; the RAG system should be an extension of your existing data governance practices. We consulted our compliance team while building DeepSeek Copilot to ensure, for example, that logs of user queries don’t inadvertently become a data leakage vector (hence, we might omit or hash PII in logs).
Built-in security vs. custom: It’s worth noting that when you build a custom RAG solution like this, you shoulder the responsibility for implementing security and compliance controls. Many commercial RAG platforms tout built-in security features — indeed, if you buy a solution, it may come with certifications (SOC2, ISO27001) and admin dashboards for access management out-of-the-box. In our case, we leveraged our existing security infrastructure (SSO, cloud security groups, etc.) to meet those needs, but it did require careful design. We consider the extra effort worthwhile for the level of control and customization we get. Each enterprise must weigh this: do you have the in-house expertise to secure the system end-to-end? If yes, building offers ultimate flexibility. If not, a vendor product might fill the gaps but at the cost of black-box elements.
At the end of the day, DeepSeek Enterprise Copilot operates within our trusted environment, on our data, serving our users, with multiple layers of defense to prevent misuse. We achieved a system that adheres to our infosec checklist: data encryption at rest and in transit, fine-grained access control, audit logging, and compliance alignment. With that, we can confidently deploy it in production.
Build vs. Buy: Weighing the Options for Enterprise RAG
Given the complexity we’ve described, a natural question for any technology leader is: does it make sense to build our own RAG platform like DeepSeek Copilot, or should we purchase a commercial solution? This is a classic Build vs. Buy dilemma, and the answer isn’t one-size-fits-all. It depends on your organization’s priorities, resources, and use case. We’ll outline the major considerations to help you make an informed decision.
**Advantages of building i