The Wake-Up Call
I deployed my first AI system to production. Within the first hour, it crashed several times.
The error logs were a nightmare: relation "documents" does not exist, Dense vectors must contain at least one non-zero value, 429 Too Many Requests, CORS policy: No 'Access-Control-Allow-Origin'. Every fix revealed three new problems.
Most RAG tutorials end at "it works on localhost." They skip the brutal reality: rate limits, CORS hell, database migrations, API quota exhaustion, and the 3 AM debugging sessions that come with real production systems.
This isn't that kind of tutorial.

I'm Blessing, a junior AI engineer from Lagos, Nigeria. This was my first production AI system, and I documented every failure, every panic moment, and every "why didn't the tutorial mention THIS?" frustration.

Here's what you'll learn:

- Why my embeddings worked locally but failed in production
- The cascade of failures that happens when one service hits quota
- How I went from "no relevant information found" on every query to a 90% success rate
- Real code and architecture decisions (not theory)
- Actual production metrics and costs

If you're building your first production AI system, this post might save you 47 crashes and countless hours of debugging. Let's dive into what actually happened.
What I Built (And Why It Matters)

The System: A RAG (Retrieval-Augmented Generation) Document Q&A application where users upload PDFs, DOCX, or TXT files, then ask questions in plain English and get AI-generated answers with source citations.

Why RAG? Traditional LLMs hallucinate - they confidently make things up. RAG solves this by grounding responses in YOUR actual documents. Upload your company's 500-page policy manual, ask "What's our remote work policy?" and get an accurate answer with the exact page reference.
Real-world impact: Instead of Ctrl+F through dozens of files, users get conversational answers in 2-4 seconds.
Try it live: @URL
The Tech Stack (And Why I Chose Each)

Frontend:
- React + TypeScript + Tailwind CSS
- Deployed on Vercel
- Why: Fast dev experience, automatic deployments, global CDN

Backend:
- FastAPI (Python)
- Deployed on Railway
- Why: Async support, automatic API docs, simpler than AWS

Databases:
- PostgreSQL (document metadata)
- Pinecone (vector embeddings)
- Why: Pinecone serverless = no infrastructure management

AI Services:
- Google Gemini 2.0 Flash (answer generation)
- Cohere embed-v3 (embeddings)
- Why: Gemini's free tier (15K requests/month) vs OpenAI's limited free trial

Authentication:
- Clerk (JWT-based)
- Why: Drop-in solution, handles edge cases
The Architecture

```
            User
             │
             ▼
 React Frontend (TypeScript) — Vercel
             │  HTTPS + JWT
             ▼
 FastAPI Backend (Async Python) — Railway
     │          │            │
     ▼          ▼            ▼
 Pinecone   PostgreSQL   VirusTotal
 (vectors)  (docs)       (scanner)
     │
     ▼
 Gemini (primary) / Cohere (fallback)
```

The Flow:
1. User uploads document → virus scan → PostgreSQL record
2. Background task extracts text → chunks (1000 chars, 100 overlap) - a minimal chunking sketch follows this list
3. Gemini generates embeddings (768-dim vectors)
4. Store in Pinecone with metadata
5. User asks question → Gemini embeds query
6. Pinecone finds the top 5 similar chunks (cosine similarity)
7. Gemini generates answer from retrieved context
8. Return answer with source citations
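Roughly what the chunking step looks like - a minimal sketch of character-based splitting with overlap (the real pipeline also attaches metadata to each chunk so answers can cite their source):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks so context isn't lost at chunk boundaries."""
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```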
Simple in theory. Brutal in practice.
Crash #1: "Dense Vectors Must Contain Non-Zero Values"
What happened: My first upload to Pinecone failed instantly. Error: Dense vectors must contain at least one non-zero value
The mistake: I was using dummy embeddings for testing:
```python
# ❌ WRONG - What I did initially
embeddings = [[0.0] * 768 for _ in chunks]
```

Pinecone rejected them because zero vectors have no semantic meaning - you can't calculate similarity with nothing.
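A toy illustration of why: cosine similarity divides by each vector's norm, and a zero vector's norm is zero, so the score is undefined:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([0.1, 0.3]), np.array([0.2, 0.1])))  # a normal score
print(cosine_similarity(np.zeros(2), np.array([0.2, 0.1])))           # nan - division by zero
```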
What I tried:
- Used Google Gemini embeddings → hit quota limit (1500/day free tier had... 0 available)
- Switched to Cohere → hit their 96-text limit per request
- Tried batch processing → hit the 100K tokens/minute rate limit
The solution:

```python
import time
from typing import List

def generate_embeddings(self, texts: List[str]) -> List[List[float]]:
    """Generate embeddings with batching and rate limiting."""
    all_embeddings = []
    batch_size = 96  # Cohere's limit per request

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = self.cohere_client.embed(
            texts=batch,
            model='embed-english-v3.0',
            input_type='search_document',
            embedding_types=['float']
        )
        all_embeddings.extend(response.embeddings.float_)

        # Rate limiting: 6-second delay between batches
        if i + batch_size < len(texts):
            time.sleep(6)

    return all_embeddings
```
Result: Successfully processed 1000-chunk documents in ~60 seconds.

Lesson: Always test with real API responses, not mocked data. Dummy values that work locally will fail in production.
Crash #2: "No Relevant Information Found" (The Cascade)
What happened: Every single query returned "no relevant information found" despite successful uploads. This was the most frustrating bug. Documents uploaded fine. No errors. But queries found... nothing.
The investigation:

Step 1: Checked Pinecone console
- Result: 0 vectors stored
- Realization: Embeddings weren't being saved!
Step 2: Checked upload logs
Found this in my code:
```python
embedding = embedding_service.generate_embedding(text)  # ❌ WRONG
```

I was calling the SINGULAR method (for one text) instead of the plural method (for batches).
Step 3: Fixed the method, still failed
Error: 403 Your API key was reported as leaked
My Gemini key had been exposed (hardcoded in .env.example that I committed to GitHub). Google auto-blocked it.
Step 4: Regenerated all API keys
- Gemini: 768-dim embeddings
- Cohere: 1024-dim embeddings
- Pinecone index: 1024-dim (originally created for Cohere)
Step 5: New error

Vector dimension 768 does not match the dimension of the index 1024

The Pinecone index had been created for Cohere (1024-dim), but I was now using Gemini (768-dim). They're incompatible.
The solution:
1. Deleted the Pinecone index
2. Created a new index with 768 dimensions for Gemini (see the sketch after this list)
3. Implemented a dual-fallback embedding system
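Recreating the index with the right dimension is a one-time call - a sketch using the current Pinecone serverless client (the index name and region are illustrative, and the API key is assumed to come from the environment):

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=PINECONE_API_KEY)
pc.create_index(
    name="documents",
    dimension=768,  # must match Gemini's text-embedding-004 output
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
```

And the dual-fallback embedding service itself: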
```python
def generate_embedding(self, text: str) -> List[float]:
    """Generate embedding - Gemini first, Cohere fallback."""
    # Try Gemini (15K free/month)
    if self.gemini_api_key:
        try:
            result = genai.embed_content(
                model="models/text-embedding-004",
                content=text,
                task_type="retrieval_query"
            )
            return result['embedding']
        except Exception as e:
            logger.warning(f"Gemini failed: {e}, trying Cohere...")

    # Fallback to Cohere (100 free/month)
    if self.cohere_api_key:
        try:
            response = self.cohere_client.embed(
                texts=[text],
                model="embed-english-v3.0",
                input_type="search_query",
                embedding_types=["float"]
            )
            return response.embeddings.float_[0]
        except Exception as e:
            logger.error(f"Both services failed: {e}")
            return None

    return None
```
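The answer-generation half of the flow isn't shown above, so here's a minimal sketch of how the retrieved chunks get stitched into the prompt (the index handle and the "text" metadata key are assumptions about the service layer, not the exact production code):

```python
import google.generativeai as genai

def answer_question(query: str, query_embedding: list, index) -> str:
    # Top 5 most similar chunks by cosine similarity
    results = index.query(vector=query_embedding, top_k=5, include_metadata=True)
    context = "\n\n".join(match.metadata["text"] for match in results.matches)

    # Ground the answer in the retrieved context only
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context doesn't contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    model = genai.GenerativeModel("gemini-2.0-flash")
    return model.generate_content(prompt).text
```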
Result: Query success rate jumped from 0% to 90%.

Lesson: API quotas will hit you when you least expect it. Always have a fallback provider. Never commit API keys, even in example files.
Crash #3: "Relation 'documents' Does Not Exist"
What happened: Deployed to Railway. Backend started. Made first API call. Instant crash.

```
psycopg2.errors.UndefinedTable: relation "documents" does not exist
```

The mistake: I assumed Railway would auto-create my database tables like my local SQLite did. It didn't.
What I learned:
- Local development: SQLAlchemy created tables automatically
- Production PostgreSQL: fresh database, zero tables
- Alembic migrations: not configured for Railway deployment
The solution: Manually created tables via Railway's PostgreSQL CLI:

```sql
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    user_id VARCHAR(255) NOT NULL,
    filename VARCHAR(255) NOT NULL,
    original_filename VARCHAR(255),
    file_path VARCHAR(500),
    file_size INTEGER,
    file_type VARCHAR(50),
    extracted_text TEXT,
    page_count INTEGER,
    chunks JSON,
    chunk_count INTEGER,
    embedding_model VARCHAR(100),
    embedding_dimension INTEGER,
    status VARCHAR(50) DEFAULT 'processing',
    upload_date TIMESTAMP DEFAULT NOW(),
    processed_date TIMESTAMP,
    is_deleted BOOLEAN DEFAULT FALSE
);

CREATE INDEX idx_documents_user_id ON documents(user_id);
CREATE INDEX idx_documents_status ON documents(status);
```
Better solution (learned after): Set up Alembic migrations properly:

```python
# alembic/env.py
from app.models import Base

target_metadata = Base.metadata
```

Then in Railway:

```
alembic upgrade head
```
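If you'd rather not depend on remembering to run that command, Alembic can also be invoked from Python at startup - a sketch, assuming alembic.ini sits next to the app's entry point:

```python
from alembic import command
from alembic.config import Config

def run_migrations() -> None:
    """Apply any pending migrations before the app starts serving requests."""
    alembic_cfg = Config("alembic.ini")
    command.upgrade(alembic_cfg, "head")
```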
Result: Database tables created, app started successfully.

Lesson: Always test database migrations in a staging environment that mirrors production. Don't assume cloud providers work like localhost.

Crash #4: "Failed to Fetch" (CORS Hell)
What happened: Frontend deployed to Vercel. Backend on Railway. They couldn't talk to each other.

Chrome console:

```
Access to fetch at 'https://backend.railway.app/api/documents/list' from origin 'https://frontend.vercel.app' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present
```
The mistake: My CORS configuration only allowed localhost:
```python
# ❌ WRONG - Only worked locally
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:5173"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```
The solution:
```python
# ✅ CORRECT - Works in production
app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "http://localhost:3000",
        "http://localhost:5173",
        "https://rag-document-qa-system.vercel.app",  # Production frontend
    ],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```
Even better solution (learned later): Use environment variables:
```python
ALLOWED_ORIGINS = os.getenv(
    "ALLOWED_ORIGINS",
    "http://localhost:5173,https://rag-document-qa-system.vercel.app"
).split(",")

app.add_middleware(
    CORSMiddleware,
    allow_origins=ALLOWED_ORIGINS,
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```

Result: Frontend successfully connected to backend.

Lesson: Configure CORS on day 1, not day 20. Test with production URLs before deploying. Use environment variables for flexibility.
Crash #5: Background Tasks Timing Out

What happened: Large documents (1000+ chunks) failed with 504 Gateway Timeout errors.
The problem: Processing was synchronous - upload endpoint waited for:
- Text extraction (5-10 seconds)
- Chunking (2-3 seconds)
- Embedding generation (45-60 seconds for 1000 chunks)
- Pinecone upload (5-10 seconds)
Total: 60-80 seconds. Railway's timeout: 30 seconds.
The solution: Move processing to background tasks.

```python
from fastapi import BackgroundTasks

async def process_document_background(
    document_id: int,
    file_path: str,
    file_extension: str
):
    """Process document asynchronously."""
    from app.database import SessionLocal

    db = SessionLocal()
    try:
        document = db.query(Document).filter(
            Document.id == document_id
        ).first()

        # Extract text
        extraction_result = await text_extraction.extract_text(
            file_path, file_extension
        )

        if extraction_result["success"]:
            # Chunk text
            chunks = chunk_text(
                extraction_result["text"],
                chunk_size=1000,
                overlap=100
            )

            # Generate embeddings
            embeddings = embedding_service.generate_embeddings(chunks)

            # Store in Pinecone
            pinecone_service.upsert_embeddings(
                document_id=document_id,
                chunks=chunks,
                embeddings=embeddings
            )

            document.status = "ready"
        else:
            document.status = "failed"

        db.commit()
    finally:
        db.close()
```
@router.post("/upload") async def upload_document( background_tasks: BackgroundTasks, file: UploadFile = File(...), db: Session = Depends(get_db), user: dict = Depends(get_current_user) ): # Save file and create database record file_path = await file_storage.save_uploaded_file(file)
document = Document(
user_id=user["sub"],
filename=file.filename,
status="processing"
)
db.add(document)
db.commit()
# Queue background processing
background_tasks.add_task(
process_document_background,
document.id,
file_path,
file.filename.split(".")[-1]
)
return {
"message": "Document uploaded. Processing in background...",
"document_id": document.id,
"status": "processing"
}
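Since the upload now returns before processing finishes, the frontend needs something to poll - a sketch of the kind of status endpoint I'd pair with it (route name and response fields are illustrative):

```python
from fastapi import HTTPException

@router.get("/{document_id}/status")
async def get_document_status(
    document_id: int,
    db: Session = Depends(get_db),
    user: dict = Depends(get_current_user)
):
    document = db.query(Document).filter(
        Document.id == document_id,
        Document.user_id == user["sub"]
    ).first()
    if not document:
        raise HTTPException(status_code=404, detail="Document not found")
    # status is "processing", "ready", or "failed"
    return {"document_id": document.id, "status": document.status}
```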
Result: Upload endpoint returns in <1 second. Processing happens in the background. No timeouts.

Lesson: Any operation taking >5 seconds should be a background task in production. Return immediately, process asynchronously.
The Security Audit That Changed Everything

After getting it "working," I ran CodeRabbit's security review. Result: 17 vulnerabilities found.
2 CRITICAL:
- Hardcoded database password in code
- CORS wildcard (allow_origins=["*"])

5 HIGH:
- No rate limiting (DoS vulnerability)
- No virus scanning on uploads
- No input sanitization
- Missing pagination (could load 10K documents at once)
- SQL injection potential (even with ORM)
The fixes:
Rate Limiting:
```python
from slowapi import Limiter
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address, default_limits=["200/minute"])
app.state.limiter = limiter
```
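Registering the limiter by itself won't enforce anything until slowapi's exception handler (and, for the app-wide default, its middleware) is attached, and any decorated route needs a Request parameter - a sketch of the wiring, with an illustrative tighter limit on the expensive query endpoint:

```python
from fastapi import Request
from slowapi import _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.middleware import SlowAPIMiddleware

app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
app.add_middleware(SlowAPIMiddleware)  # enforces the 200/minute default on every route

@app.post("/api/query")
@limiter.limit("10/minute")  # tighter cap on the LLM-backed endpoint
async def query_documents(request: Request):
    ...
```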
Virus Scanning:
```python
# Integrated VirusTotal API
async def scan_file(file_path: str) -> Dict[str, Any]:
    response = requests.get(VIRUSTOTAL_URL, ...)
    if response.json()["data"]["attributes"]["stats"]["malicious"] > 0:
        return {"is_safe": False}
    return {"is_safe": True}
```
Input Sanitization:
```python
import bleach

query = bleach.clean(request.query.strip(), tags=[], strip=True)
```
Pagination:
@router.get("/list") async def list_documents( skip: int = 0, limit: int = 100, db: Session = Depends(get_db) ): documents = db.query(Document).offset(skip).limit(limit).all() total = db.query(Document).count()
return {
"documents": documents,
"total": total,
"skip": skip,
"limit": limit
}
Result: All 17 vulnerabilities fixed. System production-hardened.

Lesson: Security isn't optional. Code reviews catch what you miss. Production means thinking about malicious users, not just happy paths.
Production Metrics (The Real Numbers)

System Performance:

| Metric | Value |
| --- | --- |
| Average query time | 2.4 seconds |
| Upload processing (100 chunks) | 12 seconds |
| Upload processing (1000 chunks) | 68 seconds |
| Embedding generation (per chunk) | 0.25 seconds |
| Database query time | 45 ms average |
| Pinecone query time | 180 ms average |
API Costs (Monthly):
| Service | Free Tier | My Usage | Cost |
| --- | --- | --- | --- |
| Gemini | 15K requests | ~200/month | $0 |
| Cohere | 100 requests | ~50/month | $0 |
| Pinecone | 1 index, 1M vectors | ~5K vectors | $0 |
| Railway | 500 hours | ~720 hours | $5 |
| Vercel | Unlimited | N/A | $0 |

Total: $5/month for a production AI system.
Success Rates:
- Document uploads: 95% (failures = corrupted files)
- Query responses: 90% (10% = no relevant chunks found)
- Background processing: 92% (8% = text extraction failures)
User Feedback (First Week):
- 17 documents uploaded
- 118 queries processed
- 5 users (mostly testing)
What I'd Do Differently

If I started over tomorrow:

1. Check API quotas FIRST - not after hitting them. Gemini's "free tier" had 0 requests available. Cohere saved me.
2. Set up CORS early - don't wait until deployment fails. Test with production URLs locally.
3. Database migrations from the start - Alembic configuration before the first deployment, not after.
4. Implement background tasks immediately - any operation >5 seconds should be async from the beginning.
5. Security review before deployment - not after. CodeRabbit would've caught issues in development.
6. Use environment variables everywhere - no hardcoded values, even in development.
7. Test with corrupted files - users will upload anything. Test with 1-byte PDFs, empty files, and non-UTF8 text (a minimal test sketch follows this list).
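A minimal sketch of the messy-input tests I wish I'd written on day one, using FastAPI's TestClient (the module path and route prefix are assumptions, and the auth dependency would need to be overridden in a real test suite):

```python
from fastapi.testclient import TestClient
from app.main import app  # assumed entry point

client = TestClient(app)

def upload(name: str, content: bytes, content_type: str):
    return client.post(
        "/api/documents/upload",
        files={"file": (name, content, content_type)},
    )

def test_empty_file_is_rejected():
    assert upload("empty.pdf", b"", "application/pdf").status_code in (400, 422)

def test_one_byte_pdf_is_rejected():
    assert upload("tiny.pdf", b"%", "application/pdf").status_code in (400, 422)

def test_non_utf8_text_does_not_crash():
    assert upload("weird.txt", b"\xff\xfe\x00bad", "text/plain").status_code < 500
```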
Current Limitations & Future Improvements

Known Issues:
- Scanned PDFs return 0 characters (needs OCR)
- Large documents take 60+ seconds to process
Planned Features:
- Streaming responses for better UX
- OCR for scanned PDFs (see the sketch after this list)
- Excel and PowerPoint support
- Semantic caching to reduce API costs
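For the OCR item, the rough shape I have in mind - a sketch assuming pdf2image and pytesseract (plus the poppler and tesseract system packages), not something in the current system:

```python
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(file_path: str) -> str:
    """Fallback for scanned PDFs: render each page to an image and OCR it."""
    pages = convert_from_path(file_path)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)
```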
Key Takeaways

Production AI is 20% algorithms, 80% infrastructure. The biggest lessons:

1. Free tiers lie - "15K requests/month" doesn't mean you get 15K. Check actual quotas.
2. Always have fallbacks - Gemini fails → Cohere backup. Saved my deployment multiple times.
3. Background tasks are non-negotiable - anything >5 seconds will time out in production.
4. Security can't wait - one hardcoded password = complete compromise. Fix it before deploying.
5. CORS will break you - configure it early, test with production URLs.
6. Test with real, messy data - corrupted PDFs, empty files, non-UTF8 text. Users will upload anything.
7. Dimension mismatches are silent killers - 768 vs 1024 dimensions broke everything with no clear error.
The truth about production AI: Tutorials show the happy path. Production is 90% edge cases, rate limits, and error handling.
Try It Yourself

Live Demo: @URL
GitHub: @BLESSEDEFEM
To build something similar:
1. Start with document upload + text extraction (get this working first)
2. Add embeddings locally (test with small files)
3. Deploy the backend before the frontend (easier to debug)
4. Implement CORS from day 1
5. Monitor API quotas obsessively
6. Add background tasks early
7. Run a security audit before deployment
Questions? Open an issue on GitHub or connect with me on LinkedIn.
About the Author

Blessing Nejo - Junior Software & AI Engineer from Lagos, Nigeria

I build production AI systems and document the messy parts that tutorials skip. This was an adventure in hands-on learning: this RAG system taught me more in 3 weeks than months of tutorials.
Currently seeking: Software/AI Engineer roles (remote-first)
Skills: Python, TypeScript, FastAPI, React, PostgreSQL, Vector Databases, Production AI Systems
Connect:
- LinkedIn: Blessing Nejo
- GitHub: @BLESSEDEFEM
- Email: nejoblessing72@gmail.com
- Location: Lagos, Nigeria
Found this helpful? Drop a comment below - I read and respond to every one. Building something similar? I'm happy to review your architecture or debug issues. DM me.
Tags: #AI #MachineLearning #RAG #Python #FastAPI #React #TypeScript #ProductionAI #VectorDatabases #Pinecone #LLM #SoftwareEngineering