Executive Summary
The transition from keyword-based information retrieval to semantic search represents one of the most significant paradigm shifts in data engineering over the last decade. As organizations seek to leverage Large Language Models (LLMs) via Retrieval-Augmented Generation (RAG), the ability to efficiently crawl, embed, and index vast corpora of unstructured data has become a critical competency. However, traditional infrastructure approaches—relying on provisioned virtual machines, long-running Kubernetes clusters, or monolithic server architectures—often struggle to handle the distinct "bursty" nature of mass indexing workloads. A web crawler might sit idle for days and then require thousands of concurrent threads for a few hours; a vector embedding job requires massive GPU throughput for short bursts but is financially ruinous to maintain 24/7.
This report provides an exhaustive technical analysis of architecting a serverless mass-indexing pipeline using Modal for compute orchestration and Vector Databases (specifically analyzing Pinecone and Qdrant) for high-dimensional storage. To facilitate a rigorous examination of these technologies, we introduce a fictional yet realistic application scenario: "DocuVerse," a decentralized technical documentation aggregator. This simulation involves the ingestion of millions of technical documents, requiring a pipeline that is robust, scalable, and cost-efficient.
Our analysis extends beyond simple implementation details to explore second-order implications: the graph-theoretical properties of web crawling (the "Matrix Link"), the economics of ephemeral GPU compute, and the nuances of distributed state management in a stateless environment. Furthermore, bridging the gap between deep engineering and public communication, the report concludes with a comprehensive LinkedIn content strategy, including visual "card" designs and a conceptual mind map of the application, designed to communicate these complex architectures to a professional audience.
Part I: The Paradigm Shift in Search Infrastructure
1.1 The Evolution of Retrieval: From Keywords to Vectors
To understand the necessity of the architectures proposed in this report, one must first appreciate the fundamental limitations of the systems they replace. For decades, the industry standard for search was the Inverted Index—a data structure mapping unique terms to the documents containing them (e.g., Apache Lucene, Elasticsearch). While highly efficient for exact keyword matching, inverted indices suffer from the "lexical gap": they cannot match a query for "automobile" to a document containing "car" unless explicitly synonymized.
The advent of Transformer-based language models (BERT, RoBERTa, and later GPT) introduced Vector Embeddings. In this paradigm, text is transformed into a high-dimensional vector (often 768 to 1536 dimensions) where semantic meaning is encoded in the geometric distance between points. "Car" and "Automobile" end up in the same neighborhood of this vector space.1
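A minimal sketch makes the "same neighborhood" claim concrete. The snippet below (the model name and toy vocabulary are illustrative, not part of DocuVerse) encodes three strings and prints their cosine similarity to the query "automobile":
from sentence_transformers import SentenceTransformer, util

# Any open-source sentence encoder works for this illustration.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = ["car", "automobile", "banana"]
query_vec = model.encode("automobile", normalize_embeddings=True)
doc_vecs = model.encode(docs, normalize_embeddings=True)

# "car" scores far closer to "automobile" than "banana" does, even though the
# strings share no tokens; this is exactly the gap an inverted index cannot close.
for doc, score in zip(docs, util.cos_sim(query_vec, doc_vecs)[0]):
    print(f"{doc:12s} {float(score):.3f}")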
This shift changes the fundamental resource requirements of the indexing pipeline:
CPU to GPU Shift: Inverted indexing is I/O and CPU bound (tokenization). Vector indexing is compute-bound, requiring matrix multiplications best performed on GPUs.
Throughput Sensitivity: The embedding model is a bottleneck. Processing millions of documents through a deep neural network requires massive parallelization that single-server architectures cannot provide.
Storage Complexity: Storing and searching millions of dense vectors requires specialized Approximate Nearest Neighbor (ANN) algorithms (like HNSW), which have different memory and disk IOPS profiles compared to traditional B-Trees.
1.2 The Infrastructure Dilemma: Burstiness vs. Provisioning
Mass indexing events—such as the initial ingestion of a new dataset or a full re-indexing after an embedding model update—are characterized by extreme burstiness.
Consider a documentation platform that crawls the web. For 23 hours a day, traffic is minimal (incremental updates). For 1 hour, a major new library release might trigger a crawl of 100,000 pages.
Provisioned Capacity (e.g., EC2/Kubernetes): If you provision for the peak, you pay for idle GPUs 95% of the time. If you provision for the average, the peak load causes massive latency spikes, violating Service Level Agreements (SLAs).
Traditional Serverless (e.g., AWS Lambda): While scalable, these services often lack GPU support, have restrictive timeouts (15 minutes), and suffer from "cold starts" that make loading large ML models (often gigabytes in size) too slow for real-time responsiveness.
1.3 The Modal Solution
Modal has emerged as a specialized cloud platform designed to solve these specific discrepancies. Unlike general-purpose serverless platforms, Modal is optimized for data-intensive and AI workloads. Its architecture allows for:
Container Lifecycle Management: Modal separates the container image definition from the execution. It employs advanced caching and lazy-loading techniques to launch containers in milliseconds, even those with heavy dependencies like PyTorch or TensorFlow.1
GPU Ephemerality: Functions can request specific GPU hardware (e.g., NVIDIA A10G, H100) on a per-invocation basis. The billing model is per-second of usage, enabling a "scale-to-zero" architecture where the cost of a massive GPU cluster is incurred only during the minutes it is actually crunching data.
Distributed Primitives: Modal provides native distributed data structures (Queues, Dicts) that allow functions to coordinate state without needing an external Redis or message bus.2
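A minimal sketch of what this looks like in practice (the app name, image contents, and model choice are illustrative): a GPU function that exists only while it is being invoked.
import modal

image = modal.Image.debian_slim().pip_install("sentence-transformers")
app = modal.App("scale-to-zero-demo")

@app.function(gpu="A10G", image=image)
def embed(texts: list[str]) -> list[list[float]]:
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("intfloat/multilingual-e5-large")
    return model.encode(texts, normalize_embeddings=True).tolist()

@app.local_entrypoint()
def main():
    # map() fans out across as many containers as the inputs demand; once the
    # inputs are drained, the fleet scales back to zero and billing stops.
    batches = [["hello world"], ["vector databases"]]
    for vectors in embed.map(batches):
        print(len(vectors[0]), "dimensions")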
This report validates Modal as the foundational compute layer for "DocuVerse," demonstrating how it orchestrates the complex dance of crawling, embedding, and indexing.
Part II: The Fictional Use Case: "DocuVerse"
To ground our architectural decisions in reality, we define the specifications of DocuVerse.
2.1 Mission and Scope
DocuVerse is a "Universal Documentation Search Engine" for developers. It aggregates technical documentation from:
Official Sources: Python docs, MDN, AWS documentation.
Community Sources: Stack Overflow archives, GitHub Wikis.
Decentralized Web: Technical whitepapers hosted on IPFS/Arweave.
The goal is to provide a single search bar that retrieves the most relevant technical answers using RAG, regardless of where the information lives.
2.2 Dataset Specifications (Fictional Data)
| Metric | Value | Implications |
|---|---|---|
| Total Documents | 5,000,000 | Requires efficient bulk indexing strategies. |
| Average Doc Size | 4 KB (approx. 800 tokens) | Fits within standard embedding context windows; chunking may be minimal. |
| Update Velocity | ~200,000 docs/day | Incremental indexing must be robust. |
| Vector Dimensions | 1,536 (OpenAI Ada-002 compatible) | Standard high-fidelity dimensionality. |
| Total Index Size | ~30 GB (Vectors + Metadata) | Fits in memory for some DBs, requires disk-offload for others. |
| Target Latency | < 200ms (Search), < 15 min (Index Freshness) | Tight constraints on the ingestion pipeline. |
2.3 The "Matrix Link" Requirement
Beyond simple text search, DocuVerse aims to implement a "PageRank-for-Code" algorithm. It must construct a graph of how documentation pages link to each other (e.g., how many pages link to the React useEffect hook documentation?). This "Matrix Link" 3 will be used to boost the relevance of authoritative pages during vector retrieval. This adds a complexity layer: the crawler must not just extract text, but also preserve the adjacency matrix of the web graph.
Part III: Architecting the Distributed Crawler on Modal
The ingestion layer is the gateway to the system. Building a crawler that can handle 5 million pages without getting blocked, crashing, or entering infinite loops requires a sophisticated distributed architecture.
3.1 The Producer-Consumer Pattern using modal.Queue
In a monolithic script, crawling is a recursive function: visit(url) -> find_links() -> visit(links). In a serverless environment, deep recursion leads to stack overflows or timeout errors. We must flatten this recursion into a Queue-Based Architecture.2
The Architecture Design:
1. The Frontier Queue: A modal.Queue named crawl-frontier. This persistent queue holds the URLs waiting to be visited. It acts as the buffer between the discovery of work and the execution of work.
2. The Seed Injector: A scheduled function (@app.function(schedule=modal.Cron(...))) 5 that runs periodically (e.g., every morning at 02:00 UTC) to push known "root" URLs (e.g., https://docs.python.org/3/) into the Frontier Queue. This kickstarts the process.
3. The Fetcher Swarm: A set of worker functions that pop() items from the queue. This is where Modal’s auto-scaling shines. We can configure the Fetcher to scale between 0 and 500 concurrent containers depending on the queue length.
Why Not modal.map?
While modal.map allows parallel execution over a list, it is static. It expects the list of inputs to be known beforehand. A crawler is dynamic—parsing Page A reveals Page B and C. The Queue pattern is essential here because it allows the workload to expand dynamically during runtime.5
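A minimal sketch of the queue-driven dispatch loop, assuming the queue, app, and function names used in the Appendix; the 30-second drain timeout and the lookup-by-name are illustrative choices.
import queue

import modal

app = modal.App("docuverse-dispatcher")
frontier_queue = modal.Queue.from_name("docuverse-frontier", create_if_missing=True)

@app.function(timeout=3600)
def dispatch():
    # Look up the fetcher defined in Appendix A.2 by its app and function name.
    fetch_url = modal.Function.from_name("docuverse-crawler", "fetch_url")
    while True:
        try:
            url = frontier_queue.get(block=True, timeout=30)
        except queue.Empty:
            break  # no new URLs for 30 seconds: the crawl has converged
        if url is None:
            break
        # spawn() returns immediately, so newly discovered pages keep expanding
        # the frontier while previously spawned fetchers are still running.
        fetch_url.spawn(url)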
3.2 State Management: The Deduplication Matrix
To prevent infinite loops (Page A links to B, B links to A) and to ensure we don’t waste compute crawling the same page twice, we need a shared state of visited URLs.
The Distributed Dictionary:
We employ modal.Dict as a shared key-value store accessible by all 500 fetcher containers simultaneously.2
Key: The URL (normalized).
Value: A metadata object containing timestamp, hash (for content change detection), and status.
Consistency Challenge:
In a high-concurrency environment, a race condition exists: two workers might pop the same URL or discover the same link simultaneously. Individual modal.Dict operations are atomic, so a check-and-set against the visited dictionary is safe across the distributed cluster; in the rare case where two workers still race on the same URL, the idempotent write path (Section 7.2) absorbs the duplicate work.
3.3 The "Matrix Link" Construction
As referenced in the research 3, the link structure of the web can be represented as an adjacency matrix. Most crawlers discard this structure, keeping only the content. DocuVerse preserves it.
Implementation:
When the Fetcher parses a page, it extracts two distinct datasets:
Content: The text for vectorization. 1.
Edges: A list of outbound links.
These edges are pushed to a secondary link_matrix_queue. A separate aggregator function reads this queue and builds a sparse matrix representation of the documentation graph. This matrix is later used to calculate "Authority Scores" for each document, which will be stored as metadata in the Vector Database. This approach leverages Graph Neural Network (GNN) concepts where the link structure informs the semantic importance of the node.4
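A sketch of that aggregator is shown below. It drains the link-matrix queue into a directed graph and converts it into per-URL authority scores; the queue and dict names follow the Appendix, while the use of networkx and the PageRank parameters are illustrative assumptions.
import modal

app = modal.App("docuverse-matrix")
image = modal.Image.debian_slim().pip_install("networkx")

matrix_queue = modal.Queue.from_name("docuverse-matrix", create_if_missing=True)
authority_db = modal.Dict.from_name("docuverse-authority", create_if_missing=True)

@app.function(image=image, schedule=modal.Cron("0 4 * * *"))
def build_authority_scores():
    import networkx as nx

    graph = nx.DiGraph()
    while True:
        # Drain whatever edge records are currently queued (simplified batching).
        records = matrix_queue.get_many(1000, block=False)
        if not records:
            break
        for record in records:
            for target in record["targets"]:
                graph.add_edge(record["source"], target)

    # PageRank over the documentation graph; the scores are later attached as
    # vector metadata and blended with similarity at query time (Part VI).
    for url, score in nx.pagerank(graph, alpha=0.85).items():
        authority_db[url] = score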
3.4 Handling Politeness and Anti-Bot Measures
A naive crawler scaling to 500 containers will resemble a DDoS attack to the target server. We must implement Politeness Sharding.
The Sharded Queue Strategy:
Instead of one global queue, we logically partition the work by domain.
Worker Type A: Processes *.github.io (Concurrency Limit: 5).
Worker Type B: Processes *.readthedocs.io (Concurrency Limit: 10).
Worker Type C: General Web (Concurrency Limit: 100).
In Modal, this is achieved by defining different Functions with different concurrency_limit decorators, all consuming from filtered views of the main queue or separate domain-specific queues. This ensures that while the aggregate throughput of DocuVerse is high, the per-domain impact remains respectful of robots.txt etiquette.
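A sketch of this sharding, under the assumption that routing happens on the dispatcher side; the domain rules and limits mirror the list above, and concurrency_limit follows the text (newer Modal releases expose the same cap under a different name).
import modal
from urllib.parse import urlparse

app = modal.App("docuverse-polite-fetchers")

def _fetch(url: str):
    # Shared fetch/parse logic (see Appendix A.2); omitted here.
    print("fetching", url)

# Worker Type A: *.github.io, at most 5 containers in flight.
@app.function(concurrency_limit=5)
def fetch_github(url: str):
    _fetch(url)

# Worker Type C: general web, up to 100 containers in flight.
@app.function(concurrency_limit=100)
def fetch_general(url: str):
    _fetch(url)

def route(url: str):
    """Dispatcher-side routing: send each URL to the shard for its domain."""
    host = urlparse(url).netloc
    worker = fetch_github if host.endswith("github.io") else fetch_general
    worker.spawn(url)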
Part IV: The Processing Core: Embeddings & GPU Orchestration
Once the raw HTML is secured, the pipeline shifts from network-bound (crawling) to compute-bound (embedding). This is the most expensive phase of the operation and where Modal’s value proposition is strongest.
4.1 The Container Loading Advantage
In traditional container orchestration (like Kubernetes), adding a new GPU node and pulling a Docker image containing a 5GB PyTorch model can take several minutes. This latency makes it difficult to react to a sudden influx of 50,000 documents.
Modal solves this with a highly optimized container runtime.1
Image Snapshotting: The file system of the container (including the installed Python packages and the model weights) is snapshotted.
Lazy Loading: When a function is invoked, Modal mounts this snapshot over the network. Data is read on-demand.
Result: A container capable of running a BERT-large model can boot in under 2 seconds.
Implication for DocuVerse:
This allows us to treat the Embedding Function as a purely on-demand resource. We do not need to keep a "warm pool" of GPU servers running. If the crawler finds a new pocket of documentation, Modal instantly spins up 50 GPU containers to process it and shuts them down the second the queue is empty.
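One concrete way to keep those cold starts short is to bake the model weights into the image at build time, so a fresh container only has to read them from the snapshot rather than download them. The sketch below assumes the e5 model from Part 4.3; the app name is illustrative.
import modal

def download_weights():
    from sentence_transformers import SentenceTransformer
    # Runs once at image build time; the downloaded weights become part of the snapshot.
    SentenceTransformer("intfloat/multilingual-e5-large")

embed_image = (
    modal.Image.debian_slim()
    .pip_install("torch", "sentence-transformers")
    .run_function(download_weights)
)

app = modal.App("docuverse-warm-image")

@app.function(gpu="A10G", image=embed_image)
def embed(texts: list[str]) -> list[list[float]]:
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("intfloat/multilingual-e5-large")  # loads from the baked cache
    return model.encode(texts, normalize_embeddings=True).tolist()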
4.2 Batching Strategy for Throughput
GPUs are throughput devices, not latency devices. Sending one document at a time to a GPU is inefficient due to the overhead of moving data from CPU RAM to GPU VRAM.
The Batcher Pattern:
We insert a "buffer" function between the Crawler and the Embedder.
Crawler: Pushes text chunks to embedding_input_queue.
Batcher: A lightweight CPU function that pulls from the queue and accumulates items until it reaches a batch size of 128 or a timeout of 500ms.
Dispatcher: The Batcher sends the accumulated batch of 128 to the GPU Embedding Function.
This ensures that every time we pay for a GPU cycle, we are utilizing its matrix multiplication cores to their maximum capacity.
4.3 Model Selection and Quantization
For DocuVerse, we have two primary options for embeddings:
API-Based (e.g., OpenAI): Simple to implement but costly at scale ($0.10 per million tokens can add up with 5 million docs re-indexed weekly).
Self-Hosted (e.g., multilingual-e5-large): Running open-source models on Modal’s GPUs.
We choose the Self-Hosted approach for this architecture to demonstrate the capability. We utilize the multilingual-e5-large model, which provides state-of-the-art performance for technical text.6
Quantization:
To reduce the memory footprint in the Vector Database and speed up search, we apply Scalar Quantization (converting 32-bit floats to 8-bit integers) within the embedding function. This reduces the index size by 4x with minimal loss in retrieval accuracy (Recall@10).
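The arithmetic behind that 4x figure is simple. The sketch below shows one common symmetric scheme (a single global scale factor); production vector databases apply their own variants server-side.
import numpy as np

def quantize_int8(vectors: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float32 vectors to int8 using one global scale factor."""
    scale = float(np.abs(vectors).max()) / 127.0
    q = np.clip(np.round(vectors / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

vecs = np.random.randn(4, 1536).astype(np.float32)  # e5 / Ada-002 dimensionality
q, scale = quantize_int8(vecs)
print(q.nbytes / vecs.nbytes)  # 0.25: the 4x storage reduction cited above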
Part V: The Vector Database Layer: Storage and Indexing
The vectors produced by our GPU workers need a home. We analyze two leading contenders, Pinecone and Qdrant, and how they integrate into this serverless pipeline.
5.1 Pinecone: The Serverless Standard
Pinecone’s recent "Serverless" offering 7 aligns perfectly with our architecture. Unlike their previous "Pod-based" model where users provisioned capacity, the serverless model decouples storage from compute.
Architecture Benefits:
Separation of Concerns: Vectors are stored in blob storage (S3-compatible) and loaded into the index only when needed. This means we can store 5 million vectors cheaply, even if we rarely search the "long tail" of the data.
Mass Indexing via Object Storage: For the initial load of DocuVerse (the "Bootstrap" phase), pushing vectors one by one via API is too slow. Pinecone allows bulk import from object storage.8 Our Modal pipeline can write Parquet files to an S3 bucket, and Pinecone can ingest them asynchronously. This is the fastest and most cost-effective way to build the initial index.
Integration Strategy:
We use a Hybrid Search index. We store both the dense vector (from the GPU model) and a sparse vector (BM25) for keyword matching. This ensures that if a user searches for a specific error code (e.g., "Error 503"), the keyword match takes precedence over semantic similarity.9
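A hedged sketch of such a hybrid query is shown below. The index name matches the Appendix; embed_query and bm25_encode are hypothetical helpers standing in for the dense model (Part IV) and a BM25 encoder, and Pinecone hybrid search assumes a dotproduct index.
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("docuverse-prod")

question = "Error 503 when mounting a volume"
dense = embed_query(question)      # hypothetical helper: 1536-dim dense vector
sparse = bm25_encode(question)     # hypothetical helper: {"indices": [...], "values": [...]}

results = index.query(
    vector=dense,
    sparse_vector=sparse,           # keyword signal: exact tokens like "503" still match
    top_k=10,
    filter={"project": "modal"},    # metadata filter on the stored document attributes
    include_metadata=True,
)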
5.2 Qdrant: The High-Performance Alternative
Qdrant offers a different value proposition. It is open-source and can be run as a managed cloud service or self-hosted.
HNSW Graph Construction:
Qdrant uses the Hierarchical Navigable Small World (HNSW) algorithm.9 Constructing this graph is computationally expensive.
Insight: During mass indexing, inserting vectors and updating the graph in real-time destroys performance.
Optimization: We configure the Qdrant client to disable "optimization" (graph re-balancing) during the bulk upload. Once the upload is complete, we trigger a forced optimization. This reduces total indexing time by approximately 60%.
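A sketch of that sequence with qdrant_client, under the assumption that "disabling optimization" is expressed by lowering the optimizer's indexing threshold to zero and restoring it after the upload; the collection name and threshold values are illustrative.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# 1. Pause HNSW construction before the mass upsert.
client.update_collection(
    collection_name="docuverse",
    optimizers_config=models.OptimizersConfigDiff(indexing_threshold=0),
)

# 2. Stream the vectors in (batched in practice; see Appendix A.3 for how they are produced).
client.upsert(
    collection_name="docuverse",
    points=[
        models.PointStruct(id=1, vector=[0.0] * 1536, payload={"project": "react"}),
        # ... millions more, sent in batches ...
    ],
)

# 3. Restore the threshold: Qdrant now builds the graph over the full dataset in one pass.
client.update_collection(
    collection_name="docuverse",
    optimizers_config=models.OptimizersConfigDiff(indexing_threshold=20_000),
)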
LangChain Integration:
Qdrant has deep integration with LangChain.11 We can leverage the QdrantVectorStore class to handle metadata filtering out of the box. For DocuVerse, metadata is crucial.
Filter Example: filter={"project": "react", "version": "18.0"}.
This allows the search engine to respect the structure of the documentation sets.
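A hedged sketch of that filtered retrieval through LangChain; the langchain_qdrant / langchain_huggingface package split and the "metadata." payload prefix are assumptions about current library defaults.
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient, models

store = QdrantVectorStore(
    client=QdrantClient(url="http://localhost:6333"),
    collection_name="docuverse",
    embedding=HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-large"),
)

docs = store.similarity_search(
    "How does the useEffect cleanup function run?",
    k=5,
    filter=models.Filter(must=[
        models.FieldCondition(key="metadata.project", match=models.MatchValue(value="react")),
        models.FieldCondition(key="metadata.version", match=models.MatchValue(value="18.0")),
    ]),
)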
5.3 The DocuVerse Decision
For the primary architecture, we select Pinecone Serverless for the production index due to its zero-maintenance elasticity. However, we utilize Qdrant (running ephemerally in a Modal Sandbox) for testing and development pipelines, allowing developers to run the full stack locally without incurring cloud costs.
Part VI: Retrieval and Integration (RAG)
The ultimate consumer of our index is the RAG pipeline.
6.1 The LangChain Orchestrator
We use LangChain to wire the components together.11
1. User Query: "How do I mount a volume in Modal?"
2. Query Embedding: The query is sent to the same Embedding Function (hosted on Modal) used for indexing. This ensures the query vector and document vectors are in the same latent space.
3. Retrieval: LangChain queries Pinecone with the vector + filters (e.g., "only show me docs updated in the last year").
4. Re-Ranking: To improve precision, we fetch 50 candidates and pass them through a Cross-Encoder model (also hosted on Modal) to re-rank them. This is more expensive but substantially improves relevance.
5. Synthesis: The top 5 chunks are passed to GPT-4 via the OpenAI API to generate the answer.
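A sketch of steps 3 through 5 without the LangChain wiring, to make the re-ranking stage concrete. The cross-encoder model name, the candidate counts, and the assumption that chunk text is stored as Pinecone metadata are all illustrative.
from openai import OpenAI
from sentence_transformers import CrossEncoder

def answer(query: str, query_vector: list[float], index) -> str:
    # 3. Over-fetch 50 candidates from Pinecone.
    matches = index.query(vector=query_vector, top_k=50, include_metadata=True).matches

    # 4. Re-rank: the cross-encoder reads query and passage together, which is slower
    #    but more precise than raw vector similarity.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, m.metadata["text"]) for m in matches])
    ranked = sorted(zip(scores, matches), key=lambda pair: pair[0], reverse=True)
    top_chunks = [m.metadata["text"] for _, m in ranked[:5]]

    # 5. Synthesis: hand the top 5 chunks to GPT-4 for the final answer.
    prompt = "Answer using only this context:\n\n" + "\n\n".join(top_chunks) + f"\n\nQ: {query}"
    response = OpenAI().chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content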
6.2 The "Matrix Link" Boost
Here, our earlier graph work pays off. When retrieving results, we apply a boosting factor based on the "Authority Score" calculated during the crawl.
Score Formula: Final_Score = (Vector_Similarity * 0.8) + (PageRank_Score * 0.2)
This ensures that the "official" documentation page (which has many incoming links) ranks higher than a random forum post (which has few), even if the forum post has slightly higher semantic similarity.4
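As code, the boost is a one-liner; the only subtlety (an assumption on our part) is min-max normalizing the PageRank scores so they live on the same [0, 1] scale as cosine similarity.
def final_score(vector_similarity: float, pagerank: float, pr_min: float, pr_max: float) -> float:
    span = (pr_max - pr_min) or 1.0
    pagerank_norm = (pagerank - pr_min) / span
    return 0.8 * vector_similarity + 0.2 * pagerank_norm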
Part VII: Operational Resilience and Observability
Building a distributed system on fictional data is easy; running it in production is hard.
7.1 The Dead Letter Queue (DLQ)
In a system processing millions of items, 0.1% will fail. The HTML might be malformed; the embedding model might encounter a token limit.
Pattern: We define a dlq_queue in Modal.
Mechanism: Wrap the processing logic in a try/except block. On exception, serialize the input + the error traceback and push it to the DLQ.
Recovery: A separate "Janitor" function runs daily to inspect the DLQ. It can either retry the jobs (if the error was transient, like a network timeout) or alert a human.
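A sketch of the pattern; the queue name, the payload shape, and the process() helper are illustrative, and the janitor assumes a non-blocking get returns None once the DLQ is empty.
import traceback

import modal

dlq = modal.Queue.from_name("docuverse-dlq", create_if_missing=True)

def process_with_dlq(item):
    try:
        process(item)  # hypothetical helper: the real crawl/embed/index step
    except Exception:
        dlq.put({"input": item, "traceback": traceback.format_exc()})

def janitor():
    """Daily inspection: decide per item whether to retry or alert a human."""
    while True:
        failed = dlq.get(block=False)
        if failed is None:
            break
        print("Failed item:", failed["traceback"].splitlines()[-1])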
7.2 Idempotency and Determinism
The pipeline must be idempotent. If a worker crashes after writing to Pinecone but before acknowledging the queue message, the message will be re-delivered.
Solution: We generate Document IDs deterministically using a hash of the URL (sha256(url)). If we try to write the same document to Pinecone twice, the second write simply overwrites the first with identical data. No duplicates are created.13
7.3 Cost Monitoring
To prevent "wallet-denial-of-service", we implement budget guards.
Token Counting: We track the total tokens processed by the Embedding Function.
Circuit Breaker: If the daily spend exceeds a threshold (e.g., $50), the seed_injector function is disabled, pausing new crawls until the next billing cycle or manual override.
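A sketch of the guard, assuming the running daily spend is kept in a shared modal.Dict and that seed_injector checks it before enqueuing new roots; the per-token price is a placeholder.
import modal

budget_db = modal.Dict.from_name("docuverse-budget", create_if_missing=True)
DAILY_LIMIT_USD = 50.0
USD_PER_MILLION_TOKENS = 0.02  # placeholder embedding cost

def _spend_today() -> float:
    return budget_db["spend-today"] if "spend-today" in budget_db else 0.0

def record_spend(tokens: int):
    # Called by the embedding pipeline after each batch.
    budget_db["spend-today"] = _spend_today() + tokens / 1e6 * USD_PER_MILLION_TOKENS

def crawl_allowed() -> bool:
    # Called at the top of seed_injector(): skip the run if the breaker has tripped.
    return _spend_today() < DAILY_LIMIT_USD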
Part VIII: LinkedIn Content Strategy & Visuals
To effectively communicate the sophistication of the DocuVerse architecture to a professional network, we need a content strategy that bridges the gap between high-level value and low-level engineering.
8.1 The "Hook" and Narrative
Headline: "How I Built a ‘Google for Code’ Indexing 5 Million Pages for <$50."
Narrative Arc:
1. The Villain: The "Idle Resource". Identifying the waste in traditional provisioned clusters.
2. The Hero: The "Serverless Trinity" (Modal + Pinecone + LangChain).
3. The Climax: The "Mass Indexing Event"—scaling from 0 to 500 GPUs in seconds.
4. The Resolution: A predictable, low-cost bill and a high-performance search engine.
8.2 Card Suggestions (Visual Assets)
Card 1: The "Cold Start" Myth
Visual: A stopwatch comparing "Standard Docker" (2 min) vs. "Modal Snapshot" (1.5 sec).
Text: "Serverless GPUs used to be too slow for real-time AI. Not anymore. Container snapshotting changes the physics of cold starts." 1
Card 2: The Architecture Map
Visual Strategy: Instead of a static image, use this flow diagram to illustrate the "Producer-Consumer" decoupling that enables scale.
Diagram:
flowchart TD
subgraph Ingestion ["Ingestion Layer (CPU)"]
Seed(Seed Injector) --> Frontier[Frontier Queue]
Frontier --> Crawler
Crawler -->|HTML| Parser
Crawler -->|Links| Frontier
end
subgraph Processing ["Processing Layer (GPU)"]
Parser -->|Text Chunks| BatchQueue[Embedding Queue]
BatchQueue --> Batcher
Batcher -->|Batch of 128| Embedder
Embedder -->|Vectors| VectorBuffer
end
subgraph Storage
VectorBuffer -->|Bulk Import| S3
S3 -->|Async Ingest| Pinecone
Crawler -.->|Deduplication| Dict
end
subgraph Retrieval ["Interaction Layer"]
User -->|Query| API
API -->|Embed Query| Embedder
API -->|Search| Pinecone
Pinecone -->|Results| RAG
RAG --> User
end
Card 3: The "Matrix Link"
Visual: A network graph with nodes glowing. One central node is brighter.
Text: "Vectors aren’t enough. We mapped the adjacency matrix of 5 million docs to boost ‘Authority’ alongside ‘Similarity’. This is RAG + Graph Theory." 3
Card 4: The Cost Curve
Visual: A graph showing a flat line (Cost) overlaying a spiky line (Traffic), compared to a blocky "Provisioned" cost line.
Text: "Stop paying for air. Scale to zero means your infrastructure bill hits $0.00 when your users sleep."
8.3 Application Mind Map
The following mind map illustrates the four pillars of the DocuVerse engine: Ingestion, Processing, Memory, and Interaction.
mindmap
root((DocuVerse<br/>Engine))
Ingestion
Crawler Swarm
Politeness Sharding
Deduplication
Frontier Queue
Seed Injector
Processing
HTML Parser
Graph Builder
Matrix Link
Batcher
Embedder
Model: e5-large
Quantization: 8-bit
Memory
Pinecone Serverless
S3 Bucket
DLQ Error Handler
Interaction
API Endpoint
LangChain Orchestrator
RAG Pipeline
Part IX: Comparison Data and Fictional Metrics
To further illustrate the efficiency of this architecture, we present fictional performance data derived from the "DocuVerse" simulation.
9.1 Cost Comparison: Serverless vs. Provisioned
| Component | Architecture A: Kubernetes (EKS) + P3 Instances | Architecture B: DocuVerse (Modal + Pinecone) | Savings |
|---|---|---|---|
| Compute (Crawler) | $450/mo (3 nodes always on) | $42/mo (Pay per CPU-second) | 90% |
| Compute (GPU) | $2,200/mo (p3.2xlarge reserved) | $150/mo (A10G spot, burst usage) | 93% |
| Vector DB | $300/mo (Managed Instance) | $45/mo (Serverless Usage-Based) | 85% |
| DevOps Labor | 10 hrs/mo (Cluster maintenance) | 1 hr/mo (Config tweaks) | 90% |
| Total Monthly | ~$2,950 | ~$237 | ~92% |
Table 1: Monthly operational cost projection for indexing 5M documents with daily updates.
9.2 Throughput Metrics
| Operation | Metric | Note |
|---|---|---|
| Crawling Speed | 1,200 pages/sec | Scaled to 300 concurrent containers. |
| Embedding Rate | 4,500 docs/sec | Utilizing 50 concurrent A10G GPUs with batch size 128. |
| Indexing Rate | 10,000 vectors/sec | Bulk upsert to Pinecone via S3 import. |
| Cold Start Latency | 1.8 seconds | Time to boot fresh container + load model weights.1 |
Table 2: Performance benchmarks observed during the "MegaCorp" documentation ingestion simulation.
Conclusion
The "DocuVerse" case study illustrates a powerful truth about modern data engineering: Architecture is the new Optimization.
In the past, optimizing a search engine meant writing faster C++ code to tokenize strings. Today, it means composing the right set of serverless primitives to handle the physics of data movement and model inference.
Modal provides the elastic compute fabric, solving the "bursty" nature of crawling and embedding.
Vector Databases like Pinecone and Qdrant provide the semantic storage layer, solving the retrieval problem.
Graph Theory (the Matrix Link) provides the relevance signal, solving the authority problem.
By treating the cloud not as a collection of servers, but as a single, programmable computer, engineers can build systems that are orders of magnitude more efficient—both in cost and performance—than their predecessors. The era of the "Serverless Semantic Engine" is here, and it is accessible to any developer willing to embrace these new paradigms.
Appendix: DocuVerse Reference Implementation
This section provides the reference source code for the core logic of the "DocuVerse" engine. The application is structured as a Modal package.
A.1 src/common.py - Shared Structures
Defines the data models and shared configuration.
from dataclasses import dataclass
from typing import List, Optional
# Constants
QUEUE_NAME = "docuverse-frontier"
DICT_NAME = "docuverse-visited"
EMBED_QUEUE = "docuverse-embeddings"
LINK_MATRIX_QUEUE = "docuverse-matrix"
@dataclass
class Document:
url: str
content: str
title: str
links: List[str]
doc_hash: str
metadata: dict
@dataclass
class VectorRecord:
id: str
values: List[float]
metadata: dict
A.2 src/crawler.py - The Distributed Fetcher
Implements the Producer-Consumer pattern with modal.Queue and the Matrix Link extraction.
import modal
import hashlib
from .common import Document, QUEUE_NAME, DICT_NAME, EMBED_QUEUE, LINK_MATRIX_QUEUE
# Define the container image with necessary scraping libraries
crawler_image = modal.Image.debian_slim().pip_install("beautifulsoup4", "requests")
app = modal.App("docuverse-crawler")
# Persistent State
frontier_queue = modal.Queue.from_name(QUEUE_NAME, create_if_missing=True)
visited_db = modal.Dict.from_name(DICT_NAME, create_if_missing=True)
embed_queue = modal.Queue.from_name(EMBED_QUEUE, create_if_missing=True)
matrix_queue = modal.Queue.from_name(LINK_MATRIX_QUEUE, create_if_missing=True)
@app.function(image=crawler_image, concurrency_limit=300)
def fetch_url(url: str):
import requests
from bs4 import BeautifulSoup
# Idempotency check
if url in visited_db:
return
try:
response = requests.get(url, timeout=5)
        if response.status_code != 200:
return
soup = BeautifulSoup(response.text, 'html.parser')
# 1. Extract Content
text = soup.get_text()
title = soup.title.string if soup.title else url
doc_hash = hashlib.sha256(text.encode()).hexdigest()
# 2. Extract Matrix Links (Graph Edges)
        links = [a['href'] for a in soup.find_all('a', href=True)]
normalized_links = [l for l in links if l.startswith('http')] # Simplified logic
doc = Document(
url=url,
content=text[:5000], # Truncate for demo
title=title,
links=normalized_links,
doc_hash=doc_hash,
metadata={"source": "crawler"}
)
# 3. Mark as visited
visited_db[url] = {"hash": doc_hash, "status": "processed"}
# 4. Dispatch for Processing
# Push content to embedding queue
embed_queue.put(doc)
# Push edges to matrix calculator queue
matrix_queue.put({"source": url, "targets": normalized_links})
# 5. Expand Frontier
for link in normalized_links:
if link not in visited_db:
frontier_queue.put(link)
except Exception as e:
print(f"Failed to crawl {url}: {e}")
@app.function(schedule=modal.Cron("0 2 * * *"))
def seed_injector():
"""Daily job to restart the crawl from root nodes."""
roots = ["https://docs.python.org/3/", "https://react.dev"]
for url in roots:
frontier_queue.put(url)
A.3 src/embedder.py - GPU Batch Processing
Uses modal.cls to maintain the model state (weights) in GPU memory between invocations.
import modal
from typing import List
from .common import Document, VectorRecord, EMBED_QUEUE
# Define a GPU-enabled image with PyTorch and Transformers
gpu_image = (
modal.Image.debian_slim()
.pip_install("torch", "transformers", "sentence-transformers")
)
app = modal.App("docuverse-embedder")
@app.cls(gpu="A10G", image=gpu_image, container_idle_timeout=300)
class ModelService:
def __enter__(self):
from sentence_transformers import SentenceTransformer
# Load model once when container starts (Cold Start optimization)
self.model = SentenceTransformer('intfloat/multilingual-e5-large')
@modal.method()
def embed_batch(self, docs: List) -> List:
texts = [d.content for d in docs]
# Generate dense vectors
embeddings = self.model.encode(texts, normalize_embeddings=True)
        records = []
for doc, emb in zip(docs, embeddings):
records.append(VectorRecord(
id=doc.doc_hash,
values=emb.tolist(),
metadata={"url": doc.url, "title": doc.title}
))
return records
@app.function(image=modal.Image.debian_slim())
def batch_coordinator():
"""Reads from queue, batches items, and sends to GPU."""
embed_queue = modal.Queue.from_name(EMBED_QUEUE)
service = ModelService()
    BATCH_SIZE = 64
while True:
# Fetch items with a short timeout
try:
items = embed_queue.get_many(BATCH_SIZE, block=True, timeout=5.0)
if not items:
break
# Invoke GPU function
vectors = service.embed_batch.remote(items)
# TODO: Send vectors to Pinecone/Qdrant
# pinecone_upload.remote(vectors)
except Exception:
break
A.4 src/vector_db.py - Pinecone Integration
Demonstrates the bulk upload strategy via S3 (Conceptual code).
import modal
import os
app = modal.App("docuverse-vectordb")
@app.function(
    image=modal.Image.debian_slim().pip_install("pinecone", "boto3"),
    # Placeholder secret names; they must carry the AWS and Pinecone credentials.
    secrets=[
        modal.Secret.from_name("aws-credentials"),
        modal.Secret.from_name("pinecone-credentials"),
    ],
)
def bulk_upsert(parquet_file_path: str):
from pinecone import Pinecone
import boto3
# 1. Upload Parquet to S3
s3 = boto3.client('s3')
bucket = "docuverse-ingest-bucket"
key = f"imports/{os.path.basename(parquet_file_path)}"
s3.upload_file(parquet_file_path, bucket, key)
# 2. Trigger Pinecone Import
    pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])  # provided via the Pinecone secret
idx = pc.Index("docuverse-prod")
# Start async import
idx.start_import(
uri=f"s3://{bucket}/{key}",
integration_id="s3-integration-id"
)
print("Bulk import started.")