For our NexHacks project, we wanted to explore a problem within prediction markets.
We started with a pretty simple frustration: prediction markets are powerful, but they’re hard to reason about unless you already think in probabilities. Most interfaces show a final price and expect users to translate that price into risk, confidence, and position sizing on their own. That gap is where people get confused.
Users place bets on gut feeling, but could you actually answer these (a rough numeric sketch follows the list):
- Your expected value if you're 80% confident?
- The Kelly-optimal position size?
- Your 95% Value at Risk?
- Whether the risk-reward ratio favors you?
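For concreteness, here’s a minimal sketch of the arithmetic behind those four questions (not our engine’s actual code; the function name and the $100 stake are illustrative), assuming a binary YES share priced between 0 and 1 that pays $1 on resolution:

```python
# Minimal sketch (not our production engine): core metrics for a binary
# YES share priced at `price` that pays $1 on a YES resolution.
# `confidence` is your own probability estimate that YES resolves true.

def position_metrics(price: float, confidence: float, stake: float = 100.0) -> dict:
    profit_if_win = 1.0 - price          # per-share profit when YES resolves
    loss_if_lose = price                 # per-share loss when NO resolves

    # Expected value per dollar staked
    ev_per_share = confidence * profit_if_win - (1 - confidence) * loss_if_lose
    ev_per_dollar = ev_per_share / price

    # Kelly-optimal fraction of bankroll: f* = (b*p - q) / b,
    # where b is the net odds received on the wager
    b = profit_if_win / loss_if_lose
    kelly_fraction = max(0.0, (b * confidence - (1 - confidence)) / b)

    # Crude 95% VaR for a single binary position: if the chance of losing
    # exceeds 5%, the 95th-percentile loss is the entire stake
    var_95 = stake if (1 - confidence) > 0.05 else 0.0

    # Risk-reward ratio: potential upside per unit of downside
    risk_reward = profit_if_win / loss_if_lose

    return {
        "ev_per_dollar": ev_per_dollar,
        "kelly_fraction": kelly_fraction,
        "var_95": var_95,
        "risk_reward": risk_reward,
    }

# Example: market price 0.60, you are 80% confident
print(position_metrics(price=0.60, confidence=0.80))
```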
Our original goal was modest: build something that helps people understand how a prediction market position behaves as probabilities and time change, using real Polymarket data. On top of that, our risk decision structure labels the simulated bet with a clear verdict: +EV, -EV, or no edge (a tiny sketch of that decision rule follows).
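The verdict itself is just a thresholding step on expected value; the cutoff below is a placeholder, not our shipped value:

```python
# Illustrative decision rule (the 2% threshold is a placeholder, not our
# shipped value): bucket expected value per dollar into a verdict.
def classify_edge(ev_per_dollar: float, threshold: float = 0.02) -> str:
    if ev_per_dollar > threshold:
        return "+EV"      # market price underrates your confidence
    if ev_per_dollar < -threshold:
        return "-EV"      # market price overrates your confidence
    return "no edge"      # expected value is roughly zero

print(classify_edge(0.33))  # "+EV" for the example above
```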
The Problem We Didn’t Expect
It was 4 AM. We were integrating multiple sponsor tools at once: Polymarket data, numeric reasoning, explanations, evaluation, observability. Each one worked fine on its own, but stitching them together was another story, especially under hackathon time pressure and with no sleep after my flight.
Every change required touching multiple pieces. Debugging meant guessing where things went wrong. And as the system grew, we realized the biggest risk wasn’t performance or features; it was coordination. The answer here wasn’t just “more instances” or more AI tokens.
That’s when we stopped treating DevSwarm like “an AI coding tool” and started using it as the backbone of how we worked.
Reframing the System
Instead of thinking in terms of endpoints and API calls, we asked a different question:
“What if analysis itself was a pipeline of specialists?”
So we broke our /explain flow into clear stages, each with a single responsibility, and let DevSwarm handle the orchestration between them.
At runtime, a user interaction kicks off a sequence that looks like this:
- One agent focuses purely on numeric reasoning (expected value, Kelly sizing, risk metrics)
- One agent focuses on compressing context so we stay efficient
- One agent turns that structured data into a clear explanation
- One agent evaluates the result for consistency and quality
Each stage runs independently, hands off structured output, and reports its timing and status.
DevSwarm Runtime Agent Pipeline
Context Builder (Wood Wide AI, ~50ms)
↓
Compressor (Token Company, ~25ms)
↓
Tutor (LLM, ~150ms)
↓
Evaluator (Arize Phoenix, ~15ms)
Each stage runs independently and reports timing and status in real time.
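Conceptually, the orchestration reduces to a very small loop: each stage takes the previous stage’s structured output, returns its own, and the runner records per-stage timing and status. The sketch below is illustrative only; it mirrors the stage names in the diagram but is not DevSwarm’s actual API.

```python
import time
from typing import Any, Callable

# Illustrative pipeline runner (not DevSwarm's actual API): each stage is a
# function that accepts the previous stage's structured output and returns
# its own, while the runner records per-stage timing and status.

def run_pipeline(stages: list[tuple[str, Callable[[dict], dict]]], payload: dict) -> dict:
    report: list[dict[str, Any]] = []
    for name, stage in stages:
        start = time.perf_counter()
        try:
            payload = stage(payload)
            status = "ok"
        except Exception as exc:
            status = f"error: {exc}"
        elapsed_ms = round((time.perf_counter() - start) * 1000, 1)
        report.append({"stage": name, "ms": elapsed_ms, "status": status})
        if status != "ok":
            break
    return {"result": payload, "report": report}

# Hypothetical stage functions standing in for the real integrations
stages = [
    ("context_builder", lambda p: {**p, "metrics": {"ev": 0.2, "kelly": 0.5}}),
    ("compressor",      lambda p: {**p, "context": "compressed"}),
    ("tutor",           lambda p: {**p, "explanation": "structured 3-part answer (stub)"}),
    ("evaluator",       lambda p: {**p, "eval_score": 0.92}),
]
print(run_pipeline(stages, {"market": "example-polymarket-slug"}))
```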
How it all fits together
With DevSwarm, we weren’t just wiring features together; we were building infrastructure.
The agent pipeline isn't just for show—it's how we:
- Debugged sponsor integrations in real-time
- Measured each sponsor's contribution
- Showed users exactly how their analysis was computed
On top of our human team, six DevSwarm builder agents worked in parallel, each with:
- Dedicated git branch (git-native isolation)
- Specialized prompt (500+ lines of context per agent)
- Clear ownership of files and responsibilities
- Conventional commit prefixes for merge coordination
We split development into parallel DevSwarm builders, each working on a dedicated branch with clear ownership:
- one focused on backend data and Polymarket integration,
- one on numeric reasoning,
- one on evaluation and observability,
- one on UI polish,
- and one whose only job was keeping the demo stable.
This sounds obvious in hindsight, but it completely changed our pace. We stopped stepping on each other’s work. Merges became predictable. The main branch stayed demo-ready, which is critical in a hackathon.
Our judge agent scanning for defects and issues against the scoring criteria

Older judge verdicts, which we used for continuous integration

Conditional "GO" state, after clearing major issues and reaching a high score on the product and technical judge

What We Ended Up Shipping
By the end of the 24 hours, we had something that felt surprisingly cohesive:
- A 3D visualization that shows how risk and payoff evolve as probabilities and time change
- A numeric engine that computes EV, Kelly sizing, Sharpe-style metrics, and VaR
- Live Polymarket data flowing through the system
- An automated evaluation loop using Arize Phoenix that let us measure and improve explanation quality instead of guessing
None of this felt rushed, even though the clock was always ticking. DevSwarm absorbed most of the coordination overhead that usually slows teams down.
| Agent | Sponsor | Function | Timing |
|---|---|---|---|
| context_builder | Wood Wide AI | Compute Kelly, Sharpe, VaR, EV, and break-even metrics | ~50ms |
| compressor | The Token Company | Reduce context size by 40%+ and track estimated cost savings | ~25ms |
| tutor | LLM | Generate a structured 3-part explanation from compressed context | ~150ms |
| evaluator | Arize Phoenix | Score explanation against numeric ground truth and consistency checks | ~15ms |
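The evaluator’s core check is mechanical: every metric the tutor quotes should match the numeric engine’s ground truth within a tolerance. The real loop runs through Arize Phoenix; the sketch below (generic Python, illustrative tolerance) only shows the idea.

```python
import re

# Simplified consistency check (the real evaluator runs through Arize Phoenix;
# this only sketches the idea): every ground-truth metric quoted in the
# explanation should match the numeric engine's value within a tolerance.

def consistency_score(explanation: str, ground_truth: dict[str, float],
                      rel_tol: float = 0.05) -> float:
    quoted = [float(x) for x in re.findall(r"-?\d+\.?\d*", explanation)]
    hits = 0
    for name, value in ground_truth.items():
        if any(abs(q - value) <= rel_tol * max(abs(value), 1e-9) for q in quoted):
            hits += 1
    return hits / max(len(ground_truth), 1)

truth = {"ev_per_dollar": 0.33, "kelly_fraction": 0.50}
text = "Your expected value is about 0.33 per dollar; Kelly suggests staking 0.50 of bankroll."
print(consistency_score(text, truth))   # 1.0 when every quoted metric matches
```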
The Biggest Takeaway
The most valuable insight wasn’t about speed or AI. It was about structure:
- Parallel execution where possible
- Structured handoffs between agents
- Git-native versioning of agent configurations
- Real-time visibility into each agent's contribution
When users can see how an answer is produced, not just the answer itself, they engage differently. They experiment more. They ask better questions. They actually learn.
For us, DevSwarm changed how we thought about structuring intelligence, both at runtime and during development.
Looking Back
We placed 3rd Overall at NexHacks, which was an awesome outcome, especially since this was only my second hackathon. But the bigger win was realizing there’s a better way to build AI-heavy products under pressure, and getting to play with so many new integrations along the way.
Not by hiding complexity, but by organizing it. Once we experienced that, it became hard to imagine going back to the old way.
Check out the result here!


