By Zhongchen Shen, Yehong Wang
At Coinbase, our journey into the regulated derivatives space with futures trading presented a critical challenge: establishing a robust risk management system. Initially, a strategic decision in 2022 led us to adopt a vendor solution. This allowed us to expedite development and leverage existing expertise, leading to a successful product launch in August 2023. Our team gained invaluable knowledge during this integration, understanding the intricacies of a functional risk system.
However, as our user base rapidly expanded, we began encountering significant limitations with the vendor system. These issues, particularly around system capacity, support, and availability, became critical as we aimed for 24x7 trading and anticipated a large increase in our user base. It became clear that a pivot was necessary.
The Pain Points of a Vendor System
Our experience with the vendor system highlighted several key technical challenges:
Data Inconsistency: The system transmitted different risk measures (e.g., Initial Margin, futures positions) in separate, independent messages. As a result, there was no consistent, real-time snapshot of a user’s risk profile, and the user experience suffered because metrics displayed together could be derived from different underlying data.
High Latency and Data Staleness: Because the vendor’s servers were not hosted on-premise, network transmission latency significantly reduced end-to-end throughput. This was unacceptable for real-time risk assessment and posed regulatory risks.
Difficulty with Full Recoverability: Architectural limitations meant the vendor system struggled with fast recovery from historical points, leading to operational disruptions.
Other Operational Hurdles: Mandatory weekly maintenance downtimes further hindered our ability to offer 24x7 futures trading.
Introducing STARK: Our In-House Solution
To address these critical issues, we embarked on building our in-house risk management system, STARK. Our design principles focused on overcoming the vendor’s shortcomings:
**Data Consistency through Snapshots:** Our new design prioritized data consistency as a foundational architectural principle. Instead of generating individual measures, STARK generates atomic snapshots. For every portfolio event (account balance update, futures trade, etc.), STARK calculates all the immediate and cascading risk measures as a result of that event, and bundles them together into a single, consistent output message. In addition, every generated snapshot message also includes references to the input data, ensuring complete traceability. We know exactly what combination of inputs contributed to every risk measure output.
STARK generates comprehensive snapshots where all related risk measures are bundled together and internally consistent. For every portfolio event, the system calculates and publishes a single, traceable snapshot of all immediate and cascading risk measures.
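To make the snapshot idea concrete, here is a minimal Python sketch. It is not STARK's actual schema: the `RiskSnapshot` and `InputRef` types, the field names, and the per-contract margin formula are all assumptions chosen for illustration. The point it shows is that every measure is computed in one step and shipped with references to the inputs that produced it.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class InputRef:
    """Pointer to one input event behind a snapshot (hypothetical shape)."""
    topic: str
    offset: int


@dataclass(frozen=True)
class RiskSnapshot:
    """All risk measures for one portfolio, computed from the same inputs."""
    portfolio_id: str
    sequence: int
    futures_position: int
    initial_margin: Decimal
    inputs: tuple[InputRef, ...]


def build_snapshot(portfolio_id, sequence, position, margin_per_contract, trigger):
    # Every measure is derived inside one call, so a consumer never sees a
    # position from one event paired with a margin from another.
    # The margin formula below is a placeholder, not a real margin model.
    initial_margin = abs(position) * margin_per_contract
    return RiskSnapshot(portfolio_id, sequence, position, initial_margin, (trigger,))


snap = build_snapshot("pf-1", 42, -3, Decimal("1500"), InputRef("trades", 1007))
print(snap.initial_margin)  # 4500
```

Because the snapshot is immutable and carries its own `inputs`, a consumer can display or audit it without joining several independent messages.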
Low Latency and High Throughput: By hosting all infrastructure on-premise, we minimized network latency.
To achieve minimal latency and high throughput, we made a conscious decision to trade distribution for speed. While a distributed architecture offers ultimate scalability, it introduces communication latency. We chose to squeeze the majority of the computationally heavy processes into a single, powerful risk calculation engine.
This centralized design was not a guess. Before implementation, we ran a proof-of-concept load test to verify that a centralized design could support our current and projected future load, which allowed us to move forward with confidence. The design itself is proven to scale horizontally.
The Result: The average data staleness plummeted from >10 seconds down to 10 milliseconds.
Rapid and Deterministic Recoverability: In a 24x7 trading environment, recovery time is a measure of operational resilience.
Leveraging a Kafka event stream for all input and output data, STARK ensures deterministically sequenced events. When a crash occurs, the system’s state can be reliably and quickly restored by starting with a periodic state snapshot (leveraging our Prime Brokerage’s framework) and then replaying only the historical events that occurred since that snapshot.
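The recovery flow described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: a plain list stands in for the Kafka topic, list indices stand in for offsets, and `EngineState` and `recover` are hypothetical names rather than part of STARK or the Prime Brokerage framework.

```python
from dataclasses import dataclass, field


@dataclass
class EngineState:
    """Toy risk state: net futures position per account."""
    positions: dict = field(default_factory=dict)
    last_offset: int = -1  # offset of the last event applied

    def apply(self, offset, event):
        account, qty = event
        self.positions[account] = self.positions.get(account, 0) + qty
        self.last_offset = offset


def recover(checkpoint, log):
    """Rebuild state from a checkpoint plus the events logged after it."""
    state = EngineState(dict(checkpoint.positions), checkpoint.last_offset)
    for offset in range(state.last_offset + 1, len(log)):
        state.apply(offset, log[offset])
    return state


# A list stands in for the ordered event stream (assumption).
log = [("acct-a", 2), ("acct-b", -1), ("acct-a", -5), ("acct-b", 4)]

# Normal operation: process everything, checkpointing after the second event.
live = EngineState()
checkpoint = None
for offset, event in enumerate(log):
    live.apply(offset, event)
    if offset == 1:
        checkpoint = EngineState(dict(live.positions), live.last_offset)

# After a crash, replay only offsets 2 and 3 on top of the checkpoint.
recovered = recover(checkpoint, log)
print(recovered.positions == live.positions)  # True
```

Recovery is deterministic here because the events have a fixed order and `apply` depends only on the current state and the event, so replaying from any checkpoint converges to the same result.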
Furthermore, the service runs an active/passive redundancy model, ensuring at least one passive instance maintains the exact same state as the active one at all times. If the active instance unexpectedly fails, the passive one takes over instantaneously.
The Result: The recovery process now takes a matter of seconds, not hours.
Rigorous Testing and Zero-Downtime Migration
Replacing a live, mission-critical system required more than just technical solutions; it required an ironclad deployment and testing strategy. Our strategy focused on a smooth cutover and ensuring the new system’s superiority:
Decoupling for Smooth Cutover: We implemented an abstract data access layer, decoupling the risk management system from upstream and downstream services. This allowed consumers to seamlessly toggle between the old and new systems.
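One way to picture such a layer is the Python sketch below. The interface and class names (`RiskSource`, `RiskDataAccess`, `switch_to`) are hypothetical, and the two backends return hard-coded stand-in values, so treat it as an illustration of the pattern rather than the actual implementation.

```python
from typing import Protocol


class RiskSource(Protocol):
    """What consumers need from any risk backend (hypothetical interface)."""
    def initial_margin(self, account_id: str) -> int: ...


class VendorSource:
    def initial_margin(self, account_id: str) -> int:
        return 7500  # stand-in for a call to the vendor system


class StarkSource:
    def initial_margin(self, account_id: str) -> int:
        return 7500  # stand-in for a read of a STARK snapshot


class RiskDataAccess:
    """Facade that consumers call; the backend behind it is switchable."""

    def __init__(self, sources: dict[str, RiskSource], active: str):
        self._sources = sources
        self.active = active

    def switch_to(self, name: str) -> None:
        if name not in self._sources:
            raise KeyError(f"unknown risk source: {name}")
        self.active = name

    def initial_margin(self, account_id: str) -> int:
        return self._sources[self.active].initial_margin(account_id)


access = RiskDataAccess({"vendor": VendorSource(), "stark": StarkSource()}, active="vendor")
before = access.initial_margin("acct-a")
access.switch_to("stark")  # cutover: consumer code is unchanged
after = access.initial_margin("acct-a")
print(access.active, before == after)  # stark True
```

Consumers depend only on the facade, so a cutover (or a rollback) is a single switch rather than a change in every calling service.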
Automated Equivalency Testing in Shadow Mode: STARK ran in shadow mode for over two months, undergoing automated equivalency testing. This involved comparing risk measures generated by both systems. For some metrics (e.g. futures positions, initial margin), exact matches were expected. For market-data-driven measures, a fuzzy match with an acceptable variation range was employed due to inherent timing and data feed cadences.
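The exact-versus-fuzzy comparison can be sketched as follows. The 0.1% tolerance, the `unrealized_pnl` measure, and the function name are assumptions made for the example; the actual variation ranges used during the migration are not stated here.

```python
import math

# Measures expected to match exactly (the two named in the text above).
EXACT_MEASURES = {"futures_position", "initial_margin"}
# Illustrative 0.1% band for market-data-driven measures (assumption).
RELATIVE_TOLERANCE = 0.001


def compare_measures(vendor, stark):
    """Return the names of measures where the two systems disagree."""
    mismatches = []
    for name, vendor_value in vendor.items():
        stark_value = stark[name]
        if name in EXACT_MEASURES:
            matched = vendor_value == stark_value
        else:
            matched = math.isclose(vendor_value, stark_value, rel_tol=RELATIVE_TOLERANCE)
        if not matched:
            mismatches.append(name)
    return mismatches


# unrealized_pnl is a hypothetical market-data-driven measure.
vendor = {"futures_position": 5, "initial_margin": 7500, "unrealized_pnl": 1200.00}
stark = {"futures_position": 5, "initial_margin": 7500, "unrealized_pnl": 1200.40}
print(compare_measures(vendor, stark))  # []
```

A small price-driven difference passes, while any difference in a position or margin figure would be reported as a mismatch for investigation.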
Value of Manual Testing: Despite extensive automation, manual testing proved invaluable, occasionally revealing that STARK’s output was correct even when it differed from the vendor’s.
Scenario Testing for Risk Trigger Calculation: The complex logic for deriving risk triggers (determining risk zones like healthy, warning, danger, liquidation) was modeled as a state machine. We developed 105 scenario test cases covering all possible state transitions, providing high confidence in this critical component. This proactive approach involved a learning curve, but it has resulted in zero issues with this logic since launch.
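A much smaller version of this idea is sketched below in Python. The four zones come from the text, but everything else is hypothetical: the utilization thresholds, the rule that an account leaves liquidation only once it is healthy again, and the 16-row table (the real suite has 105 scenarios). The sketch shows table-driven scenarios plus a check that every pair of current zone and target zone is exercised.

```python
from enum import Enum
from itertools import product


class Zone(Enum):
    HEALTHY = 0
    WARNING = 1
    DANGER = 2
    LIQUIDATION = 3


def classify(utilization):
    # Hypothetical thresholds on margin utilization.
    if utilization >= 1.0:
        return Zone.LIQUIDATION
    if utilization >= 0.9:
        return Zone.DANGER
    if utilization >= 0.75:
        return Zone.WARNING
    return Zone.HEALTHY


def next_zone(current, utilization):
    """Hypothetical rule: liquidation is left only once the account is healthy."""
    target = classify(utilization)
    if current is Zone.LIQUIDATION and target is not Zone.HEALTHY:
        return Zone.LIQUIDATION
    return target


H, W, D, L = Zone.HEALTHY, Zone.WARNING, Zone.DANGER, Zone.LIQUIDATION
SCENARIOS = [
    # (current zone, utilization, expected next zone)
    (H, 0.50, H), (H, 0.80, W), (H, 0.95, D), (H, 1.20, L),
    (W, 0.50, H), (W, 0.80, W), (W, 0.95, D), (W, 1.20, L),
    (D, 0.50, H), (D, 0.80, W), (D, 0.95, D), (D, 1.20, L),
    (L, 0.50, H), (L, 0.80, L), (L, 0.95, L), (L, 1.20, L),
]


def run_scenarios():
    failures = [s for s in SCENARIOS if next_zone(s[0], s[1]) is not s[2]]
    # Coverage check: every (current zone, target zone) pair must appear.
    covered = {(cur, classify(u)) for cur, u, _ in SCENARIOS}
    missing = set(product(Zone, Zone)) - covered
    return failures, missing


failures, missing = run_scenarios()
print(len(SCENARIOS), len(failures), len(missing))  # 16 0 0
```

Keeping the expected outcomes in an explicit table, separate from the transition logic, means a change to the rules fails a named scenario instead of silently changing behavior.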
The Wins and the Future with STARK
The migration, completed in less than two quarters by Q3 2024, has yielded significant benefits:
Performance: Data staleness reduced by 400x to low millisecond latency.
Reliability: Data inconsistency alerts are now virtually eliminated. Incident counts fell by 50% between H2 2024 and H1 2025.
Operational Excellence: We successfully launched 24x7 trading and onboarded institutional clients with much higher requests per second capacity.
Business Growth: Our system now handles millions of trades weekly, with total trading volume increasing 5x and total users 3x since the migration. STARK has been a key enabler for this sustained growth.
Our journey with STARK underscores a crucial lesson: the decision between vendor solutions and in-house development is complex, requiring a clear understanding of trade-offs, timely decisions, and a long-term strategic vision. When the time comes to make a difficult pivot, embracing the challenge and executing decisively can lead to transformative results.