TL;DR
- What we built: A fully automated Binance Level 2 order book streaming system on AWS Free Tier
- Cost: Uses AWS credits (~$15/month worth) for 260K+ snapshots/day across BTC, ETH, SOL
- Team: 6 WorldQuant University students with IAM-managed access
- Tech stack: EC2 (t3.micro) + Python + S3 + CloudWatch + Systemd
- Dataset size: ~100-130 GB (historical + 6 months live streaming)
- Research focus: Liquidity stress detection in cryptocurrency markets
- Result: 100% autonomous operation with zero manual intervention
We've all been there. You have a brilliant idea that needs collaboration and lots of data to succeed, and suddenly access and budget become your biggest obstacles.
Having decided to pick up my long-held interest in research again, I spoke to a couple of colleagues at WorldQuant University and we decided to just get it done. We wanted a fully open source project with a high-frequency finance dataset. But there were two big problems:
Each teammate needed access to ~30-50 GB of data. Downloading individually? Terrible for bandwidth and money.
We wanted the project to be truly reproducible, so anyone with the right technique could replicate it cheaply.
Our solution: AWS Free Tier + careful planning.
The Team
Our 6-person WorldQuant University research team:
- Kalu Goodness (me)
- Edet Joseph
- Igboke Hannah
- Ejike Uchenna
- Kunde Godfrey
- Fagbuyi Emmanuel
We're all students tackling high-frequency cryptocurrency market microstructure and liquidity stress detection.
The Plan
Static download: Grab historical trade data first, 5 key market periods covering COVID crash, bull runs, Terra collapse, post-FTX, and ETF approval (~30-50 GB total)
Live streaming: Continuous Level 2 order book snapshots every second for BTC/USDT, ETH/USDT, and SOL/USDT
Central S3 storage: Team members get read-only access via IAM
Budget & monitoring: Multi-threshold alerts and automatic stoppage to avoid surprises
Team IAM setup: Users in a group with policies attached for simplicity and safety
We used a t3.micro EC2 instance in London for proximity and cost efficiency, running our Python scripts in a virtual environment.
Part 1: EC2 Instance Setup
Instance Specifications:
Name:
Type: t3.micro (1 vCPU, 1GB RAM)
Storage: 20GB EBS (gp3)
Region: eu-west-2 (London - lowest latency to our team)
OS: Amazon Linux 2023
IAM Role:
Why London? Most of our team is in Europe/Africa, so this gave us the best latency.
Pro Tip: Always attach IAM roles to your EC2 instance during creation. Attaching them later sometimes requires stopping the instance, which interrupts your data collection.
IAM Role Policy for EC2:
The EC2 instance needs to write to S3, push CloudWatch metrics, and create SNS alerts:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::our-s3-bucket",
        "arn:aws:s3:::our-s3-bucket/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricData",
        "cloudwatch:PutMetricAlarm"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "sns:Publish"
      ],
      "Resource": "arn:aws:sns:*:*:binance-streamer-alerts"
    }
  ]
}
Important: Using IAM roles instead of access keys means no hardcoded credentials, automatic key rotation, easier auditing, and zero risk of accidentally committing secrets to GitHub.
Part 2: AWS Budgeting (The Most Important Part!)
Before writing any code, we set up comprehensive budget controls. This is not optional for student projects.
Quick note on AWS Free Tier changes: AWS recently changed how their free tier works. You now get $100 in credits when you sign up, plus $20 credits for trying out specific services like Lambda, EC2, RDS, and AWS Budgets. The big change is that on the free tier, you cannot be charged beyond your credits. It's a hard limit. Once you hit zero credits and don't top up, AWS emails you to upgrade and spins everything down if you don't comply. Some services are still free within limits, but most usage now draws from your credits immediately.
Budget: $50 worth of credits per month (we started with $180 in total credits)
Our 5-Threshold Alert System:
| Threshold | Amount | Action |
|---|---|---|
| 1 | 50% ($25) | Email alert |
| 2 | 80% ($40) | Email + Slack notification |
| 3 | 90% ($45) | Email to all team members |
| 4 | 95% ($47.50) | Stop EC2 instance |
| 5 | 100% ($50) | Terminate EC2 instance |
Critical Setup Note: Thresholds 4 and 5 require the EC2 instance to exist before creating the budget. Both the budget and EC2 instance must be in the same AWS region, or the actions won't trigger. This step caught me initially and cost me an hour of debugging!
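For reference, the email thresholds can be scripted instead of clicked through. Here is a minimal boto3 sketch of the budget plus the 50/80/90% notifications; the budget name and email address are placeholders, and the 95%/100% stop/terminate actions are separate budget actions that aren't shown here.

import boto3

budgets = boto3.client('budgets')
ACCOUNT_ID = boto3.client('sts').get_caller_identity()['Account']

budgets.create_budget(
    AccountId=ACCOUNT_ID,
    Budget={
        'BudgetName': 'binance-streamer-monthly',   # placeholder name
        'BudgetLimit': {'Amount': '50', 'Unit': 'USD'},
        'TimeUnit': 'MONTHLY',
        'BudgetType': 'COST',
    },
    NotificationsWithSubscribers=[
        {
            'Notification': {
                'NotificationType': 'ACTUAL',
                'ComparisonOperator': 'GREATER_THAN',
                'Threshold': pct,                    # percent of the $50 budget
                'ThresholdType': 'PERCENTAGE',
            },
            'Subscribers': [{'SubscriptionType': 'EMAIL',
                             'Address': 'team-lead@example.com'}],  # placeholder address
        }
        for pct in (50, 80, 90)
    ],
)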
Part 3: Static Historical Data Download
Before firing up the live streamer script, we ran a one-off Python script to grab historical Binance trade data.
What We Downloaded:
We targeted 5 specific market periods for our liquidity stress research:
Window 1 - COVID Crash (Feb-Jul 2020): 6 months
Window 2 - Bull Run (Nov 2020-Apr 2021): 6 months
Window 3 - Terra Collapse (Nov 2021-Apr 2022): 6 months
Window 4 - Post-FTX (Nov 2022-Apr 2023): 6 months
Window 5 - ETF Approval (Jul-Dec 2024): 6 months
Assets: BTC/USDT, ETH/USDT, SOL/USDT (note: SOL wasn't listed until Aug 2020)
Total: 44.6 GB compressed, 84 files
The script:
- Downloads from Binance Data Vision API
- Verifies SHA256 checksums
- Uploads directly to S3
- Deletes local copies to save disk space
- Runtime: ~15-20 minutes
binance_trade_data.py (excerpt)
import os
import hashlib

import boto3
import requests
from tqdm import tqdm

s3_client = boto3.client('s3')

BASE_URL = "https://data.binance.vision/data/spot/monthly/trades"
ASSETS = ["BTCUSDT", "ETHUSDT", "SOLUSDT"]
S3_BUCKET = "our-s3-bucket"

def download_and_upload(url, local_path, s3_key, checksum_url):
    # Download with progress bar
    response = requests.get(url, stream=True)
    total_size = int(response.headers.get('content-length', 0))
    with open(local_path, 'wb') as f:
        with tqdm(total=total_size, unit='B', unit_scale=True) as pbar:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
                pbar.update(len(chunk))

    # Verify the file against the published SHA256 checksum
    checksum_response = requests.get(checksum_url)
    expected_checksum = checksum_response.text.strip().split()[0]
    sha256_hash = hashlib.sha256()
    with open(local_path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            sha256_hash.update(chunk)

    if sha256_hash.hexdigest() == expected_checksum:
        print("✓ Checksum verified")
        # Upload to S3, then delete the local copy to save disk space
        s3_client.upload_file(local_path, S3_BUCKET, s3_key)
        os.remove(local_path)
        return True
    else:
        print("✗ Checksum mismatch!")
        return False
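The excerpt only shows the per-file helper. A minimal sketch of the driver loop we wrap around it might look like the following, assuming Binance Data Vision's <SYMBOL>-trades-<YYYY>-<MM>.zip naming convention with a matching .CHECKSUM file; the window and month values below are illustrative, not our full configuration.

# Hypothetical driver loop around download_and_upload(); only one window shown.
WINDOWS = {
    "covid_crash": ["2020-02", "2020-03", "2020-04", "2020-05", "2020-06", "2020-07"],
}

for window, months in WINDOWS.items():
    for symbol in ASSETS:
        for month in months:
            # SOLUSDT files don't exist before Aug 2020; a real run skips those months
            fname = f"{symbol}-trades-{month}.zip"
            url = f"{BASE_URL}/{symbol}/{fname}"
            checksum_url = f"{url}.CHECKSUM"
            s3_key = f"binance-historical/{window}/{symbol}/{fname}"
            download_and_upload(url, f"/tmp/{fname}", s3_key, checksum_url)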
This gave us a clean historical baseline before starting the live stream.
Why these specific periods? Each window represents a major market event: COVID crash (liquidity crisis), Bull run (FOMO), Terra collapse (contagion), FTX collapse (systemic failure), and ETF approval (institutional adoption). Perfect for studying liquidity under stress.
Part 4: Live L2 Order Book Streamer
After the static data, we started the real-time snapshots using Python with multi-threading.
Streaming Configuration:
Interval: 1-second snapshots per symbol
Depth: Top 100 bid/ask levels
Symbols: BTC/USDT, ETH/USDT, SOL/USDT (running concurrently in separate threads)
API: Binance REST API (not WebSocket - simpler and more reliable for our use case)
Output: S3 bucket with Hive-style partitioning
Why REST API instead of WebSocket?
- Simpler implementation
- 1-second granularity is sufficient for liquidity research
- Complete order book snapshots (not deltas)
- Automatic retries and error handling
- No re-connection logic needed
S3 Data Structure:
s3://our-s3-bucket/binance-l2-data/
├── symbol=BTCUSDT/
│   └── year=2025/
│       └── month=10/
│           └── day=22/
│               └── hour=10/
│                   ├── 20251022_100001_123456.json
│                   └── 20251022_100002_234567.json
├── symbol=ETHUSDT/
└── symbol=SOLUSDT/
Each JSON file contains:
- Full order book (100 levels)
- Best bid/ask prices
- Calculated metrics (spread, liquidity, imbalance)
- Timestamp metadata
Streamer Core Logic
import json
import time
import threading
from datetime import datetime

import boto3
import requests

class BinanceL2Streamer:
    def __init__(self, symbol, s3_bucket, depth=100, interval=1):
        self.symbol = symbol
        self.s3_bucket = s3_bucket
        self.depth = depth
        self.interval = interval
        self.s3 = boto3.client('s3', region_name='eu-west-2')
        self.cloudwatch = boto3.client('cloudwatch', region_name='eu-west-2')
        self.api_url = "https://api.binance.com/api/v3/depth"

    def get_order_book(self):
        """Fetch the current order book from the Binance REST API."""
        params = {'symbol': self.symbol, 'limit': self.depth}
        response = requests.get(self.api_url, params=params, timeout=5)
        response.raise_for_status()
        return response.json()

    def enrich_snapshot(self, raw_data):
        """Calculate derived metrics from the raw order book."""
        bids = raw_data.get('bids', [])
        asks = raw_data.get('asks', [])
        best_bid = float(bids[0][0]) if bids else 0
        best_ask = float(asks[0][0]) if asks else 0
        spread = best_ask - best_bid
        mid_price = (best_bid + best_ask) / 2

        # Notional liquidity (price * quantity) in the top 5 levels on each side
        bid_liquidity_5 = sum(float(b[0]) * float(b[1]) for b in bids[:5])
        ask_liquidity_5 = sum(float(a[0]) * float(a[1]) for a in asks[:5])
        total_liquidity_5 = bid_liquidity_5 + ask_liquidity_5

        return {
            'symbol': self.symbol,
            'timestamp': datetime.utcnow().isoformat(),
            'bids': bids,
            'asks': asks,
            'metrics': {
                'mid_price': mid_price,
                'spread': spread,
                'spread_bps': (spread / mid_price * 10000) if mid_price else 0,
                'bid_liquidity_5': bid_liquidity_5,
                'ask_liquidity_5': ask_liquidity_5,
                'imbalance_5': ((bid_liquidity_5 - ask_liquidity_5) / total_liquidity_5)
                               if total_liquidity_5 else 0
            }
        }

    def upload_to_s3(self, data):
        """Upload the snapshot with Hive-style time partitioning."""
        timestamp = datetime.utcnow()
        s3_key = (
            f"binance-l2-data/symbol={self.symbol}/"
            f"year={timestamp.year}/month={timestamp.month:02d}/"
            f"day={timestamp.day:02d}/hour={timestamp.hour:02d}/"
            f"{timestamp.strftime('%Y%m%d_%H%M%S_%f')}.json"
        )
        self.s3.put_object(
            Bucket=self.s3_bucket,
            Key=s3_key,
            Body=json.dumps(data),
            ContentType='application/json'
        )

    def run(self):
        """Main streaming loop: fetch, enrich, upload, sleep."""
        while True:
            try:
                raw_data = self.get_order_book()
                enriched_data = self.enrich_snapshot(raw_data)
                self.upload_to_s3(enriched_data)
            except Exception as exc:
                # Log and keep looping; systemd restarts the process if things get worse
                print(f"{self.symbol}: snapshot failed: {exc}")
            time.sleep(self.interval)

# Run all three symbols in parallel threads
SYMBOLS = ["BTCUSDT", "ETHUSDT", "SOLUSDT"]
threads = []
for symbol in SYMBOLS:
    streamer = BinanceL2Streamer(symbol, "our-s3-bucket")
    thread = threading.Thread(target=streamer.run, daemon=True)
    thread.start()
    threads.append(thread)

# Keep the main thread alive
while True:
    time.sleep(1)
The script also pushes custom CloudWatch metrics every 60 seconds:
- Snapshots captured
- Error count
- Spread (basis points)
- Order book imbalance
- Bid/ask liquidity
- Heartbeat (liveness check)
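The metric publisher isn't shown in the excerpt above, so here is a minimal sketch of what a 60-second push could look like with put_metric_data. ErrorCount and SnapshotsCaptured are the names the alarm script later in this post relies on; SpreadBps, OrderBookImbalance, Heartbeat, and the counter variables are our illustrative assumptions.

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='eu-west-2')

def publish_metrics(symbol, snapshots, errors, spread_bps, imbalance):
    """Push one batch of custom metrics; called roughly every 60 seconds."""
    dimensions = [{'Name': 'Symbol', 'Value': symbol}]
    cloudwatch.put_metric_data(
        Namespace='BinanceStreamer',
        MetricData=[
            {'MetricName': 'SnapshotsCaptured', 'Dimensions': dimensions,
             'Value': snapshots, 'Unit': 'Count'},
            {'MetricName': 'ErrorCount', 'Dimensions': dimensions,
             'Value': errors, 'Unit': 'Count'},
            {'MetricName': 'SpreadBps', 'Dimensions': dimensions,
             'Value': spread_bps, 'Unit': 'None'},
            {'MetricName': 'OrderBookImbalance', 'Dimensions': dimensions,
             'Value': imbalance, 'Unit': 'None'},
            {'MetricName': 'Heartbeat', 'Dimensions': dimensions,
             'Value': 1, 'Unit': 'Count'},
        ]
    )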
Part 5: Team IAM Access Setup
To keep things safe but shareable, we used IAM user groups with read-only policies.
The Process:
- Created IAM user group: BinanceReadOnlyTeam
- Created 6 individual IAM users (one per team member)
- Added all users to the group
- Attached read-only S3 policy to the group
- Generated access keys for each user
- Sent keys via email with a Jupyter notebook showing how to use them in Google Colab
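In script form, the same steps could look roughly like this with boto3; the user names are illustrative, and it assumes the read-only policy JSON shown just below has been saved to readonly_s3_policy.json. This variant attaches the policy inline to the group.

import boto3

iam = boto3.client('iam')
GROUP = 'BinanceReadOnlyTeam'
MEMBERS = ['member1', 'member2', 'member3', 'member4', 'member5', 'member6']  # illustrative names

iam.create_group(GroupName=GROUP)
# Inline policy: the read-only S3 policy JSON shown below, saved to a file
iam.put_group_policy(GroupName=GROUP,
                     PolicyName='BinanceS3ReadOnly',
                     PolicyDocument=open('readonly_s3_policy.json').read())

for name in MEMBERS:
    iam.create_user(UserName=name)
    iam.add_user_to_group(GroupName=GROUP, UserName=name)
    key = iam.create_access_key(UserName=name)['AccessKey']
    # Hand these over through a secure channel; never commit them
    print(name, key['AccessKeyId'])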
Read-Only S3 Policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListBucket",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::our-s3-bucket"
    },
    {
      "Sid": "ReadObjects",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion"
      ],
      "Resource": "arn:aws:s3:::our-s3-bucket/*"
    }
  ]
}
What team members CAN do:
- List files in S3
- Download data to Google Colab/local machines
- Read all historical and live data
What team members CANNOT do:
- Upload/modify/delete data
- Create EC2 instances
- See billing information
- Modify IAM permissions
- Access other AWS services
This setup meant everyone had access without risking accidental writes or exposing the main EC2 credentials. It gave us individual accountability with unique keys per person, easy revocation by removing a user from the group, centralized management where one policy change affects everyone, and a complete audit trail via CloudTrail.
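For the team-side notebook, access boils down to a few boto3 calls with the personal keys. A minimal sketch (the key values are placeholders; in Colab the keys would normally come from user input or a secrets store rather than being pasted inline, and the prefix matches the partition layout above):

import boto3

s3 = boto3.client(
    's3',
    region_name='eu-west-2',
    aws_access_key_id='AKIA...',       # personal read-only key (placeholder)
    aws_secret_access_key='...',       # never commit these
)

# List one hour of BTC snapshots
resp = s3.list_objects_v2(
    Bucket='our-s3-bucket',
    Prefix='binance-l2-data/symbol=BTCUSDT/year=2025/month=10/day=22/hour=10/',
)
keys = [obj['Key'] for obj in resp.get('Contents', [])]
print(f"{len(keys)} snapshots found")

# Download the first one locally for analysis
if keys:
    s3.download_file('our-s3-bucket', keys[0], 'snapshot.json')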
Part 6: Automation with Bash & Systemd
To make the system fully autonomous, we created bash scripts and systemd services.
Script 1: create_alarms.sh
Creates all CloudWatch alarms and SNS topic:
#!/bin/bash
SYMBOLS=("BTCUSDT" "ETHUSDT" "SOLUSDT")
REGION="eu-west-2"
SNS_TOPIC_ARN="arn:aws:sns:<region>:<account-id-redacted>:<topic>"

# Create SNS topic
aws sns create-topic --name binance-streamer-alerts --region $REGION

# Subscribe email
aws sns subscribe \
  --topic-arn $SNS_TOPIC_ARN \
  --protocol email \
  --notification-endpoint your-email@example.com \
  --region $REGION

# Create alarms for each symbol
for SYMBOL in "${SYMBOLS[@]}"; do
  # High error rate alarm: more than 10 errors in a 5-minute window
  aws cloudwatch put-metric-alarm \
    --alarm-name "BinanceStreamer-HighErrorRate-${SYMBOL}" \
    --namespace "BinanceStreamer" \
    --metric-name "ErrorCount" \
    --dimensions Name=Symbol,Value=$SYMBOL \
    --statistic Sum \
    --period 300 \
    --evaluation-periods 2 \
    --threshold 10 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions $SNS_TOPIC_ARN \
    --region $REGION

  # No data received alarm: fewer than 1 snapshot in a 5-minute window
  aws cloudwatch put-metric-alarm \
    --alarm-name "BinanceStreamer-NoData-${SYMBOL}" \
    --namespace "BinanceStreamer" \
    --metric-name "SnapshotsCaptured" \
    --dimensions Name=Symbol,Value=$SYMBOL \
    --statistic SampleCount \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 1 \
    --comparison-operator LessThanThreshold \
    --alarm-actions $SNS_TOPIC_ARN \
    --treat-missing-data breaching \
    --region $REGION
done
Script 2: setup_binance_services.sh
Creates systemd services for automatic startup and restarts:
#!/bin/bash
# Create streamer service
sudo tee /etc/systemd/system/binance-streamer.service > /dev/null <<EOF
[Unit]
Description=Binance L2 Order Book Streamer
After=network-online.target
[Service]
Type=simple
User=ec2-user
WorkingDirectory=/home/ec2-user
ExecStart=/home/ec2-user/trade_download/bin/python /home/ec2-user/binance_l2_streamer.py
Restart=always
RestartSec=10
MemoryMax=512M
CPUQuota=50%
Environment="PYTHONUNBUFFERED=1"
[Install]
WantedBy=multi-user.target
EOF
# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable binance-streamer.service
sudo systemctl start binance-streamer.service
With Restart=always and RestartSec=10, our streamer becomes virtually immortal. Script crashes? Auto-restart in 10 seconds. EC2 reboots? Starts automatically on boot. Zero manual intervention needed.
Monitoring Commands:
# View live logs
sudo journalctl -u binance-streamer.service -f
# Check service status
sudo systemctl status binance-streamer.service
# Restart if needed
sudo systemctl restart binance-streamer.service
Part 7: CloudWatch Monitoring & Alerts
We created comprehensive monitoring with 7 alarms per symbol (21 total):
- High Error Rate - More than 10 errors in 5 minutes
- No Data Received - No snapshots in 5 minutes
- High Spread - Spread > 50 bps (potential liquidity crisis)
- Extreme Order Book Imbalance - 70% imbalance for 2 minutes
- EC2 Status Check Failed - Instance health issues
- High CPU Usage - Above 80% for 10 minutes
- Low Disk Space - Disk usage above 80%
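The error-rate and no-data alarms come from the bash script in Part 6; the market-quality alarms were added the same way. As an example, the high-spread alarm (#3) could be created like this in boto3 — the SpreadBps metric name matches the sketch earlier, and the topic ARN here is a placeholder:

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='eu-west-2')

# Fires when the average spread stays above 50 bps for 5 consecutive minutes
cloudwatch.put_metric_alarm(
    AlarmName='BinanceStreamer-HighSpread-BTCUSDT',
    Namespace='BinanceStreamer',
    MetricName='SpreadBps',
    Dimensions=[{'Name': 'Symbol', 'Value': 'BTCUSDT'}],
    Statistic='Average',
    Period=60,
    EvaluationPeriods=5,
    Threshold=50,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:eu-west-2:123456789012:binance-streamer-alerts'],  # placeholder ARN
)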
All alarms send SNS email notifications to 2 members for redundancy (me and Joseph).
Example alarm trigger email:
ALARM: "BinanceStreamer-HighErrorRate-BTCUSDT" in EU (London)
Threshold Crossed: ErrorCount > 10 for 2 datapoints within 10 minutes
Current State: ALARM
Previous State: OK
Reason: Threshold Crossed: 1 datapoint [15.0] was greater than
the threshold (10.0).
This gives us peace of mind. We know immediately if anything goes wrong.
Part 8: Actual Costs & Results
After 72 Hours (3 Days) of Operation:
| Metric | Value |
|---|---|
| Total snapshots | 259,200 per symbol (777,600 total) |
| S3 storage | 2.6 GB live data + 44.6 GB historical = 47.2 GB |
| Uptime | 100% (0 errors) |
| Files created | ~384,000 JSON files |
| Avg file size | 7.5 KB per snapshot |
Cost Breakdown:
- EC2 (t3.micro, 72 hours): $0.93 worth of credits
- S3 storage (47.2 GB): $1.08 worth of credits
- S3 PUT requests: $0.05 worth of credits
- Data transfer: $0.02 worth of credits
- Total for 3 days: $2.08 worth of credits
Projected Monthly Cost: ~$10 worth of credits per month
Well under our $50 budget, and remember, on the free tier you can't actually be charged real money. Once your credits hit zero, AWS just emails you and shuts things down unless you upgrade. It's a hard limit.
Architecture Diagram
Here's the complete system architecture, showing the full data flow from the Binance API through our EC2 instance to S3 storage, with CloudWatch monitoring and IAM-controlled team access.
┌──────────────────────────────────────────────────────────┐
│ Binance REST API                                          │
│ (Order book snapshots for BTC/ETH/SOL every 1 second)     │
└────────────────────────────┬─────────────────────────────┘
                             ▼
┌──────────────────────────────────────────────────────────┐
│ EC2 Instance (t3.micro, eu-west-2)                        │
│ Python Streamer (3 threads)                               │
│ • binance_l2_streamer.py                                  │
│ • Systemd auto-restart                                    │
│ • IAM role: <ec2-s3-role>                                 │
└─────────────┬──────────────────────────────┬─────────────┘
              ▼                              ▼
┌───────────────────────────┐  ┌───────────────────────────┐
│ S3: our-s3-bucket         │  │ CloudWatch Metrics        │
│ • Historical: 44.6 GB     │  │ • Snapshots/min           │
│ • Live L2: 2.6 GB         │  │ • Error count             │
│ • Time-partitioned        │  │ • Spread, liquidity       │
│ • 777K+ snapshots         │  │ • Order book imbalance    │
└─────────────┬─────────────┘  └─────────────┬─────────────┘
              ▼                              ▼
┌───────────────────────────┐  ┌───────────────────────────┐
│ IAM User Group            │  │ CloudWatch Alarms         │
│ BinanceReadOnlyTeam       │  │ • 21 alarms total         │
│ • 6 team members          │  │ • 7 per symbol            │
│ • Read-only S3 access     │  └─────────────┬─────────────┘
│ • Individual access keys  │                ▼
└─────────────┬─────────────┘  ┌───────────────────────────┐
              ▼                │ SNS Topic                 │
┌───────────────────────────┐  │ streamer-alerts-topic     │
│ Team Analysis             │  │ • Email notifications     │
│ (Google Colab / Local)    │  │ • 2 team members          │
│ • Pandas DataFrames       │  └───────────────────────────┘
│ • Liquidity stress        │
│   analysis                │
│ • Market microstructure   │
│   research                │
└───────────────────────────┘
Lessons Learned
Here's what actually happened during our 3-day journey from idea to production system.
What Worked Really Well:
- Budget-first approach - Set up cost controls BEFORE infrastructure = zero surprises
- IAM roles for EC2 - No credential management headaches at all
- User groups for team - Adding/removing members took literally seconds
- Systemd for reliability - Scripts survived EC2 restarts, no manual intervention
- Time-based S3 partitioning - Made data queries fast and cheap
- REST over WebSocket - Simpler code, fewer bugs, good enough for our needs
Challenges We Hit:
| Challenge | What Went Wrong | Solution | Time Lost |
|---|---|---|---|
| Budget actions didn't trigger | Created budget before EC2 existed | Created EC2 first, then recreated budget in same region | 1 hour |
| Disk space filled up in 6 hours | Kept local copies after S3 upload | Added os.remove(local_path) after upload | 30 min |
| CloudWatch metrics not appearing | Used wrong region in boto3 client | Changed region_name='us-east-1' to 'eu-west-2' | 20 min |
| Systemd service wouldn't start | Wrong Python interpreter path | Fixed ExecStart to use venv Python: /home/ec2-user/trade_download/bin/python | 15 min |
| IAM permissions initially too broad | Gave EC2 role full S3 access | Restricted to specific bucket ARN only | 45 min |
Debugging Tip: 90% of our issues were solved by checking sudo journalctl -u binance-streamer.service -n 50. Always check logs first before assuming AWS is broken!
Part 9: Making the Data Research-Ready
Getting raw data is only half the battle. We needed to transform 44.6 GB of compressed historical data and continuous JSON snapshots into something you could actually query and analyze.
The full data processing pipeline deserves its own article (unzipping historical data, converting JSON to Parquet, AWS Athena setup, feature engineering, etc.), but hereβs what we built:
The transformation: Raw JSON → Parquet (60% storage reduction, 10-20x faster queries)
The interface: AWS Athena with time-based partitioning (query 6 months in seconds)
The features: Pre-computed metrics like log_returns, volume_imbalance, effective_spread, and liquidity_score
Result: Team members can run complex queries across the entire dataset without downloading anything locally. What would take minutes of local processing now happens in seconds via Athena.
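As a flavour of the conversion step, here is a minimal sketch of turning a batch of snapshot JSON files into a partitioned Parquet dataset with pandas and pyarrow; the local paths and column selection are illustrative, and the real pipeline (including the Athena tables) is covered in Part 2.

import json
from pathlib import Path

import pandas as pd

rows = []
for path in Path('snapshots/').glob('*.json'):        # local copies of L2 snapshots
    snap = json.loads(path.read_text())
    rows.append({
        'symbol': snap['symbol'],
        'timestamp': snap['timestamp'],
        **snap['metrics'],                             # mid_price, spread_bps, imbalance_5, ...
    })

df = pd.DataFrame(rows)
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['date'] = df['timestamp'].dt.date

# Partitioned Parquet: smaller on disk and much faster to scan in Athena
df.to_parquet('l2_metrics/', engine='pyarrow',
              partition_cols=['symbol', 'date'], index=False)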
This data processing pipeline will be covered in detail in Part 2, along with our feature engineering approach and the actual liquidity stress detection models.
What Weβre Researching
Our project focuses on liquidity stress detection in cryptocurrency markets using order book microstructure.
Research Questions:
- Can we predict liquidity crises before they happen by analyzing order book imbalance?
- How does liquidity behave differently across BTC, ETH, and SOL during stress events?
- Are there early warning signals in bid-ask spreads during major market moves?
- How quickly does liquidity recover after major crashes (COVID, Terra, FTX)?
Methodology:
- Compare L2 snapshots during "normal" vs "stress" periods
- Analyze spread widening, order book depth depletion, and imbalance patterns
- Build predictive models using machine learning on microstructure features
- Validate findings across all three assets and five historical windows
This dataset will enable us (and the broader research community) to study these questions with unprecedented granularity.
Full System Workflow
Here's what happens after initial setup (completely hands-off):
- EC2 boots → systemd starts binance-streamer.service
- Python script launches → 3 threads (BTC, ETH, SOL) start streaming
- Every second → Each thread fetches order book, enriches data, uploads to S3
- Every 60 seconds → CloudWatch metrics pushed (snapshots, errors, spread, etc.)
- CloudWatch monitors → Checks all 21 alarms continuously
- If alarm triggers → SNS sends email to team leads immediately
- If script crashes → Systemd auto-restarts after 10 seconds
- If EC2 reboots → Everything starts automatically on boot
- Team members → Download data anytime via read-only IAM credentials
Total manual intervention required: Zero
Current Data Collection Status
As of writing this article:
Live Streaming (Day 3):
- 777,600 total snapshots captured
- 0 errors (100% uptime)
- 2.6 GB of live L2 data
- All CloudWatch alarms functioning
- Team members accessing data successfully
Historical Data:
- 44.6 GB across 5 market periods
- 84 monthly files downloaded and verified
- All checksums validated
Projected 6-Month Collection:
- ~15.5 million snapshots per symbol (~47 million total)
- ~50-80 GB additional live data
- Total dataset: ~100-130 GB
Code Repository
All scripts (Python + Bash) and setup documentation will be released on GitHub:
GitHub: Coming soon (will include complete setup guide, all scripts, and infrastructure templates)
What will be included:
- binance_l2_streamer.py - Multi-threaded live streamer
- binance_trade_data.py - Historical data downloader
- create_alarms.sh - CloudWatch alarm automation
- setup_binance_services.sh - Systemd service installer
- IAM policy templates (JSON)
- Complete setup documentation
- Team access guide (Google Colab notebook)
Making It Open Source
Once our research concludes (~6 months), we'll release:
1. The complete dataset (~50-100 GB)
- Historical trade data (5 market periods)
- 6 months of live L2 snapshots
- All three symbols (BTC, ETH, SOL)
2. All infrastructure code (might come earlier)
- Terraform templates for one-click deployment
- Docker containers (alternative to EC2)
- Complete setup scripts
3. Analysis notebooks
- Liquidity stress detection algorithms
- Market microstructure analysis
- Visualization dashboards
4. Research paper
- Methodology
- Findings
- Reproducibility instructions
Why? High-frequency financial data is expensive and gatekept. Open-source data democratizes quantitative finance research for students and independent researchers.
How You Can Replicate This
Everything is designed to be plug-and-play. If you can SSH into a server and run a bash script, you can replicate this entire system.
Requirements:
- AWS account with Free Tier ($100 credits for new accounts, plus $20 bonus credits per service)
- Basic Python knowledge
- SSH access to Linux
- 30-60 minutes for initial setup
Estimated Costs:
- First few months: Draws from your $100-140 in credits
- Monthly credit usage: ~$8-12 worth
- One-time setup time: 1-2 hours
Steps:
- Clone our GitHub repo (link coming soon)
- Follow the setup guide in README.md
- Run setup_binance_services.sh on your EC2 instance
- Configure your budget and alarms
- Start streaming!
The entire infrastructure is designed to be plug-and-play reproducible.
Want to Collaborate?
We're always interested in collaborating with other researchers working on:
- Market microstructure
- Cryptocurrency liquidity
- High-frequency data analysis
- Open-source finance datasets
Reach out if you:
- Want to use our dataset for your research
- Have ideas for improving the infrastructure
- Want to contribute to the open-source release
- Are building similar systems and want to share knowledge
Closing Thoughts
Building this system taught us that good data infrastructure doesn't require a huge budget or enterprise tools. It requires clear thinking about requirements and constraints, careful planning (especially around costs), simple and reliable tools over complex ones, good documentation for reproducibility, and team collaboration with clear ownership.
The hardest part wasn't the code. It was figuring out the architecture, IAM policies, budget controls, and making everything work together reliably.
If six students can build a production-grade market data pipeline on $10/month worth of credits, you can too.
Thanks for reading!
This article is part of our ongoing open-source research project. Star our GitHub repo when it's released to stay updated on the dataset publication and research findings!