AudioGhost AI 🎵👻
AI-Powered Object-Oriented Audio Separation
Describe the sound you want to extract or remove using natural language. Powered by Meta's SAM-Audio model.
🎬 Demo
audioghost.mp4
Features
- 🎯 Text-Guided Separation - Describe what you want to extract: "vocals", "drums", "a dog barking"
- 📉 Memory Optimized - Lite mode reduces VRAM from ~11GB to ~4GB
- 🎨 Modern UI - Glassmorphism design with waveform visualization
- ⚡ Real-time Progress - Track separation progress as it runs
- 🎚️ Stem Mixer - Preview and compare original, extracted, and residual audio
🗺️ Roadmap
- 🎬 Video Support - Upload videos and separate audio sources visually
- 🖱️ Visual Prompting - Click on video to select sound sources (integration with SAM 3)
Architecture
┌──────────────────────────────────────────────────┐
│                     Frontend                     │
│              (Next.js + Tailwind v4)             │
└────────────────────────┬─────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────┐
│                   Backend API                    │
│                (FastAPI + Python)                │
└────────────────────────┬─────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────┐
│                    Task Queue                    │
│                 (Celery + Redis)                 │
└────────────────────────┬─────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────┐
│                  SAM Audio Lite                  │
│        (Memory-optimized Meta SAM-Audio)         │
└──────────────────────────────────────────────────┘
Requirements
- Python 3.11+
- CUDA-compatible GPU (4GB+ VRAM for lite mode, 12GB+ for full mode)
- CUDA 12.6 (recommended)
- Node.js 18+ (for frontend)
💡 FFmpeg and Redis are automatically installed by the installer.
🚀 One-Click Installation (Recommended)
First Time Setup
# Run installer (creates Conda env, downloads Redis, installs all dependencies)
install.bat
Daily Usage
# Start all services with one click
start.bat
# Stop all services
stop.bat
Manual Setup (Advanced)
1. Start Redis
Redis is automatically downloaded to redis/ folder by install.bat. If you prefer Docker:
docker-compose up -d
2. Create Anaconda Environment
# Create new environment (Python 3.11+ required)
conda create -n audioghost python=3.11 -y
# Activate environment
conda activate audioghost
3. Install PyTorch (CUDA 12.6)
pip install torch==2.9.0+cu126 torchvision==0.24.0+cu126 torchaudio==2.9.0+cu126 --index-url https://download.pytorch.org/whl/cu126 --extra-index-url https://pypi.org/simple
4. Install FFmpeg (required by TorchCodec)
conda install -c conda-forge ffmpeg -y
5. Install SAM Audio
pip install git+https://github.com/facebookresearch/sam-audio.git
6. Install Backend Dependencies
cd backend
pip install -r requirements.txt
7. Install Frontend Dependencies
cd frontend
npm install
8. Start Services
Terminal 1 - Backend API:
cd backend
uvicorn main:app --reload --port 8000
Terminal 2 - Celery Worker:
conda activate audioghost
cd backend
celery -A workers.celery_app worker --loglevel=info --pool=solo
Terminal 3 - Frontend:
cd frontend
npm run dev
9. Open the App
Navigate to http://localhost:3000
10. Connect HuggingFace
- Click "Connect HuggingFace" button
- Request access at https://huggingface.co/facebook/sam-audio-large
- Create Access Token: https://huggingface.co/settings/tokens
- Paste the token and connect
Usage
- Upload an audio file (MP3, WAV, FLAC)
- Describe what you want to extract or remove:
- "vocals" / "singing voice"
- "drums" / "percussion"
- "background music"
- "a dog barking"
- "crowd noise"
- Click Extract or Remove
- Wait for processing
- Preview and download the results
Performance Benchmarks
Tested on RTX 4090 with 4:26 audio (11 chunks @ 25s each)
VRAM Usage (Lite Mode)
| Model | bfloat16 (Default) | float32 (High Quality) | Recommended GPU |
|---|---|---|---|
| Small | ~6 GB | ~10 GB | RTX 3060 6GB / RTX 3070 8GB |
| Base | ~7 GB | ~13 GB | RTX 3070/4060 8GB / RTX 4070 12GB |
| Large | ~10 GB | ~20 GB | RTX 3080/4070 12GB / RTX 4080 16GB |
💡 High Quality Mode (float32): Better separation quality, but uses 2-3 GB more VRAM. Enable via the "High Quality Mode" toggle in the UI.
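As a rough guide, the table above can be turned into a model picker. This is an illustrative helper, not part of AudioGhost; the thresholds are simply the approximate VRAM figures from the table:

```python
# Approximate VRAM needs in GB, taken from the benchmark table above:
# {model_size: (bfloat16, float32)}
VRAM_GB = {
    "small": (6, 10),
    "base": (7, 13),
    "large": (10, 20),
}

def pick_model(available_gb, high_quality=False):
    """Return the largest model size that fits the available VRAM, or None."""
    idx = 1 if high_quality else 0
    for size in ("large", "base", "small"):
        if VRAM_GB[size][idx] <= available_gb:
            return size
    return None
```

For example, an 8 GB card lands on "base" in the default bfloat16 mode, but fits nothing in float32 mode.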
Processing Time
| Model | First Run (incl. model load) | Subsequent Runs | Speed |
|---|---|---|---|
| Small | ~78s | ~25s | ~10x realtime |
| Base | ~100s | ~29s | ~9x realtime |
| Large | ~130s | ~41s | ~6.5x realtime |
💡 First run includes model download and loading. Subsequent runs use cached models.
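The speed column follows directly from the benchmark clip length (4:26 = 266 seconds) divided by the subsequent-run time:

```python
# The benchmark clip above is 4:26, i.e. 266 seconds of audio.
AUDIO_SECONDS = 4 * 60 + 26

def realtime_factor(processing_seconds):
    """Seconds of audio processed per second of wall-clock time."""
    return AUDIO_SECONDS / processing_seconds

# e.g. Large: 266 s of audio in ~41 s of processing ≈ 6.5x realtime
```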
Memory Optimization Details
AudioGhost uses a "Lite Mode" that removes unused model components:
| Component Removed | VRAM Saved |
|---|---|
| Vision Encoder | ~2GB |
| Visual Ranker | ~2GB |
| Text Ranker | ~2GB |
| Span Predictor | ~1-2GB |
Total Reduction: Up to 40% less VRAM compared to original SAM-Audio
This is achieved by:
- Disabling video-related features (not needed for audio-only)
- Using `predict_spans=False` and `reranking_candidates=1`
- Using `bfloat16` precision by default (optional `float32` for quality)
- 25-second chunking for long audio files
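The 25-second chunking can be sketched as below. This is assumed logic for illustration, not the actual implementation in sam_audio_lite.py:

```python
import math

# Split a clip of `duration_seconds` into fixed 25-second windows,
# with a shorter final chunk covering the remainder.
CHUNK_SECONDS = 25

def chunk_spans(duration_seconds):
    """Return (start, end) second offsets for each chunk."""
    n = math.ceil(duration_seconds / CHUNK_SECONDS)
    return [
        (i * CHUNK_SECONDS, min((i + 1) * CHUNK_SECONDS, duration_seconds))
        for i in range(n)
    ]
```

Applied to the 4:26 (266 s) benchmark clip, this yields the 11 chunks quoted in the benchmarks, the last one only 16 seconds long.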
Project Structure
audioghost-ai/
βββ backend/
β βββ main.py # FastAPI app
β βββ api/ # API routes
β β βββ auth.py # HuggingFace auth
β β βββ separate.py # Separation endpoints
β βββ workers/
β βββ celery_app.py # Celery config
β βββ tasks.py # SAM Audio Lite worker
βββ frontend/
β βββ src/
β β βββ app/ # Next.js app
β β βββ components/ # React components
β βββ package.json
βββ sam_audio_lite.py # Standalone lite version
βββ QUICKSTART.md # Quick setup guide
βββ README.md
API Reference
POST /api/separate/
Create a separation task.
Form Data:
- `file` - Audio file
- `description` - Text prompt (e.g., "vocals")
- `mode` - "extract" or "remove"
- `model_size` - "small", "base", or "large" (default: "base")
Response:
{
"task_id": "uuid",
"status": "pending",
"message": "Task submitted successfully"
}
GET /api/separate/{task_id}/status
Get task status and progress.
GET /api/separate/{task_id}/download/{stem}
Download result audio (ghost, clean, or original).
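A client can poll the status endpoint until the task settles. The sketch below is a hedged illustration: the "completed"/"failed" status values and the injected `fetch` callable are assumptions, not documented API. In practice `fetch` would wrap `urllib.request.urlopen` against `http://localhost:8000`:

```python
import json
import time

def wait_for_task(task_id, fetch, poll_seconds=1, max_polls=100):
    """Poll GET /api/separate/{task_id}/status until the task finishes.

    `fetch(path)` must return the response body as a JSON string; it is
    injected so this loop can be exercised without a running server.
    """
    for _ in range(max_polls):
        body = fetch(f"/api/separate/{task_id}/status")
        status = json.loads(body)["status"]
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"task {task_id} still running after {max_polls} polls")
```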
Troubleshooting
CUDA Out of Memory
- Use `model_size: "small"` instead of "base" or "large"
- Ensure lite mode is enabled (check for "Optimizing model for low VRAM" in the logs)
- Close other GPU applications
TorchCodec DLL Error
- Downgrade to FFmpeg 7.x
- Ensure the FFmpeg `bin` directory is in PATH
HuggingFace 401 Error
- Re-authenticate via the UI
- Check that `.hf_token` exists in `backend/`
License
This project is licensed under the MIT License. SAM-Audio is licensed by Meta under a research license.
Credits
- SAM-Audio by Meta AI Research
- Core Optimization Logic: Special thanks to NilanEkanayake for providing the initial code modifications in Issue #24 that made reduced-VRAM inference possible.
- Built with ❤️ using Next.js, FastAPI, and Celery