Multimodal Gameplay Video Understanding with Vision-Language Models
A research framework for multimodal video understanding and question-answering on gameplay footage, combining state-of-the-art vision encoders, audio processing, and large language models with trained projection adapters. An article I wrote describing what this project aims to do: https://medium.com/@cmetoyerbusiness/towards-a-cascaded-multimodal-pipeline-for-long-horizon-gameplay-analysis-25ed6a8630c9
Trained Weights
Download the trained adapters from Hugging Face: https://huggingface.co/cjm249/gameplay-vision-llm-adapters
Abstract
This project implements a multimodal perception-reasoning pipeline for analyzing gameplay videos. The system integrates visual perception (SAM3, SigLIP), temporal understanding (VideoMAE), audio processing (Wav2Vec2, Whisper), and text extraction (OCR) with a vision-language model (Qwen3-VL-8B-Instruct) through learned projection layers. The architecture enables natural language question-answering about video content by projecting heterogeneous perceptual embeddings into a unified representation space compatible with the language model’s hidden dimensions.
Project Validation (Verified Capabilities)
The final deployment validates the project’s ability to perform long-horizon reasoning by combining vision, temporal, and textual facts across extended timelines.
- Multimodal Alignment Success: The system successfully integrated heterogeneous encoder features (SigLIP: 1152-dim, VideoMAE: 768-dim, Wav2Vec2: 1024-dim) into the Qwen LLM’s 4096-dimensional latent space using the trained ProjectorBank.
- Causal Reasoning Verified: LoRA fine-tuning successfully enabled the model to perform structured strategic analysis and answer complex ‘why’ questions, such as linking player actions (e.g., maximum Overcharge application) to subsequent game state changes (e.g., the BROKEN state) [A: 172, A: 175, A: 549].
- Temporal Synthesis: The system demonstrated the ability to synthesize detailed, chronological summaries of events spanning minutes of gameplay by retrieving context from the indexed timeline.
Architecture
Perception Pipeline Components
| Encoder | Model | Output | Purpose |
|---|---|---|---|
| SAM3 | facebook/sam3 | Segmentation masks | Entity detection and localization |
| SigLIP | google/siglip2-so400m-patch14-384 | 1152-dim | Semantic visual embeddings |
| VideoMAE | MCG-NJU/videomae-base | 768-dim | Temporal video understanding |
| Wav2Vec2 | facebook/wav2vec2-large | 1024-dim | Audio feature extraction |
| Whisper | openai/whisper-base | Text | Speech-to-text transcription |
| PaddleOCR | PaddlePaddle | Text | On-screen text extraction |
Projection Layer
Learned MLP projectors map heterogeneous encoder outputs to the LLM’s hidden space (4096-dim):
import torch.nn as nn

class MultiModalProjector(nn.Module):
    def __init__(self, input_dim, llm_dim=4096):
        super().__init__()
        # Two-layer MLP mapping an encoder's native dimension to the LLM hidden size
        self.proj = nn.Sequential(
            nn.Linear(input_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):
        return self.proj(x)
Fusion and Indexing
The project utilizes a Hybrid Retrieval system for context fetching, which is critical for long-video understanding:
- Time-Based Retrieval: Used when the user provides an explicit timestamp (e.g., ‘@00:45’), retrieving events within a defined window.
- Semantic Retrieval: For general queries (‘what happened here?’), the system uses the all-MiniLM-L6-v2 embedder to find the top-$K$ most relevant events in the entire timeline index [A: 429, A: 437, A: 517].
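A minimal sketch of the semantic branch, assuming sentence-transformers is installed and timeline events are available as plain-text descriptions (semantic_retrieve and event_texts are illustrative names; the real index lives in src/fusion_indexing/timeline_indexer.py):

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_retrieve(query, event_texts, k=5):
    # Embed the query and all timeline events, then rank events by cosine similarity.
    query_emb = embedder.encode(query, convert_to_tensor=True)
    event_embs = embedder.encode(event_texts, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, event_embs, top_k=k)[0]
    return [(event_texts[h["corpus_id"]], h["score"]) for h in hits]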
Reasoning Core
- Base Model: Qwen/Qwen3-VL-8B-Instruct
- Attention: Flash Attention 2
- Fine-tuning: LoRA adapters (r=16, alpha=32)
- Precision: bfloat16
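A loading sketch matching this configuration (assumes a recent transformers release with Qwen3-VL support and flash-attn installed; not the repo's exact loading code):

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct",
    torch_dtype=torch.bfloat16,               # matches the precision listed above
    attn_implementation="flash_attention_2",  # Flash Attention 2
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")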
Installation
Tested Environment
This project has been tested on:
- RunPod Image: runpod/pytorch:2.8.0-py3.12-cuda12.8.0-ubuntu24.04
- Python: 3.12+
- CUDA: 12.8+
- GPU: NVIDIA H200, A100 (40GB+ recommended)
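A quick sanity check that your environment matches (assumes PyTorch is already installed):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))"
# expect something like: 2.8.0 12.8 NVIDIA H200 (or A100)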
Quick Start (Recommended)
Use the automated setup script which handles all dependency ordering and known issues:
# Clone repository
git clone https://github.com/chasemetoyer/gameplay-vision-llm.git
cd gameplay-vision-llm
# Run the setup script (handles everything)
chmod +x setup_env.sh
./setup_env.sh
# Download trained weights from Hugging Face
python -c "from huggingface_hub import snapshot_download; snapshot_download('cjm249/gameplay-vision-llm-adapters', local_dir='outputs')"
The setup_env.sh script:
- Installs PyTorch and build dependencies first
- Installs Flash Attention from pre-built wheel
- Installs core dependencies from requirements-core.txt
- Installs PaddlePaddle GPU 3.2.0 from the official Paddle wheel index
- Restores PyTorch CUDA libraries (fixes conflicts)
- Verifies all installations
Run Inference
# With a local video (full processing with SAM detection)
python scripts/realtime_inference.py \
--video "/path/to/your/gameplay.mp4" \
--use-sam \
--interactive
# With a YouTube URL
python scripts/realtime_inference.py \
--video "https://www.youtube.com/watch?v=VIDEO_ID" \
--use-sam \
--interactive
# Without SAM3 (faster processing)
python scripts/realtime_inference.py \
--video "/path/to/your/gameplay.mp4" \
--interactive
Manual Installation (Alternative)
If you prefer manual installation:
# 1. Install PyTorch first (required for Flash Attention)
pip install torch torchvision torchaudio accelerate
# 2. Install Flash Attention from pre-built wheel
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
# 3. Install core dependencies
pip install -r requirements-core.txt
# 4. Install PaddleOCR with GPU (from official Paddle source)
python3 -m pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
pip install paddleocr
Project Structure
gameplay-vision-llm/
├── README.md # This file
├── requirements.txt # Full frozen dependencies
├── requirements-core.txt # Core dependencies with min versions
├── pyproject.toml # Project metadata
│
├── src/ # Source code
│ ├── agent_core/ # Core reasoning pipeline
│ │ └── qwen_reasoning_core.py # PerceptionReasoningLoop, ProjectorBank
│ ├── perception/ # Visual perception modules
│ │ ├── sam_concept_segmenter.py
│ │ ├── siglip_semantic_encoder.py
│ │ └── ocr_pipeline.py
│ ├── audio/ # Audio processing
│ │ └── qwen_audio_processor.py
│ ├── temporal/ # Temporal modeling
│ │ └── internvideo_hico_module.py
│ └── fusion_indexing/ # Timeline and retrieval
│ ├── timeline_indexer.py
│ └── knowledge_base_builder.py
│
├── scripts/ # Executable scripts
│ ├── realtime_inference.py # Main interactive inference
│ ├── extract_features.py # Feature extraction pipeline
│ ├── train_projectors.py # Projector training
│ ├── finetune_lora.py # LoRA fine-tuning
│ └── demo_projector_inference.py
│
├── outputs/ # Model outputs
│ ├── projector_weights.pt # Trained projector weights
│ └── lora_adapter/ # LoRA adapter weights
│
├── data/ # Data directory
│ ├── raw_videos/ # Input video files
│ ├── training/ # Training data (Q&A pairs)
│ └── outputs/ # Extracted features
│
├── docs/ # Documentation
└── tests/ # Unit tests
Usage
Real-Time Inference
Interactive question-answering on gameplay videos:
# Local video file with full processing
python scripts/realtime_inference.py \
--video path/to/gameplay.mp4 \
--use-sam \
--interactive
# YouTube video (auto-download)
python scripts/realtime_inference.py \
--video "https://youtube.com/watch?v=..." \
--use-sam \
--interactive
# Without SAM3 (faster, less accurate)
python scripts/realtime_inference.py \
--video path/to/gameplay.mp4 \
--interactive
Interactive Commands
During interactive mode:
@<MM:SS> <question> - Ask about specific timestamp
<question> - Ask about whole video
quit - Exit
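The timestamp prefix can be parsed with a small regex; this is an illustrative sketch (parse_command is a hypothetical name; the actual handler lives in scripts/realtime_inference.py):

import re

def parse_command(line):
    # "@MM:SS question" -> (seconds, question); plain text -> (None, question)
    m = re.match(r"@(\d{1,2}):(\d{2})\s+(.*)", line.strip())
    if m:
        return int(m.group(1)) * 60 + int(m.group(2)), m.group(3)
    return None, line.strip()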
Feature Extraction
Extract features for training or analysis:
python scripts/extract_features.py \
--video path/to/video.mp4 \
--output data/outputs \
--use-sam \
--fps 1.0
Training
LoRA Fine-tuning
python scripts/finetune_lora.py \
--data-dir data/training \
--output-dir outputs/lora_adapter \
--epochs 3 \
--lr 2e-4
Projector Training
python scripts/train_projectors.py \
--embeddings-dir data/outputs \
--lora-path outputs/lora_adapter \
--output-dir outputs \
--epochs 5
Training Methodology
LoRA Adapter Training
The Qwen3-VL model is fine-tuned using Low-Rank Adaptation on gameplay Q&A pairs:
- Target Modules: q_proj, k_proj, v_proj, o_proj
- Rank: 16
- Alpha: 32
- Learning Rate: 2e-4
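In peft terms, this corresponds roughly to the following configuration (a sketch; see scripts/finetune_lora.py for the actual setup):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                      # rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# `model` is the loaded Qwen3-VL base model; all non-LoRA weights stay frozen.
model = get_peft_model(model, lora_config)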
Projector Training
The projection layers (Linear → GELU → Linear) are trained with a generative alignment objective while the LLM remains frozen; gradients flow only through the projectors. The objective uses mean squared error (MSE) to pull the norm (magnitude) of each projected embedding toward a target of $\sqrt{d_{\text{LLM}}}$, where $d_{\text{LLM}} = 4096$ is the Qwen LLM’s hidden size, keeping projected features at a magnitude compatible with the LLM’s hidden states.
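With $d_{\text{LLM}} = 4096$, the target norm is $\sqrt{4096} = 64$. A minimal sketch of that loss, where projected is a batch of projected embeddings (the function name is illustrative):

import torch
import torch.nn.functional as F

def norm_alignment_loss(projected, llm_dim=4096):
    # Pull each embedding's L2 norm toward sqrt(llm_dim), the target described above.
    norms = projected.norm(dim=-1)
    target = torch.full_like(norms, llm_dim ** 0.5)
    return F.mse_loss(norms, target)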
Memory Requirements
| Component | VRAM (bfloat16) |
|---|---|
| Qwen3-VL-8B-Instruct | ~16 GB |
| SAM3 | ~4 GB |
| SigLIP | ~2 GB |
| VideoMAE | ~1 GB |
| Wav2Vec2/Whisper | ~1 GB |
| Total | ~24 GB |
Recommended: NVIDIA A100 (40/80 GB) or H100
Limitations
- Real-Time Processing Bottleneck: The current latency for generating full perception features is severely limited by the segmentation and masking model.
- SAM3 Detection Speed: Processing currently averages ~3.25 to 3.36 seconds per frame for full detection [A: 478, A: 10]. This prevents true real-time analysis and necessitates cascaded processing for efficiency.
- Whisper transcription adds processing time for audio-heavy content.
- OCR accuracy depends on video resolution and text clarity.
Future Work
High Priority
Cascaded Processing and Efficiency
Implement Trigger Detector: Integrate the TriggerDetector mechanism to enable selective analysis (cascaded processing). This system must monitor perception outputs (e.g., SAM3 detecting a ‘boss’ or Qwen2-Audio detecting an ‘explosion’) and only activate the high-cost reasoning core (Qwen LLM) when a significant, high-confidence event is detected [A: 147, A: 163, A: 418]; see the sketch after this list.
Integrate Temporal Context Management (HiCo): Activate the TemporalContextManager to use Hierarchical Token Compression (HiCo), ensuring the LLM receives a continuous, rolling compressed context representing the last 5–10 minutes of video via VideoMAE embeddings. This maintains long-range causal awareness while keeping token consumption low [A: 299, A: 405, A: 425].
Entity-Centric Knowledge Base: Fully utilize the KnowledgeBaseBuilder to ingest structured facts (entity IDs, state changes, bounding boxes) extracted by SAM3, transforming raw detections into explicit causal linkages for the LLM to reason over [A: 142, A: 535].
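To make the TriggerDetector idea concrete, a hypothetical gating interface (the threshold, watch labels, and method names are illustrative assumptions, not the repo's actual API):

class TriggerDetector:
    """Hypothetical gate: wake the reasoning core only on high-confidence events."""
    def __init__(self, threshold=0.8, watch_labels=("boss", "explosion")):
        self.threshold = threshold
        self.watch_labels = set(watch_labels)

    def should_reason(self, detections):
        # detections: iterable of (label, confidence) pairs from SAM3 / audio tagging
        return any(
            label in self.watch_labels and conf >= self.threshold
            for label, conf in detections
        )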
SigLIP Inference Speed
- Batch encode multiple regions simultaneously
- Use FP16/INT8 quantization for faster inference
- Implement async encoding with prefetching
- Explore SigLIP-Base for speed vs accuracy tradeoff
Multi-GPU Parallelization
- Pipeline parallelism: run SAM3, SigLIP, OCR, etc. on separate GPUs
- Data parallelism: split frames across GPUs for same model
- Async frame queues between pipeline stages
- Target 3-5x speedup with 4 GPUs
Causal Link Extraction
- Explicit action→effect pairing from timeline events (see the sketch after this list)
- Game state tracking (HP, mana, cooldowns)
- Rule-based causal graph construction
- Train causal reasoning module on gameplay data
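One possible rule-based pairing for the first bullet, purely illustrative (the event schema with "kind" and "t" fields is an assumption):

def pair_actions_with_effects(events, max_gap=2.0):
    # Pair each action with the first state change within max_gap seconds after it.
    effects = [e for e in events if e["kind"] == "state_change"]
    pairs = []
    for action in (e for e in events if e["kind"] == "action"):
        for eff in effects:
            if 0 < eff["t"] - action["t"] <= max_gap:
                pairs.append((action, eff))
                break
    return pairs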
Timeline Enrichment
- Integrate game-specific entity recognition
- Add damage number parsing from OCR
- Track character positions across frames
- Build entity relationship graphs
Medium Priority
Streaming Inference
- Real-time processing during video playback
- Incremental timeline updates
- Lower-latency response generation
Multi-Language Support
- Extend Whisper to detect and transcribe multiple languages
- Add OCR support for non-Latin scripts (Japanese, Chinese, Korean)
Model Optimization
- Quantize projectors to INT8
- Explore smaller LLM backbones (Qwen3-VL-4B)
- ONNX export for faster inference
Low Priority / Research
Game-Specific Adapters
- Train LoRA variants for specific game genres
- Add game state parsers for popular titles
Interactive Training
- Human-in-the-loop feedback for improving responses
- Active learning for edge cases
Evaluation Benchmarks
- Create gameplay video QA benchmark
- Metrics for causal reasoning accuracy
References
- Kirillov, A., et al. "Segment Anything." ICCV 2023.
- Zhai, X., et al. "SigLIP: Sigmoid Loss for Language Image Pre-Training." ICCV 2023.
- Tong, Z., et al. "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training." NeurIPS 2022.
- Baevski, A., et al. "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." NeurIPS 2020.
- Radford, A., et al. "Robust Speech Recognition via Large-Scale Weak Supervision." arXiv 2022.
- Qwen Team. "Qwen-VL: A Versatile Vision-Language Model." arXiv 2023.
License
MIT License