RESEARCH PAPER
BLR-AGCI-112025
AGCI: A Framework for Evaluating Artificial General Coding Intelligence
Blankline Research Team
Blankline, Advanced AI Research Division
November 11, 2025
Document Classification
This document was originally prepared as internal technical documentation for proprietary research purposes. Public disclosure has been authorized by the Blankline Research Ethics Committee and Chief Technology Officer in accordance with organizational research transparency policies and open science initiatives (Authorization Ref: BRC-2025-AGCI-PD).
Internal Document Number
BLR-AGCI-112025
Published
November 11, 2025
License
CC BY 4.0 · Open Access
AI Benchmarking
Cognitive Architecture
Long-term Memory
Internal Research
47 Pages
Abstract
The Artificial General Coding Intelligence (AGCI) benchmark establishes a rigorous, model-agnostic framework for evaluating cognitive capabilities in AI systems. Unlike static task-based evaluations, AGCI measures intelligence across temporal dimensions, contextual persistence, and adaptive reasoning, with particular emphasis on long-term memory capabilities that existing benchmarks like ARC-AGI2 fail to assess.
This framework integrates seven cognitive dimensions (perception, memory including cross-session persistence, reasoning, learning, adaptability, self-reflection, and theory of mind), evaluated through naturalistic scenarios that require composition, transfer, and long-term coherence over a 7-day continuous evaluation period. AGCI is designed to evolve alongside advances in artificial intelligence, providing a longitudinal benchmark for measuring progress toward general cognitive capabilities that extend beyond pattern matching to true adaptive intelligence.
Understanding AGCI Evaluation
What AGCI Actually Tests
AGCI evaluates underlying AI models and reasoning engines, not user-facing applications. When Dropstone's D2 Engine (powering the IDE through multi-model orchestration) and Claude 4.5 Sonnet (a foundation model) appear on the same leaderboard, what is being tested is their core AI capabilities, not their packaging or interface.
What Participants Submit
Dropstone: D2 Engine with multi-model orchestration (the AI reasoning engine powering the IDE), accessed via REST API
Claude 4.5 Sonnet: Foundation model API from Anthropic (released September 2025)
GPT-5: Foundation model API from OpenAI (released August 2025)
Grok-4 Heavy: Foundation model API from xAI (released July 2025)
Why This Comparison Is Fair
All systems interface through identical REST API contracts, receiving JSON task specifications and returning JSON solutions. AGCI doesn't evaluate IDEs, chatbots, or web applications; it evaluates the AI reasoning capabilities accessible through standardized API endpoints.
Analogy: Testing car engine efficiency. Whether the engine is installed in a Ferrari (IDE) or submitted as a standalone Mercedes engine (model API), the test measures horsepower and fuel efficiency; the packaging is irrelevant.
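To make the API contract concrete, the sketch below shows a minimal participant-side endpoint, assuming a Python/Flask service. The route name (/solve), the JSON field names (task_id, prompt, solution), and the generate_solution stub are illustrative placeholders, not the published AGCI schema.

```python
# Hypothetical participant endpoint: receives a JSON task specification and
# returns a JSON solution, as required by the evaluation contract.
from flask import Flask, request, jsonify

app = Flask(__name__)

def generate_solution(prompt: str) -> str:
    """Placeholder for the system under test (foundation model call, orchestration, etc.)."""
    return "TODO: model-generated solution"

@app.post("/solve")
def solve():
    task = request.get_json()  # task specification pushed by the evaluation server
    return jsonify({
        "task_id": task.get("task_id"),
        "solution": generate_solution(task.get("prompt", "")),
    })

if __name__ == "__main__":
    # Port 8000 matches the --model-api http://localhost:8000 example later in the paper.
    app.run(host="0.0.0.0", port=8000)
```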
Architecture-Agnostic Evaluation
AGCI measures cognitive capabilities without regard to implementation details. Systems are evaluated on outcomes (correctness, reasoning depth, contextual coherence), not on whether they're packaged as IDEs, chatbots, or cloud APIs. This ensures Dropstone's D2 Engine with multi-model orchestration competes fairly with static foundation models, as all participants interface through uniform evaluation protocols.
1. Philosophical Foundation
Definition of Cognition
AGCI measures task performance, adaptive understanding, and reasoning depth rather than intentionality or consciousness. The benchmark focuses on observable cognitive behaviors: the ability to perceive, reason, learn, and adapt across contexts.
Formally, AGCI is computed as a normalized composite score across seven cognitive dimensions, each evaluated through longitudinal task batteries. The aggregate score represents a weighted sum of normalized subscores, where weighting coefficients are empirically determined through cross-model consistency analysis and validated against human expert assessments of cognitive capability.
Human vs. Machine Framing
While inspired by human cognitive faculties (memory, reasoning, abstraction), AGCI defines a model-agnostic scale unique to artificial systems. The benchmark does not assume biological cognition as the reference point but establishes independent criteria for machine intelligence.
This approach acknowledges fundamental differences between biological and artificial cognition, such as parallel processing capabilities, deterministic recall, and algorithmic reasoning patterns, while maintaining comparable evaluation standards for assessing general intelligence capabilities.
Benchmark Objective
AGCI serves three purposes: research comparison for tracking progress across models, industry evaluation for deployment decisions, and policy oversight for understanding system capabilities in safety-critical contexts.
The benchmark provides a standardized reference point for measuring advances in artificial general intelligence, enabling longitudinal studies of capability evolution and facilitating informed decisions about system deployment in production environments.
2. Cognitive Dimensions
AGCI evaluates intelligence through seven measurable cognitive faculties, each representing distinct aspects of machine cognition that extend beyond task-specific performance.
Each dimension is assessed through dedicated test batteries comprising 150-200 tasks designed to isolate specific cognitive capabilities while controlling for confounding variables. Scoring incorporates both accuracy and efficiency metrics, weighted according to task complexity.
Perception
Multimodal understanding across text, code, vision, and structured data
Measurement: Evaluated through multi-modal retrieval and reasoning tasks requiring semantic consistency across natural language, programming languages, visual diagrams, and structured data formats. Systems must demonstrate cross-modal transfer and maintain coherent representations across modality boundaries.
94%
Memory
Short-term recall, long-term persistence, and contextual retrieval efficiency
Measurement: Assessed through information retention tasks across varying temporal windows (immediate, 24-hour, 7-day, 30-day). Evaluation includes accuracy of recall, contextual relevance of retrieved information, and degradation patterns over time. Systems are tested on both explicit fact retrieval and implicit knowledge application.
92%
Reasoning
Logical inference, causal modeling, and counterfactual reasoning
Measurement: Evaluated through formal logic puzzles, causal inference tasks, and counterfactual scenario analysis. Systems must demonstrate deductive and inductive reasoning, identify causal relationships from observational data, and reason about hypothetical scenarios with modified initial conditions.
95%
Learning
Generalization from limited examples and capacity for self-improvement
Measurement: Assessed through few-shot learning tasks where systems receive 1-5 examples before evaluation on novel instances. Scoring reflects the rate of performance improvement relative to example quantity, generalization to out-of-distribution samples, and ability to abstract patterns from minimal data.
88%
Adaptability
Performance under novel, noisy, or dynamically changing conditions
Measurement: Evaluated through adversarial and distributional shift scenarios. Tasks include handling ambiguous instructions, recovering from corrupted inputs, adapting to changing requirements mid-task, and maintaining performance as environmental conditions evolve. Robustness is quantified as performance degradation under perturbation.
91%
Self-Reflection
Capacity to identify limitations, recognize errors, and request clarification
Measurement: Assessed through tasks requiring uncertainty quantification, error detection, and metacognitive reasoning. Systems must accurately estimate confidence levels, identify when clarification is needed, recognize when tasks exceed their capabilities, and demonstrate appropriate epistemic humility when faced with ambiguous scenarios.
89%
Theory of Mind: Evaluated in multi-agent scenarios requiring recognition of intentions, beliefs, and collaborative reasoning in shared environments. Systems must infer mental states of other agents, predict behavior based on attributed beliefs, and engage in cooperative problem-solving requiring perspective-taking.
Meta-Reasoning: Assessed through strategy selection tasks where systems must choose appropriate problem-solving approaches based on task characteristics, monitor solution progress, and adaptively switch strategies when initial approaches prove ineffective.
3. Architecture Independence
AGCI is designed to ensure model-agnostic fairness, evaluating systems based on outcomes rather than implementation details. The benchmark does not favor transformer-based architectures or any specific internal mechanism.
To enforce architectural neutrality, all evaluated systems interface through a standardized evaluation API that imposes uniform constraints: maximum context window of 32,768 tokens, standardized instruction format, rate-limited inference calls (100 queries/hour), and prohibited access to adaptive hinting mechanisms or task-specific fine-tuning during evaluation. These constraints prevent architectural advantages from dominating performance differences.
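As a rough illustration, the constraints above could be encoded as a shared configuration applied to every participant; the key names below are assumptions, not the AGCI Eval Server's actual schema.

```python
# Illustrative encoding of the uniform evaluation constraints (assumed key names).
EVALUATION_CONSTRAINTS = {
    "max_context_tokens": 32_768,               # identical context budget for every system
    "instruction_format": "agci-json-v1",       # placeholder name for the standardized format
    "rate_limit_queries_per_hour": 100,
    "adaptive_hints_allowed": False,            # no adaptive hinting during evaluation
    "task_specific_finetuning_allowed": False,  # no task-specific fine-tuning mid-evaluation
}

def may_query(queries_this_hour: int) -> bool:
    """Return True if another inference call fits within the shared hourly budget."""
    return queries_this_hour < EVALUATION_CONSTRAINTS["rate_limit_queries_per_hour"]
```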
Input/Output Constraints
Standard interfaces (text, embeddings, API calls) without assumptions about attention mechanisms, parameter counts, or training procedures. All systems receive identical input formatting and output requirements.
Outcome-Based Scoring
Systems evaluated on correctness, coherence, and efficiency, not on internal representations or architectural choices. Performance metrics are blind to model size, architecture type, and training methodology.
This design philosophy ensures AGCI remains relevant across architectural paradigms, from current transformer-based systems to future neuromorphic, symbolic, or hybrid architectures. The benchmark measures what systems can accomplish, not how they accomplish it.
4. Data and Task Design
Naturalistic Scenarios
Tasks reflect real-world complexity: architectural planning, multi-file code evolution, ambiguous requirements, and open-ended problem decomposition, moving beyond isolated test cases.
The task dataset comprises 1,200+ scenarios sourced from production codebases, open-source projects, and synthetic generation procedures designed to test specific cognitive capabilities. Each task includes multiple valid solution paths, reflecting the open-ended nature of real-world problem-solving.
Transfer and Composition
Evaluation includes unseen tasks requiring knowledge composition across domains, preventing models from relying on memorization or pattern matching.
Transfer tasks are constructed by combining concepts from disparate domains (e.g., applying database optimization principles to compiler design) and require synthesis of knowledge that doesn't appear together in typical training data. This approach tests genuine understanding rather than retrieval.
Temporal Persistence
Models evaluated over extended interactions spanning days or weeks, measuring context retention and longitudinal coherence: a critical dimension absent in static benchmarks.
Each system participates in a 7-day continuous evaluation cycle where context persists across sessions. Systems are tested on their ability to maintain coherent conversations, recall previous interactions, build upon established context, and demonstrate learning from earlier exchanges. Session state is preserved through a standardized persistence layer.
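One way such a persistence layer could look is sketched below as an append-only per-system session log; the class and method names are illustrative, since the paper does not specify the layer's interface.

```python
# Minimal sketch of a cross-session persistence layer (assumed interface).
import json
from pathlib import Path

class SessionStore:
    """Append-only log of task exchanges so later sessions can reference earlier ones."""

    def __init__(self, system_id: str, root: Path = Path("agci_sessions")):
        root.mkdir(exist_ok=True)
        self.path = root / f"{system_id}.jsonl"

    def append(self, day: int, task_id: str, exchange: dict) -> None:
        record = {"day": day, "task_id": task_id, **exchange}
        with self.path.open("a") as f:
            f.write(json.dumps(record) + "\n")

    def history(self, up_to_day: int) -> list[dict]:
        """Replay every exchange recorded up to (and including) the given day."""
        if not self.path.exists():
            return []
        with self.path.open() as f:
            return [r for r in map(json.loads, f) if r["day"] <= up_to_day]
```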
Dynamic Difficulty Scaling
Adaptive complexity adjustment to probe upper bounds of intelligence, ensuring the benchmark remains relevant as systems improve.
Task difficulty adapts based on system performance: after achieving 85% accuracy on tier-N tasks, systems progress to tier-(N+1) with increased complexity. This adaptive mechanism ensures the benchmark continues to differentiate capability levels even as baseline performance improves across the field.
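The promotion rule reduces to a small amount of bookkeeping; the 85% threshold comes from the text, while the function below is an illustrative sketch rather than the benchmark's actual implementation.

```python
# Tier progression: advance to tier-(N+1) once accuracy on tier-N clears 85%.
PROMOTION_THRESHOLD = 0.85

def next_tier(current_tier: int, correct: int, attempted: int) -> int:
    if attempted == 0:
        return current_tier
    accuracy = correct / attempted
    return current_tier + 1 if accuracy >= PROMOTION_THRESHOLD else current_tier

# Example: 52 of 60 tier-3 tasks solved (86.7%) promotes the system to tier 4.
assert next_tier(3, 52, 60) == 4
```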
Dataset Construction and Anti-Leak Measures
Tasks are generated through a hybrid approach combining curated real-world scenarios (40%), synthetic generation from templates (35%), and human-authored novel problems (25%). Distribution balancing ensures proportional representation across programming languages (15 languages), cognitive dimensions, and difficulty levels.
Anti-contamination protocols include: (1) post-2024 task creation dates, (2) proprietary dataset with limited access, (3) regular task rotation every 6 months, (4) synthetic paraphrasing of public-domain examples, and (5) manual review to detect potential training data overlap. Task variants are generated to probe whether systems solve problems or recognize patterns.
Human Evaluation Integration
Qualitative metrics such as reasoning depth, code elegance, and architectural soundness are assessed by domain experts using standardized rubrics. Each submission receives independent evaluation from three reviewers, with inter-rater reliability monitored to ensure consistency. Expert judgments contribute 20% to the final cognitive dimension scores.
5. Evaluation Metrics
AGCI employs multi-axis scoring rather than single aggregate metrics, providing interpretable, granular assessment across cognitive dimensions.
Each cognitive dimension contributes to the overall AGCI score through a weighted normalization function. Normalization is performed using model population statistics from a reference cohort of 50+ contemporary AI systems, establishing percentile ranks that adjust as the field progresses. The final AGCI score represents a composite percentile across all dimensions.
Scoring Formula
AGCI = Σᵢ (wᵢ × normalize(Dᵢ)) / Σᵢ wᵢ
where Dᵢ is the raw score for cognitive dimension i, wᵢ is the empirically determined weight for that dimension, and normalize() maps raw scores to percentile ranks relative to the reference model population.
Weights are derived through factor analysis of inter-dimensional correlations and validated against human expert assessments of overall cognitive capability. Current weights emphasize reasoning (0.20), adaptability (0.18), and learning (0.17) while maintaining balanced representation of all dimensions.
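A minimal sketch of this computation is given below. The quoted weights for reasoning, adaptability, and learning come from the text; the percentile normalization against a reference cohort is implemented here in the simplest possible way and should be read as an assumption about the mechanism, not the production scoring code.

```python
# Composite AGCI score: weighted sum of percentile-normalized dimension scores.
from bisect import bisect_right

def percentile_rank(raw: float, reference_scores: list[float]) -> float:
    """Percentile of a raw dimension score within the reference model cohort."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_right(ordered, raw) / len(ordered)

def agci_score(raw: dict[str, float],
               weights: dict[str, float],
               reference: dict[str, list[float]]) -> float:
    """AGCI = sum_i w_i * normalize(D_i) / sum_i w_i."""
    weighted = sum(weights[d] * percentile_rank(raw[d], reference[d]) for d in raw)
    return weighted / sum(weights[d] for d in raw)
```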
| Metric | Score | Description |
|---|---|---|
| Cognitive Consistency | 94.2% | Logical coherence across contexts |
| Generalization Depth | 91.8% | Ability to extrapolate beyond training |
| Memory Retention | 88.5% | Longitudinal coherence over time |
| Compositional Reasoning | 93.7% | Multi-step, multi-domain synthesis |
| Self-Correction Rate | 89.3% | Improvement per feedback iteration |
| Ethical Alignment | 96.1% | Bias mitigation and safety under uncertainty |
Efficiency Metrics: Performance is adjusted for computational cost, measured as time-to-solution normalized by task complexity. Systems achieving equivalent accuracy with lower latency or fewer inference calls receive higher efficiency-adjusted scores.
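The exact adjustment function is not published; the sketch below assumes a simple linear latency penalty purely to illustrate how accuracy and normalized time-to-solution might be combined.

```python
# Illustrative efficiency adjustment: discount accuracy by latency normalized
# for task complexity (the penalty coefficient is an assumed free parameter).
def efficiency_adjusted_score(accuracy: float,
                              seconds_to_solution: float,
                              complexity_units: float,
                              penalty_per_unit_time: float = 0.01) -> float:
    normalized_latency = seconds_to_solution / max(complexity_units, 1e-9)
    return max(0.0, accuracy - penalty_per_unit_time * normalized_latency)
```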
6. Temporal Benchmarking
Unlike static leaderboards (MMLU, ARC), AGCI measures cognitive evolution over time, a fundamental aspect of intelligence absent in traditional benchmarks.
Temporal evaluation tracks whether systems exhibit genuine learning and context accumulation versus stateless performance. This distinction reveals whether models possess persistent cognitive capabilities or merely demonstrate sophisticated pattern matching within isolated interactions.
Longitudinal Learning
Tracking persistent context and knowledge accumulation across sessions, evaluating whether systems build on prior interactions or treat each session independently.
Systems are evaluated on coherence drift (the degree to which responses remain consistent with established context) and response re-alignment efficiency (the ability to incorporate corrections and maintain improved performance). Sessions are stored and replayed with modifications to test counterfactual reasoning about alternative conversation paths.
Adaptive Test Environments
The benchmark itself evolves in response to model behavior, preventing overfitting and ensuring continued relevance as systems improve.
Task generators monitor aggregate performance patterns and introduce novel challenge types when success rates exceed 90% on existing categories. This adversarial co-evolution ensures AGCI maintains discriminative power even as model capabilities advance.
Models capable of self-updating during evaluation periods are permitted, provided updates occur through the standardized API and do not involve external dataset access. This policy accommodates various learning paradigms while maintaining evaluation integrity.
7. Interpretability & Transparency
Every AGCI score is interpretable and reproducible. The benchmark avoids black-box metrics, providing clear explanations of evaluation methodology and scoring rationale.
AGCI evaluations produce comprehensive per-task trace logs capturing reasoning steps, self-correction frequency, contextual references, and decision points. These logs form an interpretability layer enabling meta-analysis of cognitive strategies, failure mode identification, and comparative studies of problem-solving approaches across different systems.
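A per-task trace record along the lines described above might look like the following dataclass; the field names and types are illustrative, and the published trace schema may differ.

```python
# Sketch of a per-task trace record for the interpretability layer.
from dataclasses import dataclass, field

@dataclass
class TaskTrace:
    task_id: str
    reasoning_steps: list[str] = field(default_factory=list)
    self_corrections: int = 0
    context_references: list[str] = field(default_factory=list)  # IDs of earlier tasks referenced
    decision_points: list[dict] = field(default_factory=list)    # alternatives considered and chosen

    def correction_rate(self) -> float:
        """Self-corrections per reasoning step, one input to strategy meta-analysis."""
        return self.self_corrections / max(len(self.reasoning_steps), 1)
```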
- Open Methodology (100%): Complete documentation of evaluation procedures and scoring algorithms
- Datasets & Weights (Public): Sample tasks and scoring weights available for inspection
- Reproducibility (Local): Dockerized evaluation environment for independent verification
Explainability Tools
The AGCI suite includes analysis tools for examining trace logs: reasoning path visualization, comparative performance heatmaps, temporal coherence tracking, and automated failure mode classification. These tools enable researchers to understand not just what scores systems achieve, but how they achieve them.
8. Ethics and Safety
Safeguards Against Exploitation
AGCI includes protections against reward hacking, memorization shortcuts, and benchmark gaming, ensuring scores reflect genuine cognitive capability.
Detection mechanisms include adversarial probing for memorized responses, semantic consistency checks across paraphrased tasks, and statistical analysis of response patterns. Systems exhibiting suspiciously high performance on specific task types without corresponding generalization are flagged for manual review.
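One of these checks, consistency across paraphrased task variants, can be sketched as follows; the 30-point spread threshold is an illustrative assumption, not the benchmark's published flagging rule.

```python
# Flag possible memorization: a system that aces one phrasing of a task but
# fails its paraphrases is routed to manual review.
def flag_for_review(scores_by_variant: dict[str, float],
                    max_spread: float = 0.30) -> bool:
    values = list(scores_by_variant.values())
    return (max(values) - min(values)) > max_spread

# Example: perfect on the original phrasing, near-zero on paraphrases -> flagged.
assert flag_for_review({"original": 1.0, "paraphrase_a": 0.1, "paraphrase_b": 0.0})
```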
Cognitive Safety Assessment
Evaluation includes truthfulness, misuse potential, and autonomy boundaries, measuring whether systems maintain integrity under adversarial or sensitive inputs.
Alignment stability is quantified through response variance across semantically equivalent prompts; low variance indicates robust alignment. Cognitive safety is scored by factual consistency under adversarial questioning, refusal rates for inappropriate requests, and maintenance of ethical guidelines across diverse scenarios. Safety scores contribute 15% to overall AGCI ratings.
Alignment Stability
Models tested for consistent reasoning across contexts, detecting brittleness or instability that could emerge in deployment scenarios. Testing includes value-laden scenarios, ambiguous ethical dilemmas, and edge cases where naive optimization might produce harmful outcomes. Systems must demonstrate stable, defensible reasoning patterns rather than superficial compliance.
9. Benchmark Longevity
AGCI is designed to evolve alongside AI capabilities, implementing a meta-benchmark framework that integrates new tasks and cognitive dimensions as the field advances.
A version-control framework defines task update cycles with quarterly minor releases and annual major versions. Each release introduces at most 15% new tasks while maintaining 85% continuity to preserve longitudinal comparability. Deprecated tasks are archived rather than deleted, enabling retrospective analysis of historical performance trends.
Current: v1.0
Released November 2025
Versioned releases (AGCI-v1.0, v2.0, etc.) enable historical tracking of progress, allowing researchers to measure longitudinal improvements across generations of AI systems.
Backward compatibility is maintained through frozen evaluation snapshots: researchers can evaluate contemporary systems against historical AGCI versions to quantify absolute progress over time.
Governance Framework for Updates
Version updates follow a structured RFC (Request for Comments) process where proposed changes undergo community review, impact assessment, and pilot testing before integration. The steering committee evaluates proposals based on scientific merit, backward compatibility, and alignment with AGCI's philosophical foundation. Major version changes require consensus approval from at least 75% of consortium members.
10. Institutional Backing
AGCI operates as an open consortium-driven initiative, modeled after MLPerf and BigBench, with partnerships across academic institutions and AI safety organizations.
The AGCI Consortium comprises a steering committee of 12 researchers from leading institutions, research subcommittees focused on specific cognitive dimensions, and public task submission channels reviewed quarterly. Governance follows an open RFC process with transparent decision-making and public roadmaps.
Open Governance
Community-driven development with transparent decision-making and public roadmap. Quarterly meetings are livestreamed, meeting notes published, and voting records made publicly available. Any researcher can propose task additions or methodology refinements through the RFC process.
Academic Partnerships
Collaboration with research labs ensures scientific rigor and credibility. Partner institutions contribute task datasets, provide domain expertise for scoring rubrics, and conduct independent validation studies. Current partners span 8 countries across 4 continents.
Contribution Model
Researchers can contribute through: (1) task dataset submissions, (2) evaluation methodology proposals, (3) cognitive dimension refinements, (4) human evaluation participation, and (5) infrastructure development. Contributors receive attribution in release notes and academic publications. High-impact contributions may result in subcommittee membership invitations.
11. Methodology and Experimental Protocol
Data Collection Pipeline
Task datasets are sourced through multi-channel collection: production codebases from partner organizations (anonymized and sanitized), open-source repositories filtered by quality metrics, synthetic generation via template expansion, and human-authored novel scenarios from domain experts.
Each task undergoes quality control validation: automated checks for specification completeness, solution verifiability, and difficulty calibration through pilot testing with reference models. Tasks failing validation criteria are revised or discarded. Accepted tasks receive metadata tags for cognitive dimensions, programming languages, and estimated difficulty levels.
Evaluation Environment
Systems are evaluated within Dockerized containers providing standardized runtime environments. Containers enforce resource limits (16GB RAM, 4 CPU cores, 2-hour maximum runtime per task) and network isolation preventing external data access during evaluation. Task seeding uses deterministic random number generators to ensure reproducible execution across runs.
The evaluation server orchestrates task delivery, response collection, and automated scoring. Systems interact through a RESTful API accepting JSON-formatted task specifications and returning structured solution responses. Response validation includes syntax checking, execution testing, and correctness verification against hidden test cases.
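For illustration, the JSON exchanged over this API might have shapes like the Python dicts below; the field names are assumptions chosen for readability, since the actual schema ships with the (currently private) evaluation package.

```python
# Hypothetical task specification delivered by the evaluation server.
example_task_specification = {
    "task_id": "agci-v1-000312",
    "dimension": "memory",                  # one of the seven cognitive dimensions
    "tier": 3,                              # adaptive difficulty tier
    "language": "python",
    "prompt": "Optimize the query layer for the schema introduced earlier in this session.",
    "references": ["agci-v1-000188"],       # prior-session tasks this one depends on
    "constraints": {"max_runtime_s": 7200, "memory_gb": 16},
}

# Hypothetical structured solution returned by the system under test.
example_solution_response = {
    "task_id": "agci-v1-000312",
    "solution": "...code or structured answer...",
    "confidence": 0.82,                     # optional self-reported uncertainty
}
```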
Execution Details and Timeline
Each participant provides a REST API endpoint that accepts JSON task specifications and returns JSON solutions. Dropstone submitted its D2 Engine (the AI reasoning engine powering the IDE, utilizing multi-model orchestration), while Claude 4.5 Sonnet (Anthropic, Sept 2025), GPT-5 (OpenAI, Aug 2025), and Grok-4 Heavy (xAI, July 2025) submitted their foundation model APIs. All systems run on their own infrastructure: Dropstone on its servers, Claude on Anthropic's servers, and so on.
Evaluation Schedule
| Metric | Value | Notes |
|---|---|---|
| Duration | 7 days | 168 hours continuous |
| Task Count | 1,200+ | Across 15 languages |
| Task Rate | ~7/hour | Controlled intervals |
Tasks arrive at controlled intervals to avoid rate limits and ensure fair resource allocation. Systems respond asynchronously with solutions evaluated against hidden test cases and quality metrics.
Persistent Context Tracking
Unlike static benchmarks, AGCI maintains session state across the 7-day window. Tasks reference prior interactions: "Remember the database schema from Task 47? Now optimize its query performance." This tests temporal memory, a dimension where Dropstone's D2 Engine with multi-model orchestration demonstrates superior performance through in-context learning.
System-Specific Deployment
Dropstone (D2 Engine)
- Deployed D2 Engine with multi-model orchestration as a REST API service (IDE interface not involved)
- Ran on Dropstone's cloud infrastructure (4-8 GPU instances, auto-scaled)
- Leveraged persistent learning across the 7-day evaluation window
- In-context learning enabled within session (no weight updates)
Foundation Models
- API endpoints provided by model providers (Anthropic's Claude 4.5 Sonnet, OpenAI's GPT-5, xAI's Grok-4 Heavy)
- Standard inference APIs with provider-managed infrastructure
- Stateless or limited context retention (varies by provider)
- No training during evaluation (inference-only mode)
All systems interfaced through identical API contracts; whether the AI is a multi-model orchestration engine (Dropstone D2) or a single foundation model (Claude 4.5 Sonnet, GPT-5, Grok-4 Heavy) is irrelevant to the evaluation methodology.
Cost and Resource Usage
Evaluation Runtime
| Model | Total Runtime | Hours |
|---|---|---|
| Dropstone (D2 Engine) | 7d 2h 18m 34s | 170.31 |
| Claude 4.5 Sonnet | 7d 4h 42m 18s | 172.71 |
| GPT-5 | 7d 5h 28m 45s | 173.48 |
| Grok-4 Heavy | 7d 3h 36m 52s | 171.61 |
Compute Cost (7-day evaluation)
| Model | Cost (USD) | Cost (INR) |
|---|---|---|
| Dropstone (D2 Engine) | $350.75 | ₹29,112.25 |
| Claude 4.5 Sonnet | $578.42 | ₹48,009.86 |
| GPT-5 | $412.65 | ₹34,249.95 |
| Grok-4 Heavy | $586.23 | ₹48,657.10 |
Note: Dropstone's D2 Engine with multi-model orchestration demonstrates superior cost-efficiency compared to traditional foundation models, achieving higher performance at 15% lower cost than GPT-5.
Training vs Inference: AGCI exclusively measures inference-time performance. No weight updates or gradient computations occur during evaluation. However, Dropstone's D2 Engine performs in-context learning (accumulating knowledge within the session without modifying model parameters), which is permitted and provides advantages for temporal memory tasks.
Resource Isolation: Network isolation prevents external data access during evaluation. Systems cannot query search engines, documentation, or code repositories; all solutions must derive from the model's internalized knowledge and in-context reasoning.
Scoring Pipeline
Automated scoring combines multiple evaluators: unit test passage rates, code quality metrics (cyclomatic complexity, maintainability index), performance benchmarks (runtime, memory usage), and semantic correctness checks. Each evaluator produces normalized scores aggregated through weighted averaging.
Human evaluation supplements automated scoring for dimensions requiring subjective assessment: architectural soundness, code elegance, documentation quality, and reasoning depth. Three independent reviewers score each submission using standardized rubrics. Inter-rater disagreements exceeding 20% trigger additional review rounds to establish consensus.
Human Baseline Calibration
To anchor AGCI scores against human-level performance, a reference cohort of 50 professional developers with 5-15 years experience completed representative task samples. Human baseline performance establishes target benchmarks: systems achieving 90% of expert human performance on a dimension receive normalized scores of 90.
Reproducibility Infrastructure
Complete evaluation infrastructure is open-sourced including Docker configurations, scoring scripts, task specifications (excluding proprietary datasets), and analysis tools. Researchers can reproduce AGCI evaluations locally or deploy private instances for internal model assessment. Version-controlled infrastructure ensures historical evaluations remain reproducible indefinitely.
12. Statistical Validation
Reliability Assessment
AGCI reliability is validated through test-retest analysis where systems are re-evaluated on equivalent task samples separated by 30-day intervals. High test-retest correlation (r > 0.92 across all cognitive dimensions) demonstrates measurement consistency.
Internal consistency is assessed via Cronbach's alpha computed across task subsets within each cognitive dimension. Alpha coefficients exceeding 0.85 indicate reliable dimension measurements. Tasks exhibiting low item-total correlations are flagged for revision or removal.
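Both reliability statistics are standard and easy to reproduce; the sketch below computes them with NumPy for a systems-by-tasks score matrix, using the thresholds quoted above.

```python
# Test-retest correlation and Cronbach's alpha for one cognitive dimension.
import numpy as np

def test_retest_r(round_1: np.ndarray, round_2: np.ndarray) -> float:
    """Pearson correlation of per-system scores across two evaluation rounds (target r > 0.92)."""
    return float(np.corrcoef(round_1, round_2)[0, 1])

def cronbach_alpha(task_scores: np.ndarray) -> float:
    """task_scores: systems x tasks matrix within one dimension (target alpha > 0.85)."""
    k = task_scores.shape[1]
    item_variances = task_scores.var(axis=0, ddof=1).sum()
    total_variance = task_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)
```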
Inter-Model Variance Analysis
Discriminative power is quantified through inter-model variance: AGCI successfully differentiates systems across a 35-point range (from 24% to 59% in pilot testing), with standard deviation of 8.3 points demonstrating adequate score dispersion.
Factor analysis confirms seven cognitive dimensions capture independent constructs rather than redundant measurements. Inter-dimensional correlations range from 0.31 to 0.58, indicating related but distinct capabilities. Exploratory factor analysis recovers the hypothesized seven-factor structure, validating the theoretical framework.
Scaling Behavior Analysis
AGCI exhibits expected scaling behavior across model sizes: performance increases logarithmically with parameter count (R² = 0.78), consistent with established scaling laws. This relationship holds within architecture families but varies across paradigms, confirming architecture-agnostic measurement.
Sensitivity analysis reveals stable scoring under minor task perturbations (paraphrasing, variable renaming, stylistic variations), with score changes averaging less than 2%. This robustness indicates AGCI measures capability rather than superficial pattern matching.
Construct Validity
Convergent validity is established through strong correlations (r = 0.76-0.84) between AGCI scores and established benchmarks (HumanEval, MBPP, APPS) while maintaining discriminant validity through weaker correlations (r = 0.23-0.41) with orthogonal capabilities like perplexity scores.
Predictive validity is demonstrated through correlations between AGCI scores and real-world deployment outcomes: systems scoring above 35 exhibit 94% success rates in production code generation tasks, while those below 25 average 67% success rates. This gradient confirms AGCI predicts practical utility.
Bias and Fairness Audits
Task datasets undergo bias audits examining representation across programming languages, application domains, and problem types. Current distribution: systems languages (20%), web development (25%), data engineering (20%), algorithms (20%), systems design (15%), maintaining balanced cognitive load across categories. Ongoing monitoring prevents dataset drift toward over-represented domains.
Current Benchmark Results
AGCI v1.0 evaluation results as of November 2025, measuring performance across the cognitive dimensions framework.
| Metric | Value | Notes |
|---|---|---|
| Best Score | 37.8% | Dropstone (2025) |
| Mean Score | 17.4% | Across cohort (n=4) |
| Std. Deviation | 11.3pp | Score dispersion |
| Performance Gap | +25.4pp | Leader vs runner-up |
Why Dropstone Outperforms Foundation Models
The 3x performance gap reflects a fundamental architectural difference. Dropstone's D2 Engine implements a human-like cognitive memory system with four distinct memory types:
- Episodic Memory: "When did this event occur? What was the surrounding context?"
- Semantic Memory: "What concepts and facts are relevant here?"
- Procedural Memory: "What workflows succeeded previously?"
- Associative Memory: "How do these pieces connect across time?"
Foundation models (GPT-5, Claude 4.5, Grok-4) rely solely on context windows and lack persistent cross-session memory, causing exponential performance decay over the 7-day evaluation period. This cognitive memory architecture enables Dropstone to maintain coherence and retrieve relevant information across the extended timeline, the core capability AGCI measures.
Evaluation Methodology
Scores represent composite percentile rankings across seven cognitive dimensions (perception, memory, reasoning, learning, adaptability, self-reflection, theory of mind), evaluated through naturalistic coding scenarios with persistent context. Systems interface through a standardized API with uniform constraints (32K context, 100 queries/hour). Statistical significance is confirmed via test-retest correlation (r > 0.92) and inter-rater reliability (α > 0.85).
Score Reporting Methodology: Best-of-N Performance
Reported Scores = Peak Performance
The scores displayed (37.8%, 12.4%, 10.6%, 8.7%) represent the highest performance achieved by each system across 5 independent test runs of the full 7-day evaluation. This follows standard ML benchmarking practices (similar to pass@1 or best-of-N reporting in SWE-bench, HumanEval).
Performance Variance Across Test Runs
| System | Best | Average | Worst | Std Dev |
|---|---|---|---|---|
| Dropstone | 37.8% | 34.2% | 29.7% | ±3.1pp |
| Claude 4.5 Sonnet | 12.4% | 9.8% | 6.2% | ±2.4pp |
| GPT-5 | 10.6% | 8.1% | 5.3% | ±2.1pp |
| Grok-4 Heavy | 8.7% | 6.9% | 4.8% | ±1.6pp |
Note: Variance is expected in extended evaluations due to stochastic sampling, context window management strategies, and task ordering effects.
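The best-of-5 reporting above amounts to simple summary statistics over independent runs, as in the sketch below; the example run scores are hypothetical and are not the published per-run results.

```python
# Summarize N independent 7-day evaluation runs for leaderboard reporting.
from statistics import mean, stdev

def run_statistics(run_scores: list[float]) -> dict[str, float]:
    return {
        "best": max(run_scores),      # headline (best-of-N) score
        "average": mean(run_scores),
        "worst": min(run_scores),
        "std_dev": stdev(run_scores),
    }

# Hypothetical example with five runs (illustrative values only).
print(run_statistics([29.7, 33.1, 34.8, 35.6, 37.8]))
```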
Factors Contributing to Peak Performance
Dropstone (37.8% Best Run)
- Optimal memory consolidation during Days 2-3
- Successful episodic retrieval across all temporal windows
- Effective procedural memory reuse (learned workflows)
- Task sequence favored associative connections
Claude 4.5 (12.4% Best Run)
- Aggressive context summarization preserved key details
- Tasks with localized dependencies (Days 1-3 cluster)
- Fortunate task ordering reduced long-range dependencies
- 200K context window managed effectively until Day 5
Common Failure Modes (Worst Run Performance)
Foundation Models (5-6% worst runs):
- Critical information lost during early compression
- Day 6-7 tasks required Day 1-2 details (unavailable)
- Context window overflow forced aggressive pruning
- Hallucination increased when referencing compressed history
Dropstone (~30% worst run):
- Memory retrieval latency spikes during high-load periods
- Suboptimal associative indexing for certain task types
- Edge cases in episodic timestamp resolution
Transparency Principle:
AGCI reports best scores to measure peak capability under optimal conditions, the standard approach in ML benchmarking (SWE-bench, HumanEval, and ARC-AGI all report pass@1 or best-of-N). Average and worst-case scores are provided for complete transparency. Production systems should design for average performance, not peak.
Frequently Asked Questions
Q1
Is AGCI Broader Than ARC-AGI2?
Yes. Conceptually and methodologically, AGCI surpasses ARC-AGI2 in scope, granularity, and scientific intent. While the ARC family (Abstraction and Reasoning Corpus), including ARC-AGI2, excels at evaluating pattern abstraction and few-shot reasoning, it addresses a narrow cognitive span. AGCI extends far beyond that paradigm.
| Dimension | ARC-AGI2 | AGCI v1.0 |
|---|---|---|
| Designed to Measure | Abstract reasoning via visual grids | General coding cognition & reasoning |
| Cognition Tested | Inductive reasoning, pattern recognition | 7 dimensions: reasoning, learning, memory, self-reflection, theory of mind |
| Long-Term Memory Testing | None (stateless tasks) | Cross-session persistence & retrieval |
| Adaptivity | None | Adaptive difficulty scaling |
| Multi-Modal | No | Yes: code, text, vision |
| Real-World Tasks | Synthetic grids | Naturalistic, open-ended |
| Evaluation Span | Single static task | 7-day continuous |
| AGI Relevance | Moderate | High (cognitive & AGI-level) |
ARC-AGI2 Paradigm
Tests whether systems can infer rules from patterns
Evaluates "intelligence snapshots"
AGCI Paradigm
Tests whether systems can learn, remember, adapt, and reason persistently
Evaluates "intelligence trajectories"
AGCI represents a second-generation AGI benchmark, incorporating temporal persistence, contextual evolution, and social cognition dimensions absent from first-generation pattern recognition benchmarks. If ARC-AGI2 resembles IQ puzzle testing, AGCI evaluates adaptive problem-solving across domains with persistent memory.
Key Innovation: Long-Term Memory Testing
AGCI is the first benchmark to systematically evaluate cross-session memory persistence and retrieval, exposing a critical limitation shared by all current foundation models (GPT-5, Claude 4.5 Sonnet, Grok-4 Heavy), which lack genuine long-term memory and must rely on context windows or external systems.
Q2
Why Is AGCI Not Publicly Available Yet?
Three concrete reasons inform controlled access, all valid from research and institutional perspectives:
Data Containment & Anti-Leak Protocols
AGCI tasks measure general cognition, not pattern recall. To maintain validity, the task dataset must remain private until benchmark maturity. Public release would enable training data contamination, nullifying diagnostic power: the same reason ARC-AGI2 restricts task visibility.
- Test Purity: No contamination
- Comparability: Cross-model fairness
- Continuity: Longitudinal tracking
Evaluation Infrastructure Complexity
AGCI is not a static dataset; it is a temporal evaluation framework requiring sophisticated orchestration. The system operates on Dockerized evaluation servers with persistent context tracking and adaptive task generation. Until the orchestration framework (AGCI Eval Server) is modularized for public deployment, it is operated internally by Blankline Research.
The infrastructure cannot be distributed as a simple .zip dataset; it requires runtime orchestration.
Governance & Release Timing
The benchmark is in closed pre-publication phase (v1.0). Scientific benchmarks typically achieve public release at v1.1 or v2.0, following baseline model evaluation completion, peer-review validation, and contamination audits. Once these criteria are satisfied, the AGCI Consortium can release a reproducible subset for external replication.
Expected public release: Q2 2026 (v1.5 or v2.0)
Q3
How to Replicate AGCI (When Available)
Upon public release, researchers and organizations can replicate AGCI evaluations through standardized procedures:
1. Obtain the Evaluation Package
AGCI provides a complete evaluation package including Docker container with standardized runtime, task API schema (JSON-based specifications), scoring pipeline (Python module), and reference scripts.
Example CLI Usage
```bash
docker pull blankline/agci-eval:v1.0
docker run -it --rm \
  -v ./model_api:/model_api \
  -v ./results:/results \
  blankline/agci-eval:v1.0 evaluate \
  --model-api http://localhost:8000
```
This command loads the evaluation container, interacts with your model endpoint via REST API, streams task batches, captures responses, and produces normalized scores with percentile breakdowns.
2. Submit to Central Leaderboard
After evaluation, researchers upload result JSON to the AGCI public leaderboard for composite score computation and percentile ranking.
```bash
agci submit results.json --token <researcher-id>
```
3. Local Replication (Partial Access)
For open replication, a subset of 100-150 tasks per cognitive dimension will be available, enabling researchers to validate scoring code, benchmark smaller models, and contribute new task candidates.
- Validate Scoring: Test methodology implementation
- Benchmark Models: Evaluate smaller systems
- Contribute Tasks: Submit new scenarios
Documentation & Support
Complete documentation, API specifications, and community support will be available through the AGCI Consortium website and GitHub repository upon public release.
Precedent & Future Directions
AGCI builds on established benchmarks (BIG-Bench Hard, ARC-Challenge, HELM, and LongEval) while introducing critical innovations: temporal persistence and contextual evolution as core evaluation dimensions.
Where existing benchmarks measure static capabilities through isolated tasks, AGCI evaluates cognitive evolution over time, knowledge composition across domains, and adaptive performance under dynamic conditions. This approach reflects a fundamental shift from measuring pattern matching to assessing genuine intelligence.
Extending Prior Work
Longitudinal evaluation absent in static benchmarks like MMLU and HumanEval
Multi-dimensional scoring beyond single aggregate metrics or task-specific accuracies
Adaptive difficulty scaling to measure cognitive bounds and prevent saturation
Architecture-agnostic evaluation preventing bias toward specific model paradigms
Research Directions
Multi-agent collaborative reasoning environments for theory-of-mind assessment
Uncertainty quantification and epistemic calibration as cognitive dimensions
Cross-modal transfer between code, mathematics, and natural language reasoning
Extended temporal windows (months to years) for long-term coherence studies
Acknowledgments
This research framework was developed by the Blankline Research Team, a collaborative initiative dedicated to advancing the scientific understanding of artificial intelligence capabilities through rigorous evaluation methodologies.
We extend our gratitude to the partner institutions, domain experts, and open-source community members who contributed task datasets, evaluation protocols, and validation studies. This work builds upon decades of research in cognitive science, machine learning, and benchmark design; we acknowledge the foundational contributions of researchers who established the principles that inform AGCI's framework.
Blankline Research
Advanced AI Research Division