matheusfrancisco's Feed

Transforming Shape Schemas with Composable Property-Graph Queries (Extended Version)

Property graphs may be constrained by schemas that inform both query engines and human users about the shape of valid data, enforcing a contract between data provider and consumer. Composable property-graph queries transform input graphs into output graphs. Then, the question arises of which schema can be expected after one (or several) transformation steps. We investigate how schema constraints can be inferred given an input schema and a transf...

🗄️Vector Databases arxiv.org·

Query-aware Routing for Filtered Approximate Nearest Neighbors Search

Filtered ANN search, which combines vector similarity with attribute predicates, is a core primitive in modern vector databases and retrieval-augmented generation. We benchmark all major categorical filtered ANN methods across multiple datasets under three predicates and find that no single method dominates. Moreover, even within a single dataset and predicate type, the best method for a query can vary. Therefore, we propose a query-aware rout...

🗄️Storage Engines arxiv.org·

The Value of Adaptivity in LSM Bloom-Filter Tuning: A Log-Law and a Two-Clock Frontier

Log-structured merge (LSM) trees attach an approximate-membership filter to every run and must split a fixed memory budget across them. The static optimum is known (Monkey); a large systems literature then makes the allocation adaptive, tracking shifting hotness online. We ask a prior question: when is that adaptivity worth its machinery? We give three analytical answers and validate them on synthetic sweeps, real Twitter production cache traces...
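
To make the memory-splitting problem concrete, here is a minimal sketch of a static allocation across runs. It assumes the standard Bloom-filter approximation (false-positive rate of about exp(-bits_per_key * ln(2)^2) at the optimal hash count) and solves the budget constraint with a Lagrange multiplier, which gives each run a false-positive rate proportional to its size. The run sizes, the budget, and all function names are hypothetical; this is not the paper's method or code.

```python
import math

# Exponent constant for a Bloom filter that uses the optimal number of hashes.
LN2_SQ = math.log(2) ** 2


def false_positive_rate(bits_per_key):
    """Standard approximation of the Bloom-filter false-positive rate."""
    return math.exp(-bits_per_key * LN2_SQ)


def uniform_allocation(run_sizes, budget_bits):
    """Give every run the same number of bits per key."""
    per_key = budget_bits / sum(run_sizes)
    return [per_key * n for n in run_sizes]


def skewed_allocation(run_sizes, budget_bits):
    """Minimise the summed false-positive rate under a fixed budget.

    Setting the derivative of sum(exp(-c * m_i / n_i)) equal across runs
    yields fpr_i = t * n_i; t is fixed by the budget. This closed form is
    only valid while every resulting allocation is positive.
    """
    total = sum(run_sizes)
    weighted_logs = sum(n * math.log(n) for n in run_sizes)
    log_t = (-LN2_SQ * budget_bits - weighted_logs) / total
    return [-(n / LN2_SQ) * (log_t + math.log(n)) for n in run_sizes]


def expected_false_positives(run_sizes, bits):
    """Expected number of wasted run probes for a lookup that finds nothing."""
    return sum(false_positive_rate(b / n) for n, b in zip(run_sizes, bits))


# Hypothetical example: three runs, 10 bits per key on average.
RUN_SIZES = [1_000, 10_000, 100_000]
BUDGET_BITS = 10 * sum(RUN_SIZES)

uniform = uniform_allocation(RUN_SIZES, BUDGET_BITS)
skewed = skewed_allocation(RUN_SIZES, BUDGET_BITS)
print(expected_false_positives(RUN_SIZES, uniform))
print(expected_false_positives(RUN_SIZES, skewed))
```

In this toy setting the skewed split spends more bits per key on the small runs and fewer on the largest one, and its summed false-positive rate comes out lower than the uniform split for the same budget.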

⚡Concurrency arxiv.org·

A Multimodal Machine Learning Framework for Enterprise Database Workload-Aware Root Cause Analysis

Root cause analysis for enterprise database incidents is often a manual and time-consuming process that requires operators to inspect logs, performance metrics, and workload behavior. Existing approaches commonly focus on a single source of evidence, which limits their ability to capture the broader operational context behind incidents such as CPU saturation, I/O bottlenecks, lock contention, deadlocks, and slow query execution. This paper prese...

🗃️Databases arxiv.org·

AgenticDB: Agentic Performance Reconfiguration for Database Workloads

Database configuration tuning is critical for workload performance, but practical tuning on real deployments remains difficult. Existing automatic tuners mostly formulate tuning as iterative search over DBMS knob values. This formulation leads to high execution cost, prematurely narrows the configuration space, and leaves practical requirements insufficiently addressed: diagnosing runtime bottlenecks from system feedback, exploring OS-level reco...

🗃️Databases arxiv.org·

Policy-aware Vector Search: A Vision for Fine Grained Access Control in Vector Databases

Vector databases are increasingly used in security-sensitive contexts with Retrieval Augmented Generation and organizational AI pipelines; however, their security capabilities remain limited. Specifically, Fine-grained Access Control (FGAC), which is required to ensure that data access adheres to user-specific policies, is not fully supported in modern vector databases. Unlike relational databases, vector databases combine structured and unstructu...

🗄️Database Design arxiv.org·

When Does q-error Predict Plan Regret? Three Regimes of Cardinality-Estimation Error

Cardinality-estimation (CE) research ranks estimators by q-error, yet it is well known that q-error is an imperfect proxy for query-plan quality. We give a measurement-driven account of when it is a good proxy and when it is not, and why. Modeling plan selection as an argmin over a piecewise-linear cost landscape, we find that plan regret (the cost of the chosen plan relative to the optimal, under true cardinalities) is governed by plan-cost geo...
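
As a rough illustration of why q-error and plan regret can diverge, the sketch below picks a plan by argmin over linear cost functions and evaluates the chosen plan under the true cardinality. The q-error definition, max(estimate/truth, truth/estimate), is the usual one; the two plans, their cost coefficients, and the function names are hypothetical and are not taken from the paper.

```python
# Hypothetical plans, each with a linear cost: fixed + per_row * cardinality.
PLANS = {
    "index_nested_loop": (10.0, 5.0),
    "hash_join": (1000.0, 1.0),
}


def q_error(estimate, truth):
    """Multiplicative estimation error; both values are clamped to >= 1."""
    est = max(float(estimate), 1.0)
    tru = max(float(truth), 1.0)
    return max(est / tru, tru / est)


def plan_cost(plan, cardinality):
    fixed, per_row = PLANS[plan]
    return fixed + per_row * cardinality


def choose_plan(cardinality):
    """Argmin over the plan costs at the given cardinality."""
    return min(PLANS, key=lambda plan: plan_cost(plan, cardinality))


def plan_regret(estimate, truth):
    """Cost of the plan chosen from the estimate, relative to the best plan,
    with both costs evaluated at the true cardinality."""
    chosen = choose_plan(estimate)
    best = choose_plan(truth)
    return plan_cost(chosen, truth) / plan_cost(best, truth)


# Same q-error (2.0), different outcomes: the two plans cross at 247.5 rows.
print(q_error(100, 200), plan_regret(100, 200))  # no plan change
print(q_error(200, 400), plan_regret(200, 400))  # estimate crosses the boundary
```

In this toy landscape an identical q-error costs nothing when estimate and truth fall on the same side of the crossover, and costs real work when they straddle it.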

🧠Transformers arxiv.org·

SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm

Multimodal foundation models have advanced rapidly thanks to large optical benchmarks, but comparable resources for synthetic aperture radar (SAR) remain limited. Existing SAR-optical datasets largely rely on low-resolution, intensity-only Ground Range Detected (GRD) products and do not preserve complex-valued SAR measurements or native acquisition geometry, which restricts physically grounded multimodal learning. In particular, large-scale pub...

🧮Datalog arxiv.org·

Directory-Aware Query and Maintenance in Vector Databases

Vector databases typically manage metadata as flat scalar attributes, which limits their ability to express hierarchical directory semantics commonly used to organize code repositories, enterprise documents, and agent memories. As a result, directory-scoped retrieval and structural updates are often implemented as application-layer workarounds, making recursive scope resolution expensive and directory maintenance difficult to keep consistent. Th...

📊Graph Databases arxiv.org·

From Embedded Properties to Trait Nodes: A Design Method for Identifying Reusable Metadata in Property Graph Schemas

Property-graph schemas often contain descriptive properties that recur across heterogeneous nodes and edges, yet schema designers lack a clear method for deciding whether such properties should remain embedded or be treated as reusable metadata structures. This paper addresses this design-stage problem within a 5GNF-oriented modeling perspective by proposing a method for identifying metadata candidates based on five criteria: cross-element occur...

🗄️Vector Databases arxiv.org·

Revisiting Filtered ANN Benchmarks: A Hardness-Controlled Benchmark Generator for Realistic Evaluation

Filtered approximate nearest neighbor (FANN) search must satisfy both vector similarity and structured predicates, yet evaluations remain brittle because real hybrid workloads are rarely shareable and existing benchmarks rely on ad-hoc synthetic or semi-real constructions. We argue that realism hinges on execution-driven query difficulty: failures in early filtering trigger over-fetching of additional candidates, shaping latency, throughput, and...

🗄️Database Design arxiv.org·

PLRTune: Importance Pre-Sampling and LLM-Guided Reinforcement Learning for Automatic Database Tuning

Configuration tuning is critical to database performance, yet automatic database tuning remains challenging due to high-dimensional knob spaces, substantial online tuning cost, unreliable textual hints derived from Large Language Models (LLMs) or community documents, and the difficulty of exploiting the remaining optimization room after initialization. Hence, we propose PLRTune, a staged database tuning system that leverages workload-specific do...

🗃️Databases arxiv.org·

Improved Join Order Optimization for Database Queries using Hybrid Quantum-Classical Approaches for QUBO Problems

Efficient query optimization is crucial for relational database systems, especially for optimizing join orders in complex queries. This work introduces a hybrid approach that integrates Eliminating Cartesian Products (ECP) with splitting the QUBO search space (SQSS) to reduce the size of the QUBO problem, minimizing binary variables and constraints. This improves the performance of the quantum algorithm while lowering hardware requirements. We e...

🗄️Database Design arxiv.org·

Group Commit Self-Clocks: Why Tuning Is Unnecessary Above a Device-Set Load Threshold

Group commit amortizes the fixed cost of a durable log flush across many committing transactions; the release rule (a timer, a batch size, or an adaptive policy) is a classic tuning knob. The textbook theory is open-loop: for Poisson arrivals the optimal timer is the EOQ square-root rule, and the wait-or-flush decision is ski-rental 2-competitive. We ask when that tuning is worth its machinery, and show that in closed-loop OLTP it usually is n...

🗄️Vector Databases arxiv.org·

Filtered ANN as a Phase Transition: When Selectivity-Estimation Error Causes Plan Regret

A filtered approximate-nearest-neighbor (ANN) query returns the k nearest vectors among those satisfying an attribute predicate P of selectivity s. The best execution strategy (pre-filter, post-filter, or in-filter) changes with s, so a system must estimate s and choose. We model this as an argmax over a landscape with phases (regions where each strategy wins) separated by boundaries, and show that selectivity-estimation error produces plan ...
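
The strategy choice described here can be sketched with a toy cost model. Everything in it is an assumption made for illustration: pre-filtering is charged one unit per matching vector (a brute-force scan of the filtered set), post-filtering is charged for over-fetching roughly k/s candidates from the index, in-filter is left out, and the choice is written as a cost argmin rather than the abstract's argmax. The constants and function names are hypothetical.

```python
N_VECTORS = 1_000_000            # hypothetical collection size
K = 10                           # neighbours requested
ANN_COST_PER_CANDIDATE = 50.0    # hypothetical cost of fetching one index candidate


def strategy_costs(selectivity):
    """Toy cost of each strategy at a given predicate selectivity s."""
    return {
        # Scan only the vectors that satisfy the predicate.
        "pre_filter": selectivity * N_VECTORS,
        # Over-fetch about k / s candidates so that k survive the filter.
        "post_filter": ANN_COST_PER_CANDIDATE * K / selectivity,
    }


def choose_strategy(selectivity):
    costs = strategy_costs(selectivity)
    return min(costs, key=costs.get)


def plan_regret(estimated, true):
    """Cost of the strategy picked from the estimate, relative to the best
    strategy, with both costs evaluated at the true selectivity."""
    true_costs = strategy_costs(true)
    chosen = choose_strategy(estimated)
    return true_costs[chosen] / min(true_costs.values())


# The two strategies cross near s = 0.0224 in this model.
print(choose_strategy(0.01), choose_strategy(0.05))
print(plan_regret(0.01, 0.02))    # 2x estimation error, same phase
print(plan_regret(0.02, 0.025))   # 1.25x estimation error, crosses the boundary
```

In this model a twofold estimation error inside one phase costs nothing, while a much smaller error that straddles the boundary picks the wrong strategy.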