TLDR: There is a single, blinking cursor that defines our modern interaction with information. It sits in an empty rectangular box, waiting.
For two decades, search interfaces had clear affordances. Boolean operators, date filters, and “Advanced Search” screens signaled exactly what a database could and couldn’t do.
The new era of Agentic AI search has traded these constraints for infinite potential, replacing the dashboard with a blank text box. But this minimalism is deceptive. It has created a massive usability crisis I call the Blank Box Problem, trapping users in a double bind:
Input Ambiguity: We don’t know how to speak to the machine (Keywords vs. Natural Language vs. Engineered Prompts).
Capability Ambiguity: We don’t know what the machine can actually do with our words (Search / Filter / Agentic tasks).
Traditional academic search interfaces were rich in affordances: Boolean operators signalled combinatorial logic, field codes revealed searchable metadata, dropdown menus exposed available filters, and facets signalled post-filter options. Most importantly, nobody in their right mind would expect such a system to act like a research assistant, or ask it to perform tasks requiring a complicated series of operations.
Then came ChatGPT, Perplexity, and the new era of agentic AI search. The boxes look the same, but the rules of engagement have shattered. The promise is seductive: we can now tell the AI search engine exactly what we want without pressing buttons or learning syntax.
We now face a usability crisis: we no longer know what to input into the search bar because the “Blank Box” hides the mechanics of the machine. Worse, we are caught in a Double Ambiguity:
Input Ambiguity: We don’t know how to speak to the machine (Keywords vs. Natural Language vs. Prompts).
Capability Ambiguity: We don’t know what the machine can actually do with our words (Search vs. Filter vs. Agentic tasks).
As I noted in my last blog post, vague use of terms like “AI-powered search engines” fuels this confusion.
There’s a certain irony here. For decades, librarians lamented that users didn’t understand Boolean operators, truncation symbols, or field codes (or just didn’t care to use facets). Now we’ve built systems that accept natural language—and users are more confused than ever about what they can actually do.
Do I talk like a caveman, a human, or a Sorcerer?
For twenty years, we were trained to speak a specific pidgin language when using academic databases1: “keyword search”. We stripped away grammar (e.g., “of”), articles (e.g., “the”), and politeness2. We didn’t ask, “Is there an open access citation advantage?” Instead we just typed: open access citation advantage. It was unnatural, but it was deterministic. We knew that if we typed X, the machine would look for X.
Now, the rise of “AI search” or semantic search suggests we might want or need to use other modes, but at this stage of the transition it is hard to know when to abandon keyword search.
The current state of AI search forces users to gamble on three different input modes, often with no indication of which triggers the best results:
**The Caveman (Keyword Search)3:** The old reliable. Predictable, and typically higher precision the longer the query, but it can fall prey to the “vocabulary mismatch problem”.
**The Colleague (Natural Language):** The promise of “ask me anything.” You talk to the AI like a colleague. But does it understand complex intent, or is it just dropping stop words and extracting keywords from your sentence?
**The Sorcerer (Prompt Engineering):** The magic incantations. To extract value, we are told we must provide context, personas, constraints and even weird incentives that supposedly help the black box of an LLM, e.g. “Act as a senior bibliometrician and find... I will give you $500 if you do well.”
Because the interface is opaque, users might be developing folk theories about how to make it work. We add phrases like “take a deep breath” or “think step-by-step” not because we understand the system architecture, but because we heard it works well with GPT-3.5 or other older models, not caring that some of this advice is outdated for reasoning models and, more importantly, that academic AI search is a complicated system of different parts, not just an LLM. We have moved from the deterministic logic of Boolean operators to the almost superstitious rituals of Prompt Engineering.
Each mode carries different assumptions about how the system processes input. Keywords assume something like traditional information retrieval—matching terms, perhaps with query expansion. Natural language assumes the system understands semantic intent via embeddings. Prompt engineering assumes the system can follow complex instructions, maintain context, and execute multi-constraint searches.
The core problem is architectural opacity - users have no way to know which mode a given system expects or handles best. A system optimised for keyword matching may perform poorly with verbose natural language prompts. A system designed for conversational queries or one having limited capabilities (see later) may strip out or rewrite carefully crafted constraints. Users may sometimes learn this only through failed searches and inconsistent results4.
Choosing the wrong input mode for a given system can meaningfully degrade results.
Semantic Scholar is a case in point. As I showed in my last blog post, despite the name, its primary retrieval is lexical, not semantic.
Semantic Scholar’s primary retrieval pipeline when you search on the website
If you query it with natural language sentences expecting embedding-based retrieval, you’ll miss most of the relevant literature compared to a keyword search without stop words.
The screenshot below shows that treating Semantic Scholar like a natural language search engine returns only 13 results, missing a great deal of relevant literature compared to a keyword search.
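If you want to check this pattern yourself, a quick and rough way is to compare the hit counts Semantic Scholar reports for a “Colleague” phrasing versus a “Caveman” phrasing of the same information need. The sketch below is illustrative, not part of my original testing; it assumes the public Graph API endpoint `/graph/v1/paper/search` and its `total` field behave as documented (unauthenticated requests are rate-limited).

```python
import requests

S2_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"

def count_results(query: str) -> int:
    """Return the total hit count Semantic Scholar reports for a query."""
    resp = requests.get(S2_SEARCH, params={"query": query, "fields": "title", "limit": 1})
    resp.raise_for_status()
    return resp.json().get("total", 0)

# "Colleague" phrasing vs. "Caveman" keywords for the same information need.
natural = "Is there an open access citation advantage for journal articles?"
keywords = "open access citation advantage"

for q in (natural, keywords):
    print(f"{count_results(q):>8}  {q}")
```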
Conversely, using “Caveman” keywords on a system designed for agentic reasoning (like the new Deep Research tools) might not be a total disaster but wastes the machine’s potential to understand nuance and counter-arguments.
If you read all this and are confused (you are not an information retrieval engineer!), you can be sure your users are too.
Even if we figure out how to ask, we hit a second wall: we don’t know what the system can actually do (Capability Ambiguity). When an AI search fails to deliver what users expect, the failure can occur in three distinct ways—and the AI blank box provides no way to determine which.
Search relevance engineer Doug Turnbull has argued that the AI search industry fundamentally misframed the problem by building around vector similarity alone. Because users interact with LLMs in natural language, developers assumed the underlying retrieval should also use natural language similarity—encoding queries as embeddings and finding semantically similar passages on a spectrum.
But this is not always the right abstraction. Users don’t only want semantic similarity. They often want what Turnbull calls selectors, or what you can think of as binary buckets: structured filters on specific binary attributes. When a researcher asks for “peer-reviewed papers on CRISPR ethics published after 2020,” they’re expressing a query with clear selectors: document type (peer-reviewed papers), topic (CRISPR ethics), date range (post-2020). They expect the system to filter on these attributes or buckets, not merely find passages that are “close” in some abstract embedding space.
Some academic AI search tools—particularly earlier versions of Elicit5—relied heavily on vector embeddings for retrieval and ranking with no way to apply pre-filters.
So why not combine vector embedding retrieval with hard filtering? The issue is that combining top-K vector search with attribute filtering (e.g., filtered HNSW) is technically difficult, though newer techniques like ACORN are making it more tractable6.
Today, some AI search tools like Consensus and Elicit offer numerous pre-filter options, while others like Primo Research Assistant and Scopus AI offer far fewer filter options than their “parent” non-AI search indexes.
When users face a blank chat box with no affordances like pre-filter checkboxes or checklists, how are they to know whether they can ask in natural language to filter by, say, publication year?
What I can say is that the lack of affordances does not always mean the system can’t do any filtering.
For example, Undermind.ai can get you papers by publication year via natural language querying, but it cannot filter reliably by publication type because the underlying index (Semantic Scholar) does not even have this metadata field.
Yet in both cases, Undermind will usually not warn you that what you are asking for isn’t supported; it simply tries anyway and produces a result without an error message.
Even when a system’s basic search index does support structured constraints—date ranges, document types, citation thresholds—a separate question arises: does the natural language interpretation layer recognise when users are invoking them?
Consider common search constraints expressed naturally:
“from 2000 to 2010” or “published in the last five years”
“sorted by most cited” or “most recent first”
“only peer-reviewed articles” or “excluding conference papers”
“English language only”
“open access articles”
The underlying search infrastructure might even support all of these filters. But the natural language layer—the part that parses your query7 and decides what to do with it—may or may not recognise these requests.
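To make that failure mode concrete, here is a minimal sketch of what such an interpretation layer might look like. Everything in it is illustrative: `call_llm`, the prompt, and the filter names are hypothetical, not any vendor’s actual implementation. The point is that this layer can only map requests onto whatever filters its backend exposes, and ideally it should say so when it can’t.

```python
import json

# Hypothetical stand-in for whatever LLM call the search tool makes internally.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

# Filters this (imaginary) backend can actually enforce.
SUPPORTED_FILTERS = {"topic", "year_from", "year_to", "peer_reviewed", "open_access", "language"}

PARSE_PROMPT = (
    "Extract search constraints from the user's query as JSON with keys: "
    "topic, year_from, year_to, peer_reviewed, open_access, language, sort. "
    "Use null for anything not mentioned.\nQuery: {query}"
)

def parse_query(query: str) -> dict:
    raw = call_llm(PARSE_PROMPT.format(query=query))
    parsed = {k: v for k, v in json.loads(raw).items() if v is not None}
    # The honest move: surface constraints the backend cannot enforce,
    # instead of silently dropping them and returning "something" anyway.
    unsupported = sorted(set(parsed) - SUPPORTED_FILTERS)
    if unsupported:
        print(f"Note: cannot apply {unsupported}; they will be ignored.")
    return {k: v for k, v in parsed.items() if k in SUPPORTED_FILTERS}
```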
Consider a user entering a specific request like:
“Show me peer-reviewed articles from 2010 to 2015 about machine learning, ordered by most cited.”
This seems like a clear instruction. However, depending on which AI Research Assistant the user faces, the result is wildly different:
The “Deep Parser” (e.g., Web of Science Research Assistant): This tool is built to understand a ton of specific metadata triggers (if a filter or sort option appears in Web of Science you can likely trigger it using natural language!). It recognizes “2010-2015” not just as text, but as a command to apply a Publication Date Filter. It understands “ordered by most cited” as a specific Sort Function. The AI successfully translates this and many other filters from natural language into structured database queries.
The “Surface Skimmer” (e.g., Primo Research Assistant): This tool might process the same query very differently despite being from the same company. It can “understand” the input well enough to set the publication year range, but does it know what “peer-reviewed” means as a metadata filter (spoiler: it does!), or will it just look for the phrase “peer-reviewed” in the text? The user has no way of knowing. And can it sort by citation counts? (Spoiler: it cannot.)
More recently, Primo NDE8 introduced “natural language search”. Here’s what the documentation says:
The Natural Language Search feature enables users to formulate queries in normal spoken language, and automatically converts them into the structured format compatible with Primo’s Advanced Search. For example, the user could enter the query, “Find me US history journals in English that are available online,” and the system would create a query with the appropriate criteria for the search.
So far, similar (use of LLM to translate to search strategy) to what we see in Web of Science Research Assistant, Primo Research Assistant, EBSCO Natural Language Search etc.
Not only does the Natural Language Search generate queries from free text, it also identifies certain catch words in the text that can be used to define the scope and automatically select the appropriate filters for the search. For example, if the term “journal” appears in the text, the scope of the search is limited to journals, and if a language is specified, the language filter is automatically turned on.
The transformation of the original query into the structured format is performed using generative AI, via ChatGPT 4.1 Mini. The elements of the query, such as resource types, date filters, full text preferences, language, and advanced fields like Title or Subject, are classified by the AI system, which uses this information to generate the basic Boolean query and then expands it using related concepts. Ambiguous inputs are handled intelligently by mapping to multiple fields (e.g., both Title and Subject) to ensure broad yet accurate retrieval.
Ah, so it can recognise filters in your natural language query, which is invoked by clicking the “Ask Anything” button.
Here’s the problem. Which filters can you use purely by asking in natural language? You won’t know without tons of trial and error or reading the documentation.
I would wager many people might think anything that is a filter in Primo NDE could be invoked by natural language queries. In fact, while a lot of filters and facets can be invoked this way (e.g. resource type, open access, available online, peer-reviewed, held by library, language, creator), some cannot (e.g. subject, collection, etc.).
For example, while testing, I was initially encouraged by the fact that most of the filters in the Primo NDE instance I was testing could be invoked—e.g. the following work correctly:
Find me open access papers on large language models
Find me peer-reviewed papers on large language models
Find me papers on large language models in Spanish
Then I tried to invoke the Subject filter with
Find me papers on large language models in <subject as per Primo NDE filter values>
or the sorting function
Find me papers on large language models and sort it by date-newest9
But this time it would neither invoke the subject filter nor sort by date-newest. I tried different variants of the query input to no avail, and I guessed it probably wasn’t supported, but you could never be sure.
Of course, when you look at the documentation, you realise it could never have worked, as it wasn’t a supported filter type.
Because the interface is identical—an empty box—users cannot distinguish between a tool that respects metadata constraints and one that merely performs semantic fuzzy search and always returns the closest top K results even if it ignores the filter constraints.
The third point concerns what operations the system can perform and whether it can combine them.
Consider this research task: “Find me papers referenced by Article X, then identify related papers that could have been cited but weren’t.”
In the old days of conventional search, you would never have expected this to work. But with all the talk of agentic search and AI-powered research assistants, users may now expect it to.
While some agentic systems can indeed do it, this requires running multiple operations in sequence: retrieve the article, extract its reference list, search for semantically related papers, compare the sets, identify gaps. Still, it’s a reasonably simple research workflow.
But AI search tools vary dramatically in whether they can execute it. In my testing, “deep research” or agentic search tools like Elicit, Undermind, Scite Assistant, and Consensus failed at this task, despite many having citation searching or parsing capability—probably due to predetermined workflows.
Agentic systems with genuine compositional capability can accomplish this: fetch the article, parse references, run searches, synthesise results—potentially working with a sequence of steps that wasn’t predetermined by the designers.
Systems with fixed workflows like AI2’s Asta (formerly AI2 Paper Finder), Undermind.ai, Elicit, and Consensus implement predetermined pipelines: “literature search,” “citation chasing,” “evaluation of results” or “summarisation” in a fairly fixed sequence. The LLM decides which pre-built flow to invoke, but it cannot compose novel sequences. The operations exist in isolation, not as combinable primitives.
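To make the architectural difference concrete, here is a minimal Python sketch. The tool stubs and the agent loop are entirely hypothetical; they show the shape of a fixed pipeline versus a compositional agent, not how any particular product is built.

```python
from typing import Callable

# Hypothetical tool stubs; a real system would back these with a search index,
# a citation graph, and an LLM.
def search(query: str) -> list[str]: ...
def get_references(paper_id: str) -> list[str]: ...
def compare_sets(cited: list[str], candidates: list[str]) -> list[str]: ...
def summarise(question: str, papers: list[str]) -> str: ...

# Fixed workflow: the designers chose the sequence; the LLM only fills in the slots.
def fixed_pipeline(question: str) -> str:
    papers = search(question)
    return summarise(question, papers)
    # No path exists here that fetches Article X, parses its reference list,
    # and diffs it against a fresh semantic search, however the query is phrased.

# Compositional agent: the LLM picks the next tool at run time, so novel
# sequences like "references of X, then find what wasn't cited" become possible.
TOOLS: dict[str, Callable] = {
    "search": search,
    "get_references": get_references,
    "compare_sets": compare_sets,
}

def agent_loop(task: str, choose_action: Callable) -> str:
    history: list[tuple] = []
    while True:
        # choose_action is the LLM deciding which tool to call next, or to stop.
        action = choose_action(task, history, list(TOOLS))
        if action["tool"] == "finish":
            return action["answer"]
        result = TOOLS[action["tool"]](**action.get("args", {}))
        history.append((action["tool"], result))
```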
The user staring at the search box has no way to know which category their tool falls into. They may assume the limitation is their own—that they haven’t found the right phrasing—when they’re hitting an architectural ceiling. This is perhaps the cruellest form of capability ambiguity: users engage in elaborate prompt engineering to unlock capabilities that don’t exist.
(For more details, see Deep Research, Shallow Agency: What Academic Deep Research Can and Can’t Do.)
Failures in most AI search tools are usually silent and indistinguishable from the user’s perspective. A search returning irrelevant results could be failing for any of the following reasons, and it is not easy for a user to tell which:
Whether they used the wrong input mode (e.g. using natural language query with a lexical search system)
Whether they phrased their query poorly within the correct mode (e.g. Use of wrong keywords in a lexical search system)
Whether their filters were recognised and applied (e.g. asking in natural language to “filter to review articles” when this isn’t a field the system can filter on, or when it can but the natural language parser isn’t set up to invoke it)
Whether the system lacks the workflow operations (e.g. do forward citations of paper X) they need to do the task.
Whether those operations (e.g. do forward citations of paper X) exist in the system but can’t be composed into their desired workflow
More dangerously, parts of the search could silently fail or be ignored. This is certainly true for embedding systems that simply find the closest matches, LLMs that rewrite your input query (e.g., Primo Research Assistant), and Deep Research tools (e.g., Undermind.ai). You will always get ‘something,’ even when your query is not correctly interpreted!
The most dangerous failure is the one users cannot see. Consider this query I ran in Primo Research Assistant:
Find articles on climate change in 2023 , review articles only and do forward citation search of top 10 papers
It happily ran with no trace of an error, despite not being able to filter by review articles, not to mention that it definitely couldn’t do anything like a forward citation search. The result is shown below.
The LLM in Primo Research Assistant rewrote the input into a Boolean search without the restrictions and ran it. Below is the rewritten query; notice what it dropped.
(climate change impacts) OR (global warming) OR (climate change adaptation) OR ((climate change) OR (global warming)) OR ((climate change) AND (environmental policy)) OR (”climate change”) OR ((global warming) OR (climate variability)) OR ((climate change) AND (mitigation)) OR ((greenhouse gases) AND (climate impact)) OR (environmental change) OR (climate change)
There is another, subtler failure. Primo Research Assistant can correctly parse query inputs like
Papers on X from 2023 to 2024,
but the query input
Papers on X in 2023
was interpreted as 2023 to 2025, as shown in the applied filter at the bottom of the screen capture!
You might then think: maybe if you just prompt in the right way, you could get it to filter to only 2023 (maybe by adding “Only”)10.
Here lies another problem: even if the user notices that the query input wasn’t followed, it might lead to “prompt thrashing”—users reformulating the same request in increasingly elaborate ways, hoping to unlock functionality that may not exist, or trying to fix something that query reformulation doesn’t (or can’t ever) address. They blame their query formulation when the actual limitation may be architectural.
Meanwhile, systems that do support complex operations provide no signal that such capabilities are available. A user who never thinks to ask “compare the citation networks of these two papers” will never discover that the system can do it.
Turnbull argues the solution lies in reconceiving the role of LLMs in search. Their power isn’t semantic similarity—it’s query understanding. An LLM can take free text ("suede, geometric couch") and produce a structured query with typed fields: style (geometric), material (suede), classification (Living Room / Seating / Sofas). Each field maps to an appropriate retrieval technique—visual embeddings for style, taxonomic matching for materials, hierarchical classification for categories.
This decomposition serves multiple purposes. It enables precision where precision is possible. It makes the system’s interpretation visible and correctable. And crucially, it provides affordances: users can see how their query was parsed and understand what refinements are available. The continuous similarity score of vector search offers no such guidance.
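Translated to the academic search case, a decomposed query might look something like the sketch below. The `ParsedQuery` fields and the `index.hybrid_search` call are assumptions for illustration only; the point is that each typed field is routed to the technique that suits it, with similarity used only where similarity actually helps.

```python
from dataclasses import dataclass

@dataclass
class ParsedQuery:
    topic: str                    # handled by lexical and/or embedding retrieval
    doc_type: str | None = None   # enforced as a hard metadata selector
    year_from: int | None = None  # enforced as a range filter, not "semantic closeness"
    sort: str | None = None       # applied after retrieval

def execute(parsed: ParsedQuery, index) -> list[dict]:
    # index.hybrid_search is a hypothetical retrieval call returning dicts
    # with "type", "year" and "citation_count" metadata attached.
    hits = index.hybrid_search(parsed.topic)
    if parsed.doc_type:
        hits = [h for h in hits if h["type"] == parsed.doc_type]   # selector, not similarity
    if parsed.year_from:
        hits = [h for h in hits if h["year"] >= parsed.year_from]  # selector, not similarity
    if parsed.sort == "citations":
        hits.sort(key=lambda h: h["citation_count"], reverse=True)
    return hits

# "peer-reviewed papers on CRISPR ethics published after 2020, most cited first"
query = ParsedQuery(topic="CRISPR ethics", doc_type="peer-reviewed",
                    year_from=2020, sort="citations")
```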
But most AI search interfaces don’t expose this architecture. They present a blank box and hope the system will figure out which parts of the input are filter requests, which are content queries, and which are workflow instructions; when it can’t, it fails silently.
Solutions exist, and some are already in use in existing products. I will list them roughly from most to least frequently seen.
**Sample prompts that demonstrate expectations.** It is common practice today to list sample queries below the search box, but example queries should be chosen with care to address both input mode and capability. Are examples terse keywords or elaborate instructions? Do they demonstrate filter syntax, complex operations, or both? Prompts like “Find highly cited papers on CRISPR ethics from 2020-2024” reveal that dates and citation thresholds are recognised; “Compare the methodology of Smith 2020 and Jones 2021” reveals that document comparison is available.
**Hybrid UI: the return of filters.** There’s no reason we can’t add pull-down menus and buttons to signal available affordances. But consistency matters—if the interface shows a dropdown for document type, users should be able to request the same filter in natural language.
**Type-ahead that reveals constraints.** As users type “find papers from...” the system could surface suggestions: “...from 2020 onwards,” “...from Nature,” “...from Harvard University.” This signals which constraint types are recognised while guiding users toward phrasings the system handles well.
**Explicit constraint confirmation.** Before executing a search, the system could display parsed constraints for confirmation: “Date: 2020-2024 | Type: peer-reviewed | Sort: citation count | Topic: CRISPR ethics.” Users see exactly what was recognised, can correct misinterpretations, and learn the system’s vocabulary through exposure.
**LLM query intent detection to block unsupported inputs.** Pretty much what it says on the tin: the LLM tries to interpret the intent of each query, and if it detects that a query is asking for something the tool isn’t designed to do, it stops.
**Expose the tool layer.** For systems built on protocols like MCP, surface available operations directly: “Search database X,” “Retrieve (forward) citations from Paper Y,” “Compare documents,” “Extract methodology sections.” When users see that citation network analysis isn’t among the available tools, they stop blaming their phrasing and adjust expectations—or choose a different system. Systems with fixed workflows should make clear what exactly is happening behind the scenes to help head off misunderstandings. (A rough sketch of what such a capability panel might look like follows this list.)
**Template libraries for common workflows.** Offer structured starting points for workflows: “Find gaps in citations,” “Compare methodology across papers,” “Trace influence of seminal work.” Each template pre-configures and exposes the right combination of tools while remaining editable for workflow transparency. See for example the SciSpace Agent Gallery.
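As a rough illustration of the “expose the tool layer” idea above, here is a minimal sketch of a capability panel. The manifest entries are invented; the point is simply that listing the callable operations next to the box tells users where the architectural ceiling is.

```python
# Hypothetical tool manifest a system could surface next to the chat box,
# instead of leaving users to guess what the agent can actually invoke.
TOOL_MANIFEST = [
    {"name": "search_papers",         "description": "Keyword/semantic search over the index"},
    {"name": "get_forward_citations", "description": "Papers that cite a given paper"},
    {"name": "get_references",        "description": "Reference list of a given paper"},
    {"name": "filter_results",        "description": "Filter by year, open access, peer review"},
]

def render_capabilities(manifest: list[dict]) -> str:
    """Plain-text capability panel: one line per operation the agent can call."""
    return "\n".join(f"- {t['name']}: {t['description']}" for t in manifest)

print("This assistant can:\n" + render_capabilities(TOOL_MANIFEST))
# Anything not on this list (e.g. "compare citation networks") is an
# architectural ceiling, not a prompting failure.
```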
These solutions address symptoms, but the underlying tension may be irresolvable. The promise of natural language search is liberation from formal query languages; the reality is that formal languages encoded precise information about both input expectations and system capabilities that natural language cannot easily replace.
We’re in a transitional period where AI search tools vary enormously across many dimensions—different retrieval abstractions, different filter recognition capabilities, different workflow architectures. Yet they present nearly identical blank text boxes that reveal nothing.
This results in a massive “Cost of Discovery.” In the past, if you learned how to use Scopus, you largely knew how to use Web of Science. Now, every AI wrapper has a unique, hidden architecture with varying capabilities.
*This places an unreasonable burden on users to discover through experimentation what each system expects and what each system can do.* The ambiguities compound: users cannot easily test systematically because there is so much uncertainty across the different subsystems of the search.
For librarians and information professionals, this creates both challenge and opportunity. The challenge is obvious: how do we teach search skills when the skills required vary by platform in ways that aren’t visible? Traditional database instruction could focus on Boolean logic and controlled vocabularies. AI search instruction must address input mode, retrieval method, filter recognition, and workflow composition—none of which are documented completely and consistently.
The opportunity is that this confusion creates demand for exactly the kind of critical evaluation that information professionals excel at. Mapping the capability landscape of AI search tools—not just feature lists but actual functional boundaries at each level—is work that needs doing. Understanding when embedding retrieval is in play and hence when natural language input is appropriate, when filter triggers are recognised, when workflows can be composed: this technical evaluation distinguishes informed tool selection from marketing-driven adoption or fear-driven resistance.
Until AI search interfaces evolve better affordances, we may need to return to an old practice: demanding that documentation exist for input expectations, filter vocabularies, retrieval mechanisms, and workflow capabilities—and that it be surfaced at the point of need rather than buried in help pages no one visits.
The blank search box feels like an invitation to ask anything. It’s actually a test at multiple levels—of whether you know how to ask, whether you know what you can ask for, and whether you can diagnose what actually went wrong when you don’t get what you need.
Most academic databases like Scopus enforced hard Boolean logic, followed by ranking of the matched set with TF-IDF/BM25. A rare few, typically academic web search engines like Google Scholar, were still lexical search but not strictly Boolean, employing just ranked retrieval with BM25.
I’ve used the “Caveman” analogy here for keyword search for a bit of fun. It isn’t meant to suggest the technique is outdated; both keyword and semantic search have their strengths. A “Programmer/Robot” comparison would be more suitable if nested Boolean were used. But we know in reality most users default to simple keyword searches, dropping the stop words and letting the magic of implicit AND do the work.
So what should you input? If you know roughly what is going on (and I will talk about how to roughly figure out what is going on under the hood in a future post), here is what you should do.
If your query goes directly to an embedding model for retrieval, natural language works well (better than keyword)—these models are trained on natural language and capture semantic similarity. That said, many modern systems use hybrid retrieval—combining embeddings with keyword-based methods like BM25—so well-chosen terms still matter, particularly for named entities or specialist vocabulary.
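For the curious, one common way hybrid systems fuse the two signals is reciprocal rank fusion. The sketch below is generic; whether any particular academic search tool uses RRF specifically is an assumption on my part.

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists (e.g. BM25 hits and embedding hits) into one.

    Classic RRF: each document scores sum(1 / (k + rank)) over the lists it
    appears in, so a paper that ranks well lexically OR semantically surfaces.
    """
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked lists from two retrievers for the same query.
bm25_hits = ["paperA", "paperB", "paperC"]
embedding_hits = ["paperC", "paperA", "paperD"]
print(reciprocal_rank_fusion([bm25_hits, embedding_hits]))
```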
As for prompt engineering, I’ve been sceptical about elaborate tactics—particularly those lifted wholesale from papers based on GPT-3.5/4 without independent testing. But that scepticism was grounded in my 2023/2024 understanding. Since then, LLMs have improved considerably through better tool use training and agentic orchestration, and may respond better to complicated prompts, so the picture is less clear.
Still, if an LLM sits between you and the retrieval layer—as in more advanced AI search tools—it will likely rewrite or expand your query before searching. I think elaborate phrasing is probably wasted effort; the system normalises your input anyway. What matters is clearly expressing your information need.
In agentic systems with iterative retrieval, the LLM’s system prompt and workflow architecture dominate. Your exact wording matters less; specifying constraints (date ranges, study types, specific populations) in plain language may help if the system parses them (see later), but this is system-dependent.
Practical advice: Describe what you want in clear natural language. Include specific terms the system must match. Don’t bother with elaborate prompt engineering tactics—either the embedding model won’t understand them, or the LLM will rewrite your input regardless.
Vector similarity of dense embeddings usually involve finding the top K most similar embeddings to the query embedding. With a large enough collection, it may be too slow to do a brute force search so systems employ approximate nearest neighbour (ANN) indices like HNSW (Hierarchical Navigable Small World). But these indices are optimised only for the distribution of the full dataset. For example, HNSW builds a hierarchical graph where nodes (vectors) are connected to their nearest neighbours across multiple layers, with navigation starting at sparse upper layers and refining at denser lower layers. During search, the algorithm greedily traverses edges towards the query vector. However, when filters exclude many vectors, the graph connectivity breaks—the greedy path may dead-end at filtered-out nodes, or the remaining filtered subset forms disconnected islands in the graph structure.
The index doesn’t "know" about filter constraints, so traversal routes optimised for the full dataset fail to efficiently find nearest neighbours within the filtered subset. Post-filtering (retrieve K, then filter) risks returning fewer than K results. Pre-filtering (filter, then search) means either rebuilding indices per filter (impractical) or using the mismatched global index (suboptimal recall). Recent work such as ACORN addresses this by selectively expanding search to second-hop neighbours (neighbours of neighbours) when immediate neighbours are filtered out, maintaining connectivity at the cost of additional distance computations.
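A toy brute-force illustration of the post- versus pre-filtering trade-off (random vectors, no ANN index, so it only shows the result-count problem, not the graph-connectivity one):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 64))           # toy document embeddings
years = rng.integers(1990, 2026, size=10_000)  # toy metadata
query = rng.normal(size=64)

def top_k(candidates: np.ndarray, ids: np.ndarray, k: int) -> np.ndarray:
    sims = candidates @ query                  # dot-product similarity on toy data
    return ids[np.argsort(-sims)[:k]]

K = 20
# Post-filtering: retrieve the K nearest overall, then drop anything outside the filter.
post = top_k(docs, np.arange(len(docs)), K)
post = post[years[post] >= 2023]               # often returns far fewer than K

# Pre-filtering: restrict to the filtered subset first, then take the K nearest.
mask = years >= 2023
pre = top_k(docs[mask], np.flatnonzero(mask), K)  # K results, if enough docs qualify

print(len(post), "results after post-filtering vs", len(pre), "after pre-filtering")
```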
Retrieval systems use several approaches to recognise filters in natural language. Rule-based pattern matching is cheap but brittle—it fails on paraphrases. Named entity recognition can tag journals and authors but requires domain-specific training. Modern systems increasingly use LLM-based query decomposition, parsing natural language into structured filters. This is more flexible but introduces new problems: LLMs may hallucinate filter capabilities (easy to verify, though), parse similar queries inconsistently, or silently treat filter requests as topical content.
Disclosure: Images are generated using Nano Banana Pro.
Primo NDE (New Design Experience) is a new interface refresh of Primo. It is not the same as Primo Research Assistant, which is a separate module you can access via Primo. Despite that, both Primo Research Assistant and the natural language search feature in Primo NDE use LLMs to translate query input into Boolean.
As noted already, this isn’t an absurd ask, since Web of Science Research Assistant by the same company does support this query!
So far as I can tell, there isn’t a way.